
Research Article

Beijing Opera Synthesis Based on Straight Algorithm and Deep Learning

XueTing Wang,1 Cong Jin,2 and Wei Zhao1

1College of Science and Technology, Communication University of China, Beijing, China
2Key Laboratory of Media Audio & Video, Communication University of China, Beijing, China

Correspondence should be addressed to Cong Jin; jincong0623@cuc.edu.cn

Received 3 April 2018; Accepted 20 May 2018; Published 17 July 2018

Academic Editor: Yong Luo

Copyright © 2018 XueTing Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Speech synthesis is an important research topic in the field of human-computer interaction and has a wide range of applications. As one of its branches, singing synthesis plays an important role. Beijing Opera is a famous traditional Chinese opera and is called the Chinese quintessence. The singing of Beijing Opera carries some features of speech, but it has its own unique pronunciation rules and rhythms which differ from ordinary speech and singing. In this paper, we propose three models for the synthesis of Beijing Opera. First, the speech signals of the source speaker and the target speaker are analyzed with the straight algorithm. Then, through the training of a GMM, we build the tone control model, which takes the voice to be converted as input and outputs the converted voice. Finally, by modeling the fundamental frequency, duration, and spectrum separately, a melody control model is constructed using a GAN to realize the synthesis of Beijing Opera fragments. We connect the fragments and superimpose the background music to achieve the synthesis of Beijing Opera. The experimental results show that the synthesized Beijing Opera has reasonable audibility and can basically complete the composition of Beijing Opera. We also extend our models to human-AI cooperative music generation: given a target voice of a human, we can generate a Beijing Opera piece sung by that new target voice.

1. Introduction

With the development of the times and the continuous innovation of science and technology, the demand for speech synthesis [1] is no longer limited to ordinary speaking but extends to special voices such as singing and poetry recitation. It is undoubtedly ingenious and novel to apply the method of singing synthesis [2] to Beijing Opera. Known as the quintessence of Chinese culture, Beijing Opera is one of the most famous traditional operas in China, and since its birth at the end of the 18th century it has been favored by Chinese people and by the people of other countries in East Asia. Beijing Opera has a long history and rich cultural connotation. In addition to the exquisite stage performances and vivid story plots, the music and singing of Beijing Opera are of great artistic value. In particular, its unique style of singing shows the extraordinary creativity of the Chinese nation and embodies the superb skills of traditional artists. It therefore makes sense to use the straight algorithm, GMM, and GAN to synthesize Beijing Opera.

The synthesis of Beijing Opera consists of three steps, shown in Figure 1. First, voice conversion is carried out with the straight algorithm; then the synthesis of Beijing Opera fragments is achieved through the tone control model and the melody control model. Finally, we connect the fragments and superimpose the background music to achieve the synthesis of Beijing Opera.

2. Synthesis of Beijing Opera with Straight Algorithm

2.1. Phoneme

2.1.1. Phoneme Profile. The phoneme is the smallest unit of speech: the smallest piece of speech that constitutes a syllable, and the smallest linear speech unit divided from the perspective of sound quality. From a physiological point of view, one articulatory movement forms a phoneme. Phonemes are divided into two categories, vowels and consonants. Their classification is based on whether the airflow is obstructed by the vocal organs when the sound is produced by humans: the unobstructed sounds are called vowels and the obstructed ones are called consonants.

Figure 1: Beijing Opera synthesis.

Figure 2: Time-domain waveforms, energy graphs, and zero-crossing rate graphs.

2.1.2. Phoneme Segmentation. Because instances of the same phoneme share the same characteristics while different phonemes and their combinations have different characteristics, each phoneme can be segmented. The time-domain waveform, energy graph, and zero-crossing rate graph of "jiao Zhang Sheng yin cang zai qi pan zhi xia" from "Matchmaker's Wall", sung in Beijing Opera style, are shown in Figure 2. From this we can see that the consonant phonemes of the initials have more irregular waveforms, while the finals that follow them have periodic waveforms. The former show a large zero-crossing rate and low energy; for the latter, the energy is mostly larger. In addition, during silence both values are small (the red lines mark the beginning and end of each word).
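As a rough illustration of how the two time-domain features used here can be computed, the following Python sketch calculates short-time energy and zero-crossing rate per frame; the frame length and hop size are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def short_time_energy_zcr(signal, frame_len=256, hop=128):
    """Per-frame short-time energy and zero-crossing rate of a 1-D signal."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len].astype(float)
        energies.append(np.sum(frame ** 2))                       # short-time energy
        zcrs.append(np.sum(np.abs(np.diff(np.sign(frame)))) / 2)  # zero crossings
    return np.array(energies), np.array(zcrs)

# Heuristic reading, as described above: initials tend to show high ZCR and low
# energy, finals (vowels) show higher energy, and silence shows low values of both.
```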

2.2. Selection and Method of Characteristic Parameters

2.2.1. Choice of Personality Characteristics. Whether in Beijing Opera or in ordinary voice, the speaker's personal habits and pronunciation style differ from person to person, and the speaker's position (or the role played by different actors in Beijing Opera) also leads to small differences in how each person handles each phoneme. Generally speaking, the parameters that characterize a speaker's personality are the segmental (syllabic), suprasegmental, and linguistic features [3, 4].

Segmental (syllabic) features: they describe the timbral characteristics of speech. The characteristic parameters mainly include the positions of the formants, the bandwidths of the formants, the spectral tilt, the pitch frequency, and the energy. Segmental features are mainly related to the physiological and phonetic characteristics of the vocal organs and also to the speaker's emotional state. The features used in the tone control model in Section 3 are mainly of this kind.

Suprasegmental characteristics: they mainly refer to the manner of speaking, such as the duration of phonemes, pitch, and stress; what people perceive is the speech rate and the changes of pitch and volume. The features used in the melody control model in Section 4 are mainly of this kind.

Language features: for example, idioms, dialects, accent, and so on.

However, Beijing Opera and ordinary speech differ in their purpose of pronunciation and expression. The pitch and duration of each word in Beijing Opera are controlled by the score in addition to the word's own pronunciation. Ordinary speech is mainly used to express the content of the utterance, whereas Beijing Opera expresses emotion more through melody. Based on the description of the above characteristics, the main factors considered in the sound quality mapping studied here are as follows.

Pitch: it is determined by the vibration frequency of the sound source over a period of time. The higher the vibration frequency, the higher the pitch, and vice versa. The pitch in Beijing Opera depends on the character role: LaoSheng, for example, is relatively low, while Dan is relatively high.

Sound length (duration): the length of a sound is determined by the duration of the sound source vibration. The longer the duration, the longer the sound, and vice versa. The average length of each word in Beijing Opera is relatively long, and its range of variation is relatively large.

Sound intensity: the strength of a sound depends on the vibration amplitude of the sound source; the greater the amplitude, the stronger the sound, and the smaller the amplitude, the weaker the sound. Since the amplitude in Beijing Opera is driven by strong emotion, its range is larger than that of ordinary speech. In general, ordinary speech has only a relatively small, uniformly distributed amplitude range.


Table 1: The correlations of subjective and objective quantities of speech.

Objective quantity     | pitch | volume | tone | duration
fundamental frequency  | +++   | +      | ++   | +
amplitude              | +     | +++    | +    | +
spectral envelope      | ++    | +      | +++  | +
time                   | +     | +      | +    | +++

Relevance is positively related to the number of '+' signs.

Tone (timbre): the frequency content of different sounds always has distinctive characteristics in the waveform. For example, when different Beijing Opera characters sing the same passage, they can be distinguished by the difference between the two timbres.

By combining the subjective quantities of speech with the objective quantities, the correlations in Table 1 can be obtained.

The acoustic characteristics of the speech signal are an indispensable research object for speech analysis and speech transformation. They are mainly manifested in prosody and spectrum. Prosody is perceived as pitch, duration, and volume; acoustically, it corresponds to the fundamental frequency, duration, and amplitude. The spectral envelope is perceived as the timbre characteristic.

2.2.2. MFCC Feature Extraction. MFCC stands for Mel Frequency Cepstrum Coefficient. It is based on properties of human hearing and is nonlinearly related to frequency in Hz; the Mel Frequency Cepstral Coefficients (MFCCs) use this relationship to compute the spectral features from the Hz spectrum. The extraction principle is as follows.

(1) Pre-Emphasis. Pre-emphasis processing passes the speech signal through a high-pass filter:

H(z) = 1 - \mu z^{-1}   (1)

The value of μ is between 0.9 and 1.0; we usually take 0.97. The purpose of pre-emphasis is to boost the high-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low to high frequency. At the same time, it compensates for the suppression of the high-frequency part of the speech signal by the vocal cords and lips during articulation and highlights the high-frequency formants.

(2) Framing. The first N sampling points are grouped into a unit of observation known as a frame. Under normal circumstances, the value of N is 256 or 512, covering about 20-30 ms. In order to avoid too large a change between two adjacent frames, there is an overlapping area between them; the overlapping area contains M sampling points, and the value of M is usually about 1/2 or 1/3 of N. The sampling frequency of speech signals used in speech recognition [5] is usually 8 kHz or 16 kHz. For 8 kHz, if the frame length is 256 samples, the corresponding time length is (256/8000) × 1000 = 32 ms.

(3) Windowing (Hamming Window). Each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame size, the windowed signal is S'(n) = S(n) × W(n). The form of W(n) is

W(n, a) = (1 - a) - a \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1   (2)

Different values of a produce different Hamming windows; in general a = 0.46 is used, giving

s'_n = \left[0.54 - 0.46 \cos\left(\frac{2\pi (n - 1)}{N - 1}\right)\right] \cdot s_n   (3)
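A minimal Python sketch of steps (1)-(3) above (pre-emphasis, framing, Hamming windowing); the coefficient 0.97, frame length 256, and a hop of half a frame are the typical values mentioned in the text, used here as assumptions.

```python
import numpy as np

def preemphasis(x, mu=0.97):
    """Apply H(z) = 1 - mu*z^-1 in the time domain, eq. (1)."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def frame_and_window(x, frame_len=256, hop=128):
    """Split into overlapping frames and apply a Hamming window, eqs. (2)-(3)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)          # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])
```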

(4) Fast Fourier Transform. Since the characteristics of a signal are usually difficult to see from its time-domain form, it is usually converted to the frequency domain to observe the energy distribution; different energy distributions represent different voice characteristics. Therefore, after multiplication by the Hamming window, each frame is subjected to a fast Fourier transform to obtain its spectrum, and the power spectrum of the speech signal is obtained from the modulus of the spectrum. The DFT of the voice signal is

X_a(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N   (4)

where x(n) is the input speech signal and N is the number of points of the Fourier transform.

(5) Triangular Bandpass Filter. The energy spectrum is passed through a bank of M Mel-scale triangular filters (the number of filters is similar to the number of critical bands); M usually takes 22-26. The spacing between adjacent center frequencies f(m) decreases as m decreases and broadens as m increases, as shown in Figure 3.

The frequency response of the triangular filter is defined as

H_m(k) =
\begin{cases}
0, & k < f(m-1) \\
\dfrac{2\,(k - f(m-1))}{[f(m+1) - f(m-1)]\,[f(m) - f(m-1)]}, & f(m-1) \le k \le f(m) \\
\dfrac{2\,(f(m+1) - k)}{[f(m+1) - f(m-1)]\,[f(m+1) - f(m)]}, & f(m) \le k \le f(m+1) \\
0, & k \ge f(m+1)
\end{cases}   (5)

(6) The logarithmic energy output of each filter bank is calculated as

s(m) = \ln\left[\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right], \quad 0 \le m \le M   (6)

(7) The MFCC coefficients are obtained by the discrete cosine transform (DCT):

C(n) = \sum_{m=0}^{N-1} s(m) \cos\left[\frac{\pi n (m - 0.5)}{M}\right], \quad n = 1, 2, \ldots, L   (7)

The above logarithmic energies are passed through the DCT to obtain the L-order Mel-scale cepstrum parameters. L is the MFCC coefficient order, usually 12-16; M is the number of triangular filters.

(8) Logarithmic Energy. In addition, the volume (i.e., energy) of a frame is also an important speech feature and is very easy to calculate. Therefore, the logarithmic energy of each frame is usually added, so that the basic speech features of each frame have one more dimension: one logarithmic energy plus the cepstrum parameters.

(9) Extraction of Dynamic Difference Parameters (First-Order and Second-Order Differences). The standard cepstrum parameters (MFCC) only reflect the static characteristics of the speech; the dynamic characteristics of the speech can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features can effectively improve the recognition performance of the system. The difference parameters can be calculated with the following formula:

d_t =
\begin{cases}
C_{t+1} - C_t, & t < K \\[4pt]
\dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{2 \sum_{k=1}^{K} k^2}, & \text{otherwise} \\[4pt]
C_t - C_{t-1}, & t \ge Q - K
\end{cases}   (8)

where d_t is the t-th first-order difference, C_t is the t-th cepstrum coefficient, Q is the order of the cepstral coefficients, and K is the time step of the first derivative, which can be 1 or 2. Substituting the result of the above equation again yields the second-order difference parameters.
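The remaining steps (4)-(9) can be sketched as follows: FFT power spectrum, mel-scale triangular filter bank, logarithm, DCT, and a simple first-order difference. The sampling rate, FFT size, and filter count are illustrative assumptions, and scipy's dct is used for step (7).

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=512, fs=8000):
    """Triangular filters with centers spaced evenly on the mel scale (step 5)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mfcc(frames, n_fft=512, fs=8000, n_ceps=13):
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2               # step 4: power spectrum
    energies = power @ mel_filterbank(n_fft=n_fft, fs=fs).T       # step 5: filter bank
    log_e = np.log(np.maximum(energies, 1e-10))                   # step 6: log energy
    ceps = dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]   # step 7: DCT
    delta = np.diff(ceps, axis=0, prepend=ceps[:1])               # step 9: 1st difference
    return np.hstack([ceps, delta])
```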

2.3. Signal Characteristics Analysis. According to previous research on speech signal processing technology, signal analysis mainly follows two approaches: analysis in the time domain and analysis in the frequency domain.

2.3.1. Time-Domain Analysis. In the time domain, the horizontal axis is time and the vertical axis is amplitude. By observing the waveform in the time domain, we can obtain some important features of the speech signal, such as the duration, the starting and ending positions of the syllables, the sound intensity (energy), and the vowels (see Figure 4).

2.3.2. Frequency-Domain Analysis. This includes the spectrum, power spectrum, cepstrum, spectral envelope, and so on of the voice signal. It is generally considered that the spectrum of the speech signal is the product of the frequency response of the vocal tract system and the spectrum of the excitation source, while both the frequency response of the vocal tract system and the excitation source are time-varying. Therefore, frequency-domain analysis of speech signals is often performed using the short-time Fourier transform (STFT), defined as

X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} x(m)\, w(n - m)\, e^{-j\omega m}   (9)

The Chinese song synthesis algorithm studied here is based on parameter modification. From (9) we can see that the short-time Fourier transform has two independent variables (n and ω), so it is both a discrete function of time n and a continuous function of angular frequency. In the formula, w(n) is a window function; as n takes different values, the window extracts different short speech segments, which is where the subscript n differs from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics.

(1) High frequency resolution: the main lobe should be narrow and sharp.

(2) Large side-lobe attenuation, so that spectral leakage caused by other frequency components is small. These two conditions are in fact contradictory and cannot be satisfied at the same time; therefore a compromise is usually adopted, and a Hamming window is often chosen.
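For illustration, a short-time Fourier transform with a Hamming window can be computed with scipy; the window length and overlap below are assumptions chosen for 16 kHz speech, not values specified by the paper.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                        # assumed sampling rate
x = np.random.randn(fs)           # placeholder: one second of noise instead of real speech
f, t, Zxx = stft(x, fs=fs, window='hamming', nperseg=512, noverlap=384)
# np.abs(Zxx) over the (f, t) grid is the magnitude spectrogram used in Section 2.3.3.
```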

However, both time-domain analysis and frequency-domain analysis have their limitations: time-domain analysis does not give an intuitive view of the frequency characteristics of the speech signal, and frequency-domain analysis lacks the variation of the speech signal over time. As a result, the Beijing Opera synthesis experiments analyze the speech signal with the improved method of spectrum analysis described next.

Figure 3: Mel frequency filter bank.

Figure 4: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" time-domain diagram.

2.3.3. Spectrum Analysis. The Fourier analysis display of the speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional representation of how the frequency spectrum of a voice changes over time, with the vertical axis as frequency and the horizontal axis as time; the intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the speech. It combines the characteristics of spectra and time-domain waveforms and clearly shows how the speech spectrum changes over time; in other words, it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.

Figure 5: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" spectrogram.

2.4. Straight Algorithm Introduction. Straight is an acronym for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram". It is a relatively accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The straight algorithm builds on the source-filter model: the source comes from the vocal cord vibration, and the filter refers to the vocal tract transfer function. It adaptively interpolates and smooths the short-time speech spectrum in the time domain and the frequency domain so as to extract the spectral envelope more accurately, and it can adjust the speech duration, fundamental frequency, and spectral parameters to a large extent without affecting the quality of the synthesized speech. The straight analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral parameter estimation, and speech synthesis. The first two are described in detail below; only the synthesis process is described here, in Figure 6.

First, the speech signal is input, and the fundamental frequency F0 and the spectral envelope are extracted by the straight algorithm; the parameters are then modulated to generate a new sound source and a time-varying filter. According to the source-filter model, we use (10) to synthesize the voice:

y(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, v_{t_i}(t - T(t_i))   (10)

where v_{t_i}(t) and T(t_i) are given by

v_{t_i}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} V(\omega, t_i)\, \varphi(\omega)\, e^{j\omega t}\, d\omega   (11)

T(t_i) = \sum_{t_k \in Q,\, k < i} \frac{1}{\sqrt{G(f_0(t_k))}}   (12)

In the formulas, Q represents the positions of a group of samples in the synthesis excitation and G represents the pitch modulation; the modulated F0 can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. V(ω, t_i) is the Fourier transform of the corresponding minimum-phase pulse, as in (13); A[S(u(ω), r(t)), u(ω), r(t)] is calculated from the modulated amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15).

V(\omega, t) = \exp\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} h_t(q)\, e^{j\omega q}\, dq\right]   (13)

h_t(q) =
\begin{cases}
0, & q < 0 \\
c_t(0), & q = 0 \\
2 c_t(q), & q > 0
\end{cases}   (14)

c_t(q) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-j\omega q}\, \lg A\{S[u(\omega), r(t)], u(\omega), r(t)\}\, d\omega   (15)

Here q is the frequency variable. Straight audiometry experiments show that, even with high-sensitivity headphones, the synthesized speech signal is almost indistinguishable from the original signal.

Figure 6: Straight synthesis system (input voice → F0 and spectral envelope extraction → parameter adjustment → sound source and time-varying filter → output synthetic speech).

3. Tone Control Model

Voice timbre conversion refers to processing a voice signal so as to keep the semantic content the same while changing only the timbre, so that one person's voice signal (the source voice) sounds, after conversion, like another person's voice (the target voice). This section introduces the extraction, with the straight algorithm, of the parameters closely related to timbre; the extracted parameters are then used to train a GMM to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are resynthesized with straight in order to achieve voice conversion. It can be seen from Section 2 that the timbre characteristics of speech mainly correspond to the parameters "fundamental frequency F0" and "channel (vocal tract) spectrum".

3.1. The Fundamental Frequency and Channel Spectrum Extraction

3.1.1. Extraction of the Fundamental Frequency. The straight algorithm has good time-domain resolution for the fundamental frequency trajectory. It is based on the wavelet transform: the frequency band containing the fundamental is first found from the analyzed audio, and then the instantaneous frequency is calculated as the fundamental frequency.

The extraction of the fundamental can be divided into three parts: coarse F0 positioning, F0 track smoothing, and fine F0 positioning. Coarse positioning of F0 refers to applying a wavelet transform to the voice signal to obtain the wavelet coefficients, which are then transformed into a set of instantaneous frequencies from which F0 is selected for each frame. F0 trajectory smoothing selects, based on the calculated high-frequency energy ratio (equivalent to the minimum noise energy), the most likely F0 among the instantaneous frequencies, thus constituting a smooth pitch trajectory. Fine positioning of F0 fine-tunes the current F0 through an FFT. The process is as follows.

The input signal is s(t) and the output composite signal is D(t, τ_c), where g_{AG}(t) is the analyzing wavelet obtained by passing the input through a Gabor filter and τ_c is the analysis period of the analyzing wavelet:

D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(t)\, g_{AG}\!\left(\frac{t - \mu}{\tau_c}\right) d\mu   (16)

g_{AG}(t) is given by (17), with g(t) as in (18):

g_{AG}(t) = g\!\left(t - \frac{1}{4}\right) - g\!\left(t + \frac{1}{4}\right)   (17)

g(t) = e^{-\pi (t/\eta)^2}\, e^{-j 2\pi t}   (18)

Here η is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.

Through this calculation, the variable "fundamentalness", denoted M(t, τ_0), is introduced:

M = -\log\left[\int_{\Omega} \left(\frac{d|D|}{du}\right) du\right] + \log\left[\int_{\Omega} |D|^2\, du\right] - \log\left[\int_{\Omega} \left(\frac{d \arg D}{du}\right)^2 du\right] + 2\log \tau_0 + \log \Omega(\tau_0)   (19)

The first term is the amplitude modulation (AM) value; the second term is the total energy, used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency, used to normalize the FM value; the fifth is the normalization factor of the time-domain integration interval. From the formula it can be seen that when AM and FM take their minimum, M takes its maximum, namely at the fundamental component.


However, in practice F0 always changes rapidly, so in order to reduce the impact on M, the formula is adjusted as in (20), (21), and (22):

M = -\log\left[\int_{\Omega} \left(\frac{d|D|}{du} - \mu_{AM}\right)^2 du\right] + \log\left[\int_{\Omega} |D|^2\, du\right] - \log\left[\int_{\Omega} \left(\frac{d \arg D}{du} - \mu_{FM}\right)^2 du\right] + 2\log \tau_0 + \log \Omega(\tau_0)   (20)

\mu_{AM} = \frac{1}{\Omega} \int_{\Omega} \left(\frac{d|D|}{du}\right) du   (21)

\mu_{FM} = \frac{1}{\Omega} \int_{\Omega} \left(\frac{d^2 \arg D}{du^2}\right) du   (22)

Finally, τ_0 is used to calculate the instantaneous frequency ω(t), and the fundamental frequency F0 is obtained from (23), (24), and (25):

f_0 = \frac{\omega_0(t)}{2\pi}   (23)

\omega(t) = 2 f_s \arcsin\frac{|y_d(t)|}{2}   (24)

y_d(t) = \frac{D(t + \Delta t/2, \tau_0)}{|D(t + \Delta t/2, \tau_0)|} - \frac{D(t - \Delta t/2, \tau_0)}{|D(t - \Delta t/2, \tau_0)|}   (25)

3.1.2. Channel Spectral Parameter Extraction. The previous approach extracts the sound source information and the channel (vocal tract) spectrum information of the voice and then adjusts them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.

The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is

s(t) = p(t) * v(t)   (26)

Since it is difficult to find v(t) directly, the straight algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated as in (27) and (28):

s_w(t, t') = s(t)\, w(t, t')   (27)

S_W(\omega, t') = \mathrm{FFT}[s_w(t, t')] = S(\omega, t')\, W(\omega, t')   (28)

The short-time spectrum shows periodicity related to the fundamental frequency in both the time domain and the frequency domain. The window function used for the short-time spectrum is given by (29) and (30):

w(t) = \frac{1}{f_0}\, e^{-\pi (t f_0)^2}   (29)

W(\omega) = f_0 \sqrt{2\pi}\, e^{-\pi (\omega/\omega_0)^2}   (30)

However, since both the channel spectrum and the sound source spectrum are related to the fundamental frequency at this point, they cannot yet be considered separated. They need to be further processed to remove the periodicity in the time domain and the frequency domain in order to achieve the separation.

Periodicity removal in the time domain requires designing a pitch-synchronous smoothing window and a compensation window, given by (31), (32), and (33):

w_p(t) = e^{-\pi (t/\tau_0)^2} * h\!\left(\frac{t}{\tau_0}\right)   (31)

h(t) =
\begin{cases}
1 - |t|, & |t| < 1 \\
0, & \text{otherwise}
\end{cases}   (32)

w_c(t) = w_p(t) \sin\!\left(\frac{\pi t}{\tau_0}\right)   (33)

Then the short-time amplitude spectra |S_p(ω, t')| and |S_c(ω, t')| are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is

|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^2 + \xi\, |S_c(\omega, t')|^2}   (34)

where ξ is the mixing factor; ξ = 0.13655 gives the optimal solution.

Similarly, the frequency domain also needs a smoothing window V(ω) and a compensation window U(ω) to remove the periodicity from the short-time spectrum S_W(ω), finally yielding the spectral envelope S_{S'}(ω) with the periodicity removed:

S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega)   (35)

Finally, logarithmic amplitude compression and a frequency-warped discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).

3.2. Parameter Conversion with GMM

3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions:

P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i\, b_i(X)   (36)


where X is an n-dimensional random vector, ω_i is a mixture weight with \sum_{i=1}^{M} \omega_i = 1, and b_i(X) is a component distribution of the GMM; each component is a Gaussian distribution:

b_i(X) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i)}   (37)

where μ_i is the mean vector and Σ_i is the covariance matrix.

Although the types of phonemes are fixed, each phoneme varies in different situations due to its context. We use a GMM to model the acoustic characteristics of the speaker in order to find the most likely mapping at each time.

3.2.2. Establishing the Conversion Function. GMM training estimates the probability density distribution of the samples, and the estimated model is the weighted sum of several Gaussian components. It maps the feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.

(1) Conversion of the Fundamental Frequency. Here the single Gaussian model method is used to convert the fundamental frequency: the converted fundamental frequency is obtained from the mean and variance of the target speaker (μ_tgt, σ_tgt) and the source speaker (μ_src, σ_src) as

f_{0,conv}(t) = \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \times f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \times \mu_{src}   (38)
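A minimal sketch of this single-Gaussian conversion rule (38); the means and standard deviations are simply the statistics of the two speakers' F0 training data, and the names used here are illustrative.

```python
import numpy as np

def convert_f0(f0_src, f0_src_train, f0_tgt_train):
    """Map a source F0 trajectory toward the target speaker's statistics, eq. (38)."""
    mu_src, sigma_src = np.mean(f0_src_train), np.std(f0_src_train)
    mu_tgt, sigma_tgt = np.mean(f0_tgt_train), np.std(f0_tgt_train)
    ratio = sigma_tgt / sigma_src
    return ratio * np.asarray(f0_src) + mu_tgt - ratio * mu_src
```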

(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as

F(X) = E[Y \mid X] = \int Y \cdot P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X) \left[\mu_i^Y + \Sigma_i^{YX} \left(\Sigma_i^{XX}\right)^{-1} \left(X - \mu_i^X\right)\right]   (39)

P_i(X) = \frac{\omega_i\, b_i(X_t)}{\sum_{k=1}^{M} \omega_k\, b_k(X_t)}   (40)

\mu_i = \begin{bmatrix} \mu_i^X \\ \mu_i^Y \end{bmatrix}, \quad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \quad i = 1, \ldots, M   (41)

Here μ_i^X and μ_i^Y are the means of the i-th Gaussian component for the source speaker and the target speaker, Σ_i^{XX} is the covariance matrix of the i-th Gaussian component of the source speaker, Σ_i^{XY} is the cross-covariance matrix of the i-th Gaussian component between the source speaker and the target speaker, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" envelope.
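A possible sketch of the joint-density GMM mapping of (39)-(41), using scikit-learn's GaussianMixture fitted on time-aligned, stacked source/target feature vectors (full covariances assumed); the block partitioning of the means and covariances follows (41), and the posteriors of (40) are computed from the marginal over the source features.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X_src, Y_tgt, n_components=16):
    """Fit a GMM on DTW-aligned joint vectors Z = [X; Y]."""
    Z = np.hstack([X_src, Y_tgt])                       # shape (T, dx + dy)
    return GaussianMixture(n_components=n_components, covariance_type='full').fit(Z)

def convert_spectrum(gmm, X, dx):
    """Apply the MMSE mapping F(X) of eq. (39) frame by frame."""
    w, mu, cov = gmm.weights_, gmm.means_, gmm.covariances_
    M, dy = len(w), mu.shape[1] - dx
    # Posteriors P_i(X) of eq. (40), from the marginal GMM over the source features.
    like = np.stack([w[i] * multivariate_normal(mu[i, :dx], cov[i, :dx, :dx]).pdf(X)
                     for i in range(M)], axis=1)        # shape (T, M)
    post = like / like.sum(axis=1, keepdims=True)
    Y_hat = np.zeros((X.shape[0], dy))
    for i in range(M):
        mu_x, mu_y = mu[i, :dx], mu[i, dx:]
        Sxx, Syx = cov[i, :dx, :dx], cov[i, dx:, :dx]
        # Conditional mean mu_y + Syx * Sxx^-1 * (x - mu_x) for every frame.
        cond = mu_y + (X - mu_x) @ np.linalg.solve(Sxx, Syx.T)
        Y_hat += post[:, i:i + 1] * cond
    return Y_hat
```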

4. Melody Control Model

The composition of Beijing Opera has similarities with the synthesis of general singing voice [7, 8]: through the superimposition of voice and melody, the new pitch of each word is reconstructed. The analysis in Section 2 shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency of the melody has the greatest impact: it indicates the frequency of the vocal-fold vibration. The duration, that is, the length of the pronunciation of each word, controls the rhythm of Beijing Opera and represents the speed of the singing. Energy is positively correlated with sound intensity and represents the emotion.

4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the prose content, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody are carried by the fundamental frequency. The fundamental frequency envelope of a Beijing Opera phrase corresponds to its melody, which includes tone, pitch, and tremolo (vibrato) [9], whereas the pitch of a note in the score is a constant; their comparison is shown in Figure 7.

From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera phrase, but acoustic effects such as vibrato need to be considered. The control design of the fundamental frequency [10] is therefore as shown in Figure 8.
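A possible sketch of the F0 control shown in Figure 8: a note-level F0 value taken from the MIDI pitch, with sinusoidal vibrato and high-pass-filtered Gaussian noise added for fine fluctuation. The vibrato rate and depth, the noise level, and the filter cutoff are illustrative assumptions, not values given in the paper.

```python
import numpy as np
from scipy.signal import butter, lfilter

def f0_curve(midi_pitch, dur_s, fs=200, vib_rate=5.5, vib_cents=30, noise_std=2.0):
    """Generate an F0 contour (Hz) for one note with vibrato and fine fluctuation."""
    t = np.arange(int(dur_s * fs)) / fs
    f0 = 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)        # note frequency from MIDI pitch
    vibrato = f0 * (2.0 ** (vib_cents / 1200.0) - 1.0) * np.sin(2 * np.pi * vib_rate * t)
    b, a = butter(2, 0.3, btype='high')                    # high-pass filter for the noise
    jitter = lfilter(b, a, np.random.randn(len(t)) * noise_std)
    return f0 + vibrato + jitter
```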

Figure 8: The control design of the fundamental frequency (fundamental frequency extracted from MIDI, vibrato processing, Gaussian white noise through a high-pass filter, output F0 curve).

4.2. Time Control Model. Each word in Chinese usually consists of different syllables, and the initials and finals (vowels) in each syllable also play different roles. The initials, whether in normal speech or in Beijing Opera, usually play a supporting role, while the finals carry the pitch and most of the pitch information. In order to ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word and establish the rules for the vowel length shown in Table 2.

Table 2: Duration parameters.

Before modification | After modification
dur_a               | k × dur_a
dur_b               | dur_b
dur_c               | dur_t − (k × dur_a) − dur_b

dur_a: initial part duration; dur_b: initial-to-vowel transition part duration; dur_c: final (vowel) part duration; dur_t: target total duration.

The duration of the initial part is modified according to the proportion k in the table [11]; k is obtained from a large number of comparison experiments between speech and song. The duration of the transition region from initial to vowel remains unchanged. The length of the vowel (final) section is varied so that the total duration of the syllable corresponds to the duration of the corresponding note in the score.
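The rule of Table 2 can be written directly as a small function; k and the measured segment durations are inputs obtained from the analysis described above.

```python
def modify_durations(dur_a, dur_b, dur_c, dur_t, k):
    """Table 2 rule: scale the initial, keep the transition, and let the vowel
    (final) absorb the remainder so the syllable matches the note length dur_t."""
    new_a = k * dur_a                 # initial part
    new_b = dur_b                     # initial-to-vowel transition, unchanged
    new_c = dur_t - new_a - new_b     # vowel (final) part replaces dur_c
    return new_a, new_b, new_c
```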

The method of dividing the vowel boundaries was introduced in Section 2 and will not be repeated here.

4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that in good singing the spectrum has a special resonance region in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listener's impression. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.

4.4. GAN Model

4.4.1. Introduction of the GAN Network. Generative adversarial networks, abbreviated GAN [12-17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: random noise is input and pseudo data are obtained from the generator; part of the real data is taken from the true data, the two are mixed and sent to the discriminator, which gives a true-or-false determination, and the loss is propagated back according to this result. The purpose of the GAN is to estimate the potential distribution of the data samples and to generate new data samples. It is being extensively studied in the fields of image and visual computing and speech and language processing, and it has huge application prospects. This study uses a GAN to synthesize music to compose Beijing Opera music.

4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.

First, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, this data imbalance problem is solved by merging the tracks of similar instruments. Each of the multitrack Beijing Opera scores is merged into five instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.

Then the datasets with merged tracks are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, score sections without lyrics are not what we need, so only the soundtracks accompanying Beijing Opera lyrics are selected.

Finally, in order to obtain meaningful music segments to train the temporal model, the Beijing Opera scores are divided into corresponding music segments. Four bars are regarded as one phrase, and longer passages are cut into pieces of this length. Because pitches that are too high or too low are uncommon, notes lower than C1 or higher than C8 are discarded, and the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track). This completes the preparation and preprocessing of the dataset.
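For reference, the target piano-roll tensor described above can be held in a NumPy array; the shape 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track) and the C1-B7 pitch range follow the text, while the variable names and the MIDI offset are illustrative assumptions.

```python
import numpy as np

N_BAR, N_STEP, N_PITCH, N_TRACK = 4, 96, 84, 5   # 4-bar phrase, 96 steps/bar, C1..B7, 5 tracks
phrase = np.zeros((N_BAR, N_STEP, N_PITCH, N_TRACK), dtype=bool)

def add_note(phrase, bar, step, midi_pitch, track):
    """Mark a note-on in the binary piano roll; pitches outside C1..B7 are dropped."""
    idx = midi_pitch - 24                        # MIDI note 24 corresponds to C1
    if 0 <= idx < N_PITCH:
        phrase[bar, step, idx, track] = True
```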

4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is shown in Figure 10.

The basic framework of the GAN includes a pair of models: a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution, with the discriminator D assisting the generator G. The input of the model is a random Gaussian white-noise signal z; the noise signal is mapped to a new data space via the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on its inputs, the true data x and the generated data G(z), indicating D's confidence that the input is real data rather than generated fake data. In this way it is judged whether the data generated by G are good or bad. When D can no longer distinguish between the real data x and the generated data G(z), the generator G is considered optimal.

The goal of D is to distinguish between real data and fake data, making D(x) as large as possible while D(G(z)) is as small as possible, so that the gap between the two is as large as possible. The goal of G, conversely, is to make the response D(G(z)) of its own generated data consistent with the response D(x) of the real data, so that D cannot distinguish generated data from real data. Therefore, the optimization is a process of mutual competition and confrontation: the performance of G and D is continuously improved during repeated iterations until D(G(z)) is finally consistent with the response D(x) to the real data, and neither G nor D can be further optimized.

Figure 9: Illustration of the dataset preparation and data preprocessing procedure (merging similar instrument tracks into huqins, flutes, suonas, drums, and cymbals; screening for best matching confidence and for soundtracks with Beijing Opera lyrics; data cleaning).

Figure 10: GAN structure diagram (random noise z → generator G → fake data G(z); real data x; discriminator/critic (WGAN-GP) → real/fake decision over 4-bar phrases of 5 tracks).

The training process can be modeled as a simple MinMax problem:

\min_G \max_D\; D(x) - D(G(z))   (42)

The MinMax optimization objective is defined as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (43)

The GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but approximates it directly. Theoretically it can approximate the real data distribution arbitrarily well, which is the biggest advantage of the GAN.
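A minimal PyTorch-style sketch of one training step of the min-max game in (42); the generator G and critic D are assumed to be predefined nn.Module networks with matching optimizers, and the gradient penalty used by WGAN-GP (the critic labeled in Figure 10) is omitted for brevity.

```python
import torch

def train_step(G, D, real_x, opt_G, opt_D, z_dim=128):
    """One adversarial update: D ascends D(x) - D(G(z)), G then descends -D(G(z))."""
    z = torch.randn(real_x.size(0), z_dim)

    # Critic/discriminator update.
    opt_D.zero_grad()
    d_loss = -(D(real_x).mean() - D(G(z).detach()).mean())
    d_loss.backward()
    opt_D.step()

    # Generator update: make D(G(z)) as large as possible.
    opt_G.zero_grad()
    g_loss = -D(G(z)).mean()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```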

The training and testing process of the GAN-generated music dataset is shown in Figure 11.

The generator-produced chord-section data together with data of a specific music style, as well as generator-produced multitrack chord-section data together with multitrack groove data, are sent to the GAN for training, so that the model learns to generate music of specific styles with the corresponding grooves.

5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experimental simulation was recorded in an anechoic room; with comprehensive consideration of the factors discussed earlier, it can better meet the practical needs of the speech conversion system. The voice library is recorded by a woman in a standard Mandarin accent and contains numbers, technical nouns, everyday words, etc., as the source speech. Another person then recorded a small number of sentences as the voice to be converted. Figure 12 shows the tone conversion process.

5.1.2. Experimental Results. Figures 13, 14, and 15 correspond to the speech of the source speaker, the speech of the target speaker, and the converted speech obtained with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and each voice is set to 5 s during the experiment. Their MFCCs are visualized in the same figures.

They show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs it can be seen directly that the shape of the converted MFCC parameters is closer to the target speech, indicating that the converted speech features tend toward the target speech features.

Figure 11: Training and testing process of the GAN (bar generator producing chord, style, and groove data).

Figure 12: Tone control model (training phase: STRAIGHT analysis of source and target voices, time alignment with DTW, GMM training to establish mapping rules; conversion phase: STRAIGHT analysis of the voice to be converted, conversion of the MFCC parameters and fundamental frequency F0, STRAIGHT synthesis).

5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing, followed by conversions using Only_dura, dura_F0, dura_SP, and all models; the Beijing Opera produced by the four synthesis methods was compared with the original. Here, Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental frequency control model and the duration control model; dura_SP uses the duration control model and the spectrum control model; all models uses the three control models simultaneously. "Real" is the source Beijing Opera.

The melody control model can thus be summarized as in Figure 16.

5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of a speech conversion system is also based on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

Table 3: MOS grading.

Score | Evaluation
1     | Uncomfortable and unbearable
2     | There is a sense of discomfort, but it can be endured
3     | Distortion can be detected and feels uncomfortable
4     | Slightly perceptible distortion, but no discomfort
5     | Good sound quality, no distortion

Table 4: Experimental results (MOS scores).

Method     | Beijing Opera 1 | Beijing Opera 2 | Beijing Opera 3
Only_dura  | 1.25 | 1.29 | 1.02
dura_F0    | 1.85 | 1.97 | 1.74
dura_SP    | 1.78 | 2.90 | 2.44
all models | 3.27 | 3.69 | 3.28
real       | 5    | 5    | 5

Figure 13: Source speech spectrogram.

The MOS scoring criterion divides speech quality into 5 levels (see Table 3). The testers listen to the converted speech and give the score of the quality level to which the speech belongs according to these 5 levels. An MOS score of about 3.5 is called communication quality: the quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality: the speech then has high intelligibility, but the naturalness is poor.

Ten testers were asked to give MOS scores for the above synthesis results. The results are shown in Table 4.

5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies; the melody is determined by the pitch, timbre, sound intensity, sound length, and other factors. In this experiment, each word is segmented by its distinctive characteristics such as zero-crossing rate and energy. Then the tone control model and the melody control model are designed, important parameters such as the fundamental frequency, spectrum, and duration are extracted, and tools such as MFCC, DTW, and GMM are used to analyze and convert the extracted features, finally producing the synthesized opera fragments.

Compared with other algorithms, the straight algorithm performs better in terms of the naturalness of the synthesis and the range over which parameters can be modified, so the straight algorithm is also selected for the synthesis of the Beijing Opera.

The above-mentioned 10 testers were again asked to perform MOS scoring on the synthesized result. The results are shown in Table 5.

According to the test results, the subjective scores reach an average of 3.7 points, indicating that the design basically completes the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real Beijing Opera.

Table 5: Rating results (MOS scores from 10 student testers).

                         | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10
Source Opera fragment    | 5  | 5  | 5  | 5  | 5  | 5  | 5  | 5  | 5  | 5
Synthetic Opera fragment | 4  | 4  | 4  | 3  | 3  | 4  | 3  | 4  | 4  | 4

Figure 14: Target speech spectrogram.

Figure 15: Converted speech spectrogram.

Figure 16: Melody control model (feature extraction of syllable fundamental frequency, spectrum envelope, and syllable duration from the voice; note fundamental frequency and note length from MIDI; F0 control model, time length control model, and spectrum control model feeding the Beijing Opera synthesis).

6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.

Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.

[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.

[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.

[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.

[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.

[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.

[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.

[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.

[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.

[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.

[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.

[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.

[15] "Interpretable representation learning by information maximizing generative adversarial nets."

[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.

[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.



Figure 1: Beijing Opera synthesis. The tone control model maps the source voice (B tone, B content) toward the target voice (A tone, A content) to give a new target voice (A tone, B content); the melody control model then produces Beijing Opera fragments, which are connected and superimposed with background music to form the final Beijing Opera.

Figure 2: Time-domain waveform, short-time energy curve, and zero-crossing rate of the test phrase (amplitude, energy, and ZCR plotted against time).

syllable; it is the smallest linear speech unit obtained when speech is divided from the perspective of sound quality. From a physiological point of view, one articulatory movement forms one phoneme. Phonemes fall into two categories, vowels and consonants, classified according to whether the airflow is obstructed by the articulatory organs during pronunciation: the unobstructed sounds are vowels and the obstructed ones are consonants.

2.1.2. Phoneme Segmentation. Because instances of the same phoneme share the same characteristics while different phonemes and their combinations have different characteristics, each phoneme can be segmented. Figure 2 shows the time-domain waveform, energy curve, and zero-crossing rate of "jiao Zhang Sheng yin cang zai qi pan zhi xia" from "Matchmaker's Wall" as sung in Beijing Opera. It can be seen that the consonant part of an initial is more irregular, whereas the final (vowel) that follows it has a periodic waveform; the former shows a large zero-crossing rate and low energy, while the latter carries most of the energy. Where silence appears, both values are small (the red lines mark the beginning and end of each word).
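The segmentation cues described here (periodic finals with high energy, noisy initials with high zero-crossing rate and low energy, silence with both small) can be computed frame by frame. The sketch below is a generic short-time energy/ZCR computation in the spirit of Figure 2, not the authors' exact segmentation code; frame length, hop size, and thresholds are illustrative assumptions.

```python
import numpy as np

def short_time_energy_zcr(x, frame_len=256, hop=128):
    """Per-frame energy and zero-crossing rate, the cues shown in Figure 2."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len].astype(float)
        energy[i] = np.sum(frame ** 2)
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return energy, zcr

# A frame is treated as a final (vowel) when its energy is high, as an initial
# (consonant) when ZCR is high and energy low, and as silence when both are
# small; the thresholds are data-dependent and must be tuned per recording.
```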

2.2. Selection and Method of Characteristic Parameters

2.2.1. Choice of Personality Characteristics. Whether in Beijing Opera or ordinary speech, speakers differ in personal habits and pronunciation style on the one hand, and in the physiology of their vocal organs (or, in Beijing Opera, in the role played by different actors) on the other, so each person realizes each phoneme slightly differently. Generally speaking, the parameters that characterize a speaker's personality are segmental, suprasegmental, and linguistic features [3, 4].

Segmental features: they describe the timbral characteristics of speech. The main parameters include formant positions, formant bandwidths, spectral tilt, pitch frequency, and energy. Segmental features are mainly related to the physiological and phonetic properties of the vocal organs, and also to the speaker's emotional state. The features used in the tone control model in Section 3 are mainly of this kind.

Suprasegmental features: they mainly describe the manner of speaking, such as phoneme duration, pitch, and stress, which listeners perceive as changes in speaking rate, pitch, and volume. The features used in the melody control model in Section 4 are mainly of this kind.

Linguistic features: for example, idioms, dialect, accent, and so on.

However, Beijing Opera and ordinary speech differ in their purpose of pronunciation and expression. The pitch and duration of each word in Beijing Opera are controlled by the score in addition to its own pronunciation. Ordinary speech mainly conveys content, whereas Beijing Opera expresses emotion mainly through melody. Based on the characteristics described above, the main factors considered in the timbre mapping studied here are as follows.

Pitch: it is determined by the vibration frequency of the sound source over a period of time; the higher the vibration frequency, the higher the pitch, and vice versa. In Beijing Opera the pitch depends on the character role: LaoSheng roles are relatively low, while Dan roles are relatively high.

Duration: the length of a sound is determined by how long the sound source vibrates; the longer the vibration, the longer the sound, and vice versa. The average duration per word in Beijing Opera is relatively long, and it varies over a wide range.

Sound intensity: the strength of a sound depends on the vibration amplitude of the sound source; the greater the amplitude, the stronger the sound, and the smaller the amplitude, the weaker the sound. Because the amplitude in Beijing Opera is driven by strong emotion, its range exceeds that of ordinary speech, which generally stays within a relatively small and evenly distributed amplitude range.


Table 1: The correlations between subjective and objective quantities of speech.

Objective quantity        pitch   volume   tone   duration
fundamental frequency     +++     +        ++     +
amplitude                 +       +++      +      +
spectral envelope         ++      +        +++    +
time                      +       +        +      +++

Relevance is positively related to the number of '+' signs.

Timbre: the frequency content of different sounds always shows distinctive characteristics in the waveform; for example, two Beijing Opera characters singing the same passage can be told apart by the difference between their timbres.

By relating the subjective quantities of speech to the objective ones, the correlations in Table 1 are obtained.

The acoustic characteristics of the speech signal are an indispensable object of study for speech analysis and speech transformation. They are mainly reflected in prosody and spectrum: prosody is perceived as pitch, duration, and volume, which acoustically correspond to fundamental frequency, duration, and amplitude, while the spectral envelope is perceived as the timbral characteristic.

2.2.2. MFCC Feature Extraction. MFCC stands for Mel Frequency Cepstral Coefficient; the Mel scale is based on human auditory properties and is nonlinearly related to frequency in Hz. MFCCs exploit this relationship to compute a cepstral signature from the Hz spectrum. The extraction procedure is as follows.

(1) Pre-Emphasis. Pre-emphasis passes the speech signal through a high-pass filter:

H(z) = 1 - \mu z^{-1}   (1)

The value of \mu lies between 0.9 and 1.0; we usually take 0.97. The purpose of pre-emphasis is to boost the high-frequency part, flatten the signal spectrum, and keep the same signal-to-noise ratio over the whole band from low to high frequencies. At the same time it compensates for the high-frequency attenuation introduced by the vocal cords and lips during production and highlights the high-frequency formants.

(2) Framing. Every N consecutive sampling points are grouped into one unit of observation, called a frame. Typically N is 256 or 512, covering about 20-30 ms. To avoid excessive change between two adjacent frames, adjacent frames overlap by M sampling points, where M is usually about 1/2 or 1/3 of N. The sampling frequency of voice signals used in speech recognition [5] is usually 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 samples corresponds to (256/8000) x 1000 = 32 ms.

(3) Windowing (Hamming Window). Each frame is multiplied by a Hamming window to increase continuity at its left and right ends. If the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame size, then the windowed signal is S'(n) = S(n) x W(n), with W(n) given by

W(n, a) = (1 - a) - a \cos\left[\frac{2\pi n}{N-1}\right], \quad 0 \le n \le N-1   (2)

Different values of a produce different Hamming windows; in general a = 0.46, which gives

s'_n = \left[0.54 - 0.46 \cos\left(\frac{2\pi (n-1)}{N-1}\right)\right] \cdot s_n   (3)

(4) Fast Fourier Transform. Because it is usually difficult to see the characteristics of a signal in the time domain, it is normally converted to the frequency domain to observe its energy distribution; different energy distributions represent different voice characteristics. Therefore, after multiplication by the Hamming window, each frame is transformed with a fast Fourier transform to obtain its spectrum, and the modulus of the spectrum is taken to obtain the power spectrum of the speech signal. The DFT of the voice signal is

X_a(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N   (4)

where x(n) is the input speech signal and N is the number of points in the Fourier transform.

(5) Triangular Bandpass Filtering. The power spectrum is passed through a bank of M Mel-scale triangular filters (the number of filters is comparable to the number of critical bands); M usually takes 22-26. The spacing between the center frequencies f(m) narrows as m decreases and broadens as m increases, as shown in Figure 3.

The frequency response of the triangular filter is defined as

H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{2\,(k - f(m-1))}{[f(m+1) - f(m-1)]\,[f(m) - f(m-1)]}, & f(m-1) \le k \le f(m) \\ \dfrac{2\,(f(m+1) - k)}{[f(m+1) - f(m-1)]\,[f(m+1) - f(m)]}, & f(m) \le k \le f(m+1) \\ 0, & k \ge f(m+1) \end{cases}   (5)

(6) Logarithmic Filter-Bank Energies. The logarithmic energy output of each filter bank is

s(m) = \ln\left[\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right], \quad 0 \le m \le M   (6)

(7) Discrete Cosine Transform. The MFCC coefficients are obtained by a discrete cosine transform (DCT):

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left[\frac{\pi n (m - 0.5)}{M}\right], \quad n = 1, 2, \ldots, L   (7)

The logarithmic energies above are passed through the DCT to obtain the L-order Mel-scale cepstral parameters, where L is the MFCC order, usually 12-16, and M is the number of triangular filters.

(8) Logarithmic Energy. In addition, the volume (i.e., energy) of a frame is an important speech feature and is very easy to compute. The logarithmic energy of each frame is therefore usually appended, so that the basic feature vector of each frame gains one more dimension: one logarithmic energy plus the cepstral parameters.

(9) Dynamic Difference Parameters (First-Order and Second-Order Differences). The standard cepstral parameters (MFCCs) only reflect the static characteristics of the speech; the dynamic characteristics can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features effectively improves recognition performance. The difference parameters can be computed with

d_t = \begin{cases} C_{t+1} - C_t, & t < K \\ \dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{2 \sum_{k=1}^{K} k^2}, & \text{otherwise} \\ C_t - C_{t-1}, & t \ge Q - K \end{cases}   (8)

where d_t is the t-th first-order difference, C_t is the t-th cepstral coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be 1 or 2. Applying the same formula to the first-order differences yields the second-order difference parameters.

2.3. Signal Characteristics Analysis. According to previous research on speech signal processing, analysis mainly focuses on two domains: the time domain and the frequency domain.

2.3.1. Time-Domain Analysis. In the time domain the horizontal axis is time and the vertical axis is amplitude. By observing the time-domain waveform we can obtain some important features of the speech signal, such as duration, the starting and ending positions of syllables, sound intensity (energy), and vowels (see Figure 4).

2.3.2. Frequency-Domain Analysis. This covers the spectrum, power spectrum, cepstrum, spectral envelope, and so on of the voice signal. The spectrum of a speech signal is generally regarded as the product of the frequency response of the vocal-tract system and the spectrum of the excitation source, and both are time-varying. Therefore frequency-domain analysis of speech signals is usually performed with the short-time Fourier transform (STFT), defined as

X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} x(m)\, w(n-m)\, e^{-j\omega m}   (9)

The study of Chinese song synthesis here is based on parameter modification; from (9) we can see that the short-time Fourier transform has two independent variables (n and \omega), so it is both a discrete function of time n and a continuous function of angular frequency \omega. In the formula, w(n) is a window function; different values of n extract different short speech segments, and the subscript n is what distinguishes it from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics:

(1) High frequency resolution: the main lobe is narrow and sharp.

(2) Large side-lobe attenuation, so that spectral leakage caused by other frequency components is small. These two conditions are in fact contradictory and cannot be satisfied at the same time, so a compromise is usually adopted and a Hamming window is often chosen.

However, both time-domain analysis and frequency-domain analysis have their own limitations: time-domain analysis gives no intuitive picture of the frequency characteristics of the speech signal, while frequency-domain analysis loses the variation of the signal over time. For this reason, the Beijing Opera synthesis experiments analyze the speech signal with the improved method of spectrogram analysis described next.

Figure 3: Mel frequency filter bank (triangular filters with center frequencies f(0)-f(7)).

Figure 4: Time-domain waveform of "jiao Zhang Sheng yin cang zai qi pan zhi xia" (amplitude versus time in seconds).

2.3.3. Spectrogram Analysis. The Fourier-analysis display of a speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional representation of how the spectrum of a voice evolves over time, with frequency on the vertical axis and time on the horizontal axis; the intensity of any given frequency component at a given moment is expressed by the gray level or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the spoken sentence: it combines the properties of spectra and time-domain waveforms and clearly shows how the speech spectrum changes over time, i.e., it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.
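A minimal spectrogram sketch corresponding to the STFT in (9) and to the displays in Figures 2, 5, and 13-15; the frame length, hop size, and dB scaling are illustrative choices, not the paper's settings.

```python
import numpy as np

def spectrogram(x, fs, frame_len=256, hop=128):
    """Short-time Fourier magnitude (eq. 9) with a Hamming window."""
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i*hop:i*hop+frame_len] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))              # |X_n(e^jw)| per frame
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)          # frequency axis (Hz)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs  # time axis (s)
    return 20 * np.log10(spec + 1e-10), freqs, times        # dB scale for display
```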

Figure 5: Spectrogram of "jiao Zhang Sheng yin cang zai qi pan zhi xia" (frequency in Hz versus time).

2.4. Straight Algorithm Introduction. STRAIGHT stands for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram". It is a relatively accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The straight algorithm builds on the source-filter model, in which the source comes from vocal-cord vibration and the filter is the vocal-tract transfer function. It adaptively interpolates and smooths the short-time speech spectrum in both the time and frequency domains so as to extract the spectral envelope more accurately, and it can adjust speech duration, fundamental frequency, and spectral parameters over a wide range without degrading the quality of the synthesized speech. The straight analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral parameter estimation, and speech synthesis. The first two are described in detail later; only the synthesis process is described here, as shown in Figure 6.

First, the speech signal is input, and its fundamental frequency F0 and spectral envelope are extracted by the straight algorithm; the parameters are then modified to generate a new sound source and a time-varying filter. Following the source-filter model, the voice is synthesized with

y(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, v_{t_i}(t - T(t_i))   (10)

where v_{t_i}(t) and T(t_i) are given by

v_{t_i}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} V(\omega, t_i)\, \varphi(\omega)\, e^{j\omega t}\, d\omega   (11)

T(t_i) = \sum_{t_k \in Q,\ k < i} \frac{1}{\sqrt{G(f_0(t_k))}}   (12)

In these formulas, Q represents the positions of the group of samples forming the synthesis excitation, and G represents the pitch modulation; the modulated F0 can be matched arbitrarily to any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and of the original signal, for example a frequency-proportional linear phase shift that controls the fine structure of F0. V(\omega, t_i) is the Fourier transform of the corresponding minimum-phase pulse, and A\{S[u(\omega), r(t)], u(\omega), r(t)\} is calculated from the modulated amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15):

V(\omega, t) = \exp\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} h_t(q)\, e^{j\omega q}\, dq\right]   (13)

h_t(q) = \begin{cases} 0, & q < 0 \\ c_t(0), & q = 0 \\ 2\,c_t(q), & q > 0 \end{cases}   (14)

c_t(q) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-j\omega q}\, \lg A\{S[u(\omega), r(t)], u(\omega), r(t)\}\, d\omega   (15)

Here q denotes the frequency variable.

Figure 6: Straight synthesis system — the input voice is analyzed into the fundamental frequency F0 (fundamental extraction) and the spectral envelope; after parameter adjustment these voice parameters drive a sound source and a time-varying filter, which output the synthetic speech.

Straight listening experiments show that, even with high-sensitivity headphones, the synthesized speech signal is almost indistinguishable from the original signal.
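As a rough illustration of the source-filter resynthesis idea behind (10) — and explicitly not the actual STRAIGHT implementation — the following toy sketch places pitch-synchronous excitation pulses shaped by the local spectral envelope and overlap-adds them; the frame layout, the zero-phase pulse, and the unvoiced fallback are assumptions made for the example.

```python
import numpy as np

def resynthesize(f0, envelope, fs, hop=80):
    """Toy source-filter resynthesis in the spirit of eq. (10).

    f0       : fundamental frequency per frame (Hz, 0 for unvoiced)
    envelope : magnitude spectral envelope per frame, shape (n_frames, nfft//2 + 1)
    fs       : sampling rate; hop is the frame shift in samples
    """
    n_frames, n_bins = envelope.shape
    nfft = 2 * (n_bins - 1)
    out = np.zeros(n_frames * hop + nfft)
    t = 0.0                                        # next pulse position (samples)
    while t < n_frames * hop:
        frame = min(int(t) // hop, n_frames - 1)
        cur_f0 = f0[frame] if f0[frame] > 0 else 100.0   # crude unvoiced fallback
        # one excitation pulse shaped by the local envelope (zero phase here,
        # whereas STRAIGHT uses minimum-phase pulses plus an all-pass term)
        pulse = np.fft.irfft(envelope[frame], nfft)
        pulse = np.roll(pulse, nfft // 2) / np.sqrt(cur_f0)  # 1/sqrt(G(f0)) scaling
        start = int(t)
        out[start:start + nfft] += pulse
        t += fs / cur_f0                            # advance by one pitch period
    return out[:n_frames * hop]
```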

3. Tone Control Model

Voice timbre conversion refers to processing a voice signal so that the semantic content stays the same but only the timbre changes, making one person's voice (the source voice) sound, after conversion, like another person's voice (the target voice). This section introduces the extraction, with the straight algorithm, of the parameters most closely related to timbre, followed by GMM training on the extracted parameters to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are resynthesized with straight to achieve voice conversion. As Section 2 showed, the timbre characteristics of speech correspond mainly to the parameters "fundamental frequency F0" and "channel (vocal-tract) spectrum".

3.1. Fundamental Frequency and Channel Spectrum Extraction

3.1.1. Extraction of the Fundamental Frequency. The straight algorithm yields a fundamental-frequency trajectory with good time-domain resolution. It is based on wavelet analysis: the fundamental component is first located in the extracted audio band, and its instantaneous frequency is then computed and taken as the fundamental frequency.

The extraction of the fundamental frequency can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning applies a wavelet transform to the voice signal to obtain wavelet coefficients, which are converted into a set of instantaneous frequencies from which a candidate F0 is selected for each frame. Trajectory smoothing uses the computed high-frequency energy ratio (the equivalent of minimum noise energy) to select the most likely F0 among the instantaneous-frequency candidates, forming a smooth pitch trajectory. Fine positioning then fine-tunes the current F0 through an FFT. The process is as follows.

Let the input signal be s(t) and the output composite signal be D(t, \tau_c), where g_{AG}(t) is the analyzing wavelet obtained by passing the input through a Gabor filter and \tau_c is the analysis period of the analyzing wavelet:

D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(t)\, g_{AG}\!\left(\frac{t - \mu}{\tau_c}\right) d\mu   (16)

g_{AG}(t) is defined in (17), with g(t) given in (18):

g_{AG}(t) = g\!\left(t - \frac{1}{4}\right) - g\!\left(t + \frac{1}{4}\right)   (17)

g(t) = e^{-\pi (t/\eta)^2}\, e^{-j 2\pi t}   (18)

Here \eta is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.

Through this calculation the variable "fundamentalness" is introduced, denoted M(t, \tau_0):

M = -\log\!\left[\int_{\Omega} \left(\frac{d|D|}{du}\right)^{2} du\right] + \log\!\left[\int_{\Omega} |D|^2\, du\right] - \log\!\left[\int_{\Omega} \left(\frac{d \arg D}{du}\right)^{2} du\right] + 2\log\tau_0 + \log\Omega(\tau_0)   (19)

The first term is the amplitude-modulation (AM) value; the second is the total energy, used to normalize the AM value; the third is the frequency-modulation (FM) value; the fourth is the square of the fundamental frequency, used to normalize the FM value; and the fifth is the normalization factor of the time-domain integration interval. From the formula it follows that M is maximized when the AM and FM terms are minimized, which identifies the fundamental component.


However, in practice F0 changes rapidly, so to reduce its impact on M the formula is adjusted as in (20), (21), and (22):

M = -\log\!\left[\int_{\Omega} \left(\frac{d|D|}{du} - \mu_{AM}\right)^{2} du\right] + \log\!\left[\int_{\Omega} |D|^2\, du\right] - \log\!\left[\int_{\Omega} \left(\frac{d \arg D}{du} - \mu_{FM}\right)^{2} du\right] + 2\log\tau_0 + \log\Omega(\tau_0)   (20)

\mu_{AM} = \frac{1}{\Omega} \int_{\Omega} \frac{d|D|}{du}\, du   (21)

\mu_{FM} = \frac{1}{\Omega} \int_{\Omega} \frac{d^{2} \arg(D)}{du^{2}}\, du   (22)

Finally, \tau_0 is used to calculate the instantaneous frequency \omega(t) and to obtain the fundamental frequency F0 through (23), (24), and (25):

f_0 = \frac{\omega_0(t)}{2\pi}   (23)

\omega(t) = 2 f_s \arcsin\frac{|y_d(t)|}{2}   (24)

y_d(t) = \frac{D(t + \Delta t/2,\ \tau_0)}{|D(t + \Delta t/2,\ \tau_0)|} - \frac{D(t - \Delta t/2,\ \tau_0)}{|D(t - \Delta t/2,\ \tau_0)|}   (25)
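A minimal numerical sketch of this idea — a coarse F0 estimate from the instantaneous frequency of a Gabor-filtered signal, in the spirit of (16)-(25). The candidate grid, the variance-based "stability" score standing in for the fundamentalness measure M, and the kernel length are simplifications, not the exact STRAIGHT procedure.

```python
import numpy as np

def coarse_f0(x, fs, candidates=np.arange(80.0, 400.0, 5.0), eta=1.2):
    """Pick the candidate frequency whose Gabor-filter output has the most
    stable instantaneous frequency (a stand-in for maximizing M in eq. 19)."""
    best_f0, best_score = 0.0, -np.inf
    n = np.arange(len(x))
    for fc in candidates:
        period = fs / fc                                   # samples per cycle
        t = (n - len(x) / 2) / period                      # time measured in periods
        g = np.exp(-np.pi * (t / eta) ** 2) * np.exp(-2j * np.pi * t)   # eq. (18)
        # analyzing wavelet g_AG(t) = g(t - 1/4) - g(t + 1/4), eq. (17)
        shift = int(round(period / 4))
        g_ag = np.roll(g, shift) - np.roll(g, -shift)
        d = np.convolve(x, g_ag, mode="same") / np.sqrt(period)         # cf. eq. (16)
        phase = np.unwrap(np.angle(d))
        inst_f = np.diff(phase) * fs / (2 * np.pi)         # instantaneous frequency
        score = -np.var(inst_f)                            # stable frequency -> high score
        if score > best_score:
            best_score, best_f0 = score, float(np.abs(np.median(inst_f)))
    return best_f0
```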

3.1.2. Channel Spectral Parameter Extraction. The previous approach was to extract the sound-source information and the channel-spectrum information from the voice and then adjust them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which degrades the final result.

The relationship among the voice signal s(t), the channel parameter v(t), and the sound-source parameter p(t) is

s(t) = p(t) * v(t)   (26)

where * denotes convolution.

Since it is difficult to obtain v(t) directly, the straight algorithm computes the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is computed as in (27) and (28):

s_w(t, t') = s(t)\, w(t, t')   (27)

S_W(\omega, t') = \mathrm{FFT}[s_w(t, t')] = S(\omega, t')\, W(\omega, t')   (28)

The short-time spectrum shows periodicity related to the fundamental frequency in both the time domain and the frequency domain. The short-time spectral window functions used are (29) and (30):

w(t) = \frac{1}{f_0}\, e^{-\pi (t f_0)^{2}}   (29)

W(\omega) = f_0 \sqrt{2\pi}\, e^{-\pi (\omega/\omega_0)^{2}}   (30)

However, since both the channel spectrum and the sound-source spectrum are still related to the fundamental frequency at this point, they cannot yet be considered separated; the periodicity must be further removed in both the time domain and the frequency domain to achieve the separation.

Removing the periodicity in the time domain requires designing a pitch-synchronous smoothing window and a compensation window, as in (31), (32), and (33):

w_p(t) = e^{-\pi (t/\tau_0)^{2}} * h\!\left(\frac{t}{\tau_0}\right)   (31)

h(t) = \begin{cases} 1 - |t|, & |t| < 1 \\ 0, & \text{otherwise} \end{cases}   (32)

w_c(t) = w_p(t)\, \sin\!\left(\frac{\pi t}{\tau_0}\right)   (33)

The two windows give the short-time amplitude spectra |S_p(\omega, t')| and |S_c(\omega, t')|, respectively, and the short-time amplitude spectrum with the periodicity removed is finally obtained as

|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^{2} + \xi\, |S_c(\omega, t')|^{2}}   (34)

where \xi is a mixing factor; the optimal solution is obtained with \xi = 0.13655.
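A small sketch of (31)-(34), assuming \tau_0 is the pitch period expressed in samples; the window support width is an illustrative choice.

```python
import numpy as np

def pitch_sync_windows(tau0, width=3.0):
    """Pitch-synchronous smoothing window w_p (eq. 31) and compensation window w_c (eq. 33)."""
    t = np.arange(-width * tau0, width * tau0 + 1)
    gauss = np.exp(-np.pi * (t / tau0) ** 2)
    tri = np.maximum(1.0 - np.abs(t / tau0), 0.0)       # h(t / tau0), eq. (32)
    w_p = np.convolve(gauss, tri, mode="same")
    w_p /= w_p.sum()
    w_c = w_p * np.sin(np.pi * t / tau0)
    return w_p, w_c

def remove_periodicity(spec_p, spec_c, xi=0.13655):
    """Combine the two smoothed amplitude spectra as in eq. (34)."""
    return np.sqrt(np.abs(spec_p) ** 2 + xi * np.abs(spec_c) ** 2)
```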

Similarly, the frequency domain also needs a smoothing window V(\omega) and a compensation window U(\omega) to remove the periodicity of the short-time spectrum S_W(\omega), finally giving the periodicity-free spectral envelope S_{S'}(\omega):

S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega)   (35)

Finally, logarithmic amplitude compression and a warped-frequency discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCCs are described in detail in Section 2).

3.2. GMM-Based Parameter Conversion

3.2.1. GMM Profile. The Gaussian mixture model (GMM) [6] can be expressed as a linear combination of Gaussian probability density functions:

P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i\, b_i(X)   (36)


where X is an n-dimensional random vector, the \omega_i are mixture weights with \sum_{i=1}^{M} \omega_i = 1, and each b_i(X) is a component of the GMM, a Gaussian distribution:

b_i(X) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(X - \mu_i)^{T} \Sigma_i^{-1} (X - \mu_i)}   (37)

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.

Although the set of phonemes is fixed, each phoneme varies with context. We therefore use a GMM to model the speaker's acoustic characteristics and find the most likely mapping at each time.

3.2.2. Establishing the Conversion Function. GMM training estimates the probability density of the samples; the estimated (trained) model is a weighted sum of several Gaussian components. It maps the feature matrices of the source speech to those of the target speech, increasing the accuracy and robustness of the algorithm and establishing the link between the two voices.

(1) Fundamental Frequency Conversion. A single-Gaussian model is used to convert the fundamental frequency: the converted fundamental frequency is obtained from the mean and variance of the target speaker (\mu_{tgt}, \sigma_{tgt}) and of the source speaker (\mu_{src}, \sigma_{src}) as

f_{0,\mathrm{conv}}(t) = \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\, f_{0,\mathrm{src}}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\, \mu_{src}   (38)
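Equation (38) is a direct mean/variance normalization and maps straightforwardly to code; whether the statistics are computed on raw or log-F0 and over voiced frames only is an implementation choice not specified here.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Single-Gaussian F0 conversion, eq. (38)."""
    ratio = sigma_tgt / sigma_src          # sqrt(sigma_tgt^2 / sigma_src^2)
    return ratio * np.asarray(f0_src) + mu_tgt - ratio * mu_src
```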

(2) Channel Spectrum Conversion. The model's mapping rule is a (piecewise) linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as

F(X) = E[Y \mid X] = \int Y\, P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X)\left[\mu_i^{Y} + \Sigma_i^{YX} \left(\Sigma_i^{XX}\right)^{-1} (X - \mu_i^{X})\right]   (39)

P_i(X) = \frac{\omega_i\, b_i(X_t)}{\sum_{k=1}^{M} \omega_k\, b_k(X_t)}   (40)

\mu_i = \begin{bmatrix} \mu_i^{X} \\ \mu_i^{Y} \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \qquad i = 1, \ldots, M   (41)

Here \mu_i^{X} and \mu_i^{Y} are the means of the i-th Gaussian component for the source and target speakers, \Sigma_i^{XX} is the covariance matrix of the i-th Gaussian component of the source speaker, \Sigma_i^{YX} is the cross-covariance matrix between the i-th Gaussian components of the source and target speakers, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: Fundamental-frequency envelope of "jiao Zhang Sheng yin cang zai qi pan zhi xia" (envelope value versus time).
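As an illustration of (36)-(41), a joint-density GMM can be fitted on time-aligned source/target MFCC frames and then used for minimum mean-square-error conversion. The sketch below uses scikit-learn's GaussianMixture and SciPy only as convenient stand-ins; it is a schematic of the training and mapping described above, and it assumes DTW alignment of the frame pairs has already been done.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_mfcc, tgt_mfcc, n_components=16):
    """Fit a GMM on joint [X; Y] vectors of DTW-aligned source/target frames."""
    joint = np.hstack([src_mfcc, tgt_mfcc])              # shape (frames, 2*dim)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(joint)
    return gmm

def convert_frames(gmm, src_mfcc):
    """MMSE mapping of source MFCC frames, eq. (39)-(41)."""
    dim = src_mfcc.shape[1]
    mu_x, mu_y = gmm.means_[:, :dim], gmm.means_[:, dim:]
    cov_xx = gmm.covariances_[:, :dim, :dim]
    cov_yx = gmm.covariances_[:, dim:, :dim]
    out = np.zeros_like(src_mfcc)
    for t, x in enumerate(src_mfcc):
        # posterior P_i(X), eq. (40), from the marginal source Gaussians
        lik = np.array([gmm.weights_[i] *
                        multivariate_normal.pdf(x, mu_x[i], cov_xx[i])
                        for i in range(gmm.n_components)])
        post = lik / lik.sum()
        for i in range(gmm.n_components):               # eq. (39)
            out[t] += post[i] * (mu_y[i] +
                                 cov_yx[i] @ np.linalg.solve(cov_xx[i], x - mu_x[i]))
    return out
```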

4. Melody Control Model

The composition of Beijing Opera has similarities with the synthesis of general singing voices [7, 8]: by superimposing voice and melody, the new pitch of each word is reconstructed. The analysis in Section 2 shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. The fundamental frequency has the greatest impact, as it indicates the vibration frequency of the human voice; the duration and pronunciation length of each word control the rhythm of Beijing Opera, which reflects the speed of the singing; and energy is positively correlated with sound intensity, representing the emotion.

4.1. The Fundamental Frequency Conversion Model. Although speech and Beijing Opera are produced by the same human organs, speech focuses on the prose content while Beijing Opera emphasizes the emotional expression of the melody. Most features of the melody reside in the fundamental frequency: the fundamental-frequency envelope of a Beijing Opera piece corresponds to the melody, including tone, pitch, and vibrato [9], whereas the pitch of a note in the score is a constant; their comparison is shown in Figure 7.

From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato must also be considered. The control design of the fundamental frequency [10] is therefore as shown in Figure 8.

Figure 8: Control design of the fundamental frequency — the fundamental frequency extracted from MIDI undergoes vibrato processing and is combined with high-pass-filtered Gaussian white noise to output the final fundamental-frequency curve.
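The pipeline in Figure 8 (note F0 from MIDI, vibrato processing, addition of high-pass-filtered Gaussian white noise) can be sketched as follows; the vibrato rate/depth, the filter order and cutoff, and the noise level are illustrative values, not the paper's settings.

```python
import numpy as np
from scipy.signal import butter, lfilter

def note_f0_contour(note_hz, dur_s, fs=200, vib_rate=5.5, vib_cents=30, noise_db=-40):
    """Turn a constant note pitch into a more natural F0 curve (cf. Figure 8)."""
    t = np.arange(int(dur_s * fs)) / fs
    # vibrato: sinusoidal modulation of the pitch, expressed in cents
    f0 = note_hz * 2 ** (vib_cents / 1200 * np.sin(2 * np.pi * vib_rate * t))
    # high-pass-filtered Gaussian white noise adds fine, jitter-like fluctuation
    b, a = butter(2, 0.3, btype="highpass")          # cutoff relative to Nyquist
    jitter = lfilter(b, a, np.random.randn(len(t)))
    f0 *= 2 ** (10 ** (noise_db / 20) * jitter)      # tiny multiplicative perturbation
    return t, f0
```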

4.2. Time Control Model. Each word in Chinese usually consists of syllables, and the initials and finals in each syllable play different roles. Whether in ordinary speech or Beijing Opera, the initial usually plays a supporting role, while the final carries the pitch and most of the pitch information. To ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word, with the duration rules shown in Table 2.

Table 2: Duration parameters.

Before modification    After modification
dur_a                  k * dur_a
dur_b                  dur_b
dur_c                  dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: duration of the initial-to-final transition; dur_c: final part duration; dur_t: target total duration.

The duration of the initial part is scaled by a ratio [11] (k in the table) obtained from extensive comparison experiments between speech and singing. The duration of the transition region from initial to final remains unchanged, and the length of the final section varies so that the total duration of the syllable matches the duration of the corresponding note in the score.
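A small sketch of the duration rule in Table 2; the variable names mirror the table, and the value of k is not given in the paper, so the default below is only a placeholder.

```python
def apply_duration_rule(dur_a, dur_b, dur_t, k=0.8):
    """Map spoken-syllable segment durations to a target note duration (Table 2).

    dur_a: initial duration, dur_b: initial-to-final transition duration,
    dur_t: target total duration from the score. The original final duration
    dur_c is replaced entirely, so it is not needed as an input.
    """
    new_a = k * dur_a                 # initial is scaled by the ratio k
    new_b = dur_b                     # transition region is kept unchanged
    new_c = dur_t - new_a - new_b     # final absorbs the remaining time
    if new_c < 0:
        raise ValueError("target note duration too short for this syllable")
    return new_a, new_b, new_c
```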

The method for locating the initial/final boundaries was introduced in Section 2 and is not repeated here.

4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that well-trained singing has a distinctive resonance peak in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect how listeners perceive the performance. To synthesize singing of high naturalness, the spectral envelope of the speech signal is therefore usually corrected according to the unique spectral characteristics of the singing voice.

4.4. GAN Model

4.4.1. Introduction to GAN. Generative adversarial networks (GANs) [12-17] are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. During training, random noise is input and the generator produces pseudo data; a portion of real data is mixed with it and sent to the discriminator, which gives a true-or-false decision, and the loss is returned according to this result. The purpose of a GAN is to estimate the underlying distribution of the data samples and to generate new samples. GANs are being studied extensively in image and visual computing and in speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize the music that accompanies Beijing Opera.

4.4.2. Selection of the Training Dataset. The Beijing Opera score dataset used in this study is a collection of 5000 recorded Beijing Opera background music tracks. The dataset is processed as shown in Figure 9 (dataset preparation and data preprocessing).

First, because some instruments sometimes play only a few notes in a piece, the data become too sparse, which hurts training. This imbalance is addressed by merging the tracks of similar instruments: each multi-track Beijing Opera score is merged into five instrument tracks, namely huqins, flutes, suonas, drums, and cymbals, the five types of instrument most commonly used in Beijing Opera music.

Then the merged dataset is filtered to keep the pieces with the best matching confidence. In addition, because Beijing Opera arias are to be synthesized, score sections without lyrics are not needed, so only the soundtracks that accompany Beijing Opera lyrics are selected.

Finally, to obtain meaningful music segments for training the temporal model, the Beijing Opera scores are divided into segments: four bars are treated as one phrase, and longer passages are cut to this length. Because pitches that are too high or too low are uncommon, notes below C1 or above C8 are discarded, giving a target output tensor of 4 (bars) x 96 (time steps) x 84 (pitches) x 5 (tracks). This completes the preparation and preprocessing of the dataset.
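A minimal sketch of this segmentation step, assuming the score has already been rendered as a piano-roll array of shape (5 tracks, time steps, 128 MIDI pitches); the helper names and the 96-steps-per-bar resolution are our assumptions for illustration.

```python
import numpy as np

C1, C8 = 24, 108            # MIDI note numbers; the 84 pitches in between are kept
STEPS_PER_BAR = 96          # assumed time resolution per bar
BARS_PER_PHRASE = 4

def to_phrases(pianoroll):
    """Cut a piano-roll (5, time, 128) into phrases of shape (4, 96, 84, 5),
    matching the target tensor described in the text."""
    tracks, total_steps, _ = pianoroll.shape
    cropped = pianoroll[:, :, C1:C8]                      # keep the 84 central pitches
    phrase_len = BARS_PER_PHRASE * STEPS_PER_BAR          # 384 steps per phrase
    phrases = []
    for p in range(total_steps // phrase_len):
        chunk = cropped[:, p * phrase_len:(p + 1) * phrase_len, :]    # (5, 384, 84)
        chunk = chunk.transpose(1, 2, 0)                               # (384, 84, 5)
        phrases.append(chunk.reshape(BARS_PER_PHRASE, STEPS_PER_BAR, 84, tracks))
    return (np.stack(phrases) if phrases
            else np.empty((0, BARS_PER_PHRASE, STEPS_PER_BAR, 84, tracks)))
```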

4.4.3. GAN Structure and Dataset Training and Testing. The GAN structure used in this study is shown in Figure 10.

The basic GAN framework comprises a pair of models, a generative model and a discriminative model. The main purpose is for the generator G, aided by the discriminator D, to produce pseudo data consistent with the true data distribution. The model input is a random Gaussian white-noise signal z, which the generator G maps to a new data space to produce the generated data G(z). A discriminator D then outputs a probability value for the real data x and for the generated data G(z), expressing D's confidence that its input is real rather than generated and thus indicating how good the data generated by G are. When D can no longer distinguish the real data x from the generated data G(z), the generator G is considered optimal.

Figure 9: Dataset preparation and preprocessing — Beijing Opera soundtracks are merged into five tracks (huqins, flutes, suonas, drums, cymbals), the merged dataset is screened for the best matching confidence and for soundtracks that accompany Beijing Opera lyrics, and the cleaned data form the training dataset.

Figure 10: GAN structure — random noise z ~ p(z) is mapped by the generator G to fake data G(z) (4-bar phrases of 5 tracks), which together with real data x are judged real or fake by the discriminator/critic (WGAN-GP).

The goal of D is to distinguish real data from fake data, making D(x) as large as possible and D(G(z)) as small as possible, i.e., making the gap between the two as large as possible. The goal of G, on the contrary, is to make the score D(G(z)) of its generated data consistent with the score D(x) of the real data, so that D cannot distinguish generated data from real data. The optimization is therefore a process of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) finally matches D(x) on real data and neither G nor D can be further optimized.

The training process can be modeled as a simple MinMax problem:

\min_G \max_D\ D(x) - D(G(z))   (42)

The MinMax optimization objective is defined as

\min_G \max_D V(D, G) = E_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (43)

A GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but samples from it directly. In theory it can approximate the real data distribution arbitrarily well, which is its biggest advantage.

The training and testing process of the GAN-generated music is shown in Figure 11.

Generator-produced chord-section data together with data for a specific musical style, and generator-produced multi-track chord sections together with multi-track groove data, are sent to the GAN for training, so that the model learns to generate music with the specified style and the corresponding groove.

5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experiments of this paper was recorded in a fully anechoic room; with the factors discussed above taken into account, it meets the practical needs of the voice conversion system. The library was recorded by a woman speaking standard Mandarin and contains numbers, technical nouns, everyday words, etc., used as the source speech. Another person then recorded a small number of utterances as the voice to be converted. Figure 12 shows the tone conversion process.

5.1.2. Experimental Results. Figures 13, 14, and 15 show, respectively, the source speaker's speech, the target speaker's speech, and the speech converted with STRAIGHT and the GMM model. All voices are sampled at 16 kHz and quantized with 16 bits, and each utterance is set to 5 s in the experiment.

They show three-dimensional maps of the MFCC features of the source speech, the target speech, and the converted speech: the horizontal axis represents time, the vertical axis frequency, and the color the corresponding energy. Comparing the figures, it can be seen directly that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.


Figure 11: Training and testing process of the GAN — noise vectors z are fed to a bar generator, conditioned on chords, style, and groove, to produce the generated bars G(z).

Figure 12: Tone control model — in the training phase, STRAIGHT analysis extracts the fundamental frequency F0 and spectral envelope of the source and target voices, MFCC parameters are time-aligned with DTW, and GMM training establishes the mapping rules (with a single-Gaussian mean/variance model for F0); in the conversion phase, the voice to be converted is analyzed with STRAIGHT, its MFCC parameters and F0 are converted, and STRAIGHT synthesis produces the tone-converted voice.

5.2. Melody Control Model

5.2.1. Experimental Process. To evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and converted using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Opera produced by the four synthesis methods was compared with the original. Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental-frequency control model together with the duration control model; dura_SP uses the duration control model together with the spectrum control model; all_models uses all three control models simultaneously; and "Real" denotes the source Beijing Opera.

The melody control model can thus be summarized as in Figure 16.

5.2.2. Experimental Results. The purpose of voice conversion is to make the converted speech sound like the speech of a specific target person, so the performance of a voice conversion system is evaluated by human auditory judgment. In the existing subjective evaluation methodology, the MOS test is an effective method for evaluating voice quality, and the similarity test is used to judge the conversion effect of the system.

Table 3: MOS grading.

Score   Evaluation
1       Uncomfortable and unbearable
2       There is a sense of discomfort, but it can be endured
3       Distortion can be detected and feels uncomfortable
4       Slightly perceptible distortion, but no discomfort
5       Good sound quality, no distortion

Table 4: Experimental results (MOS scores).

Method       Beijing Opera 1   Beijing Opera 2   Beijing Opera 3
Only_dura    1.25              1.29              1.02
dura_F0      1.85              1.97              1.74
dura_SP      1.78              2.90              2.44
all_models   3.27              3.69              3.28
real         5                 5                 5

Figure 13: Source speech spectrogram (frequency in Hz versus time).

The MOS scoring criterion divides speech quality into 5 levels (see Table 3). The tester listens to the converted speech and assigns the score of the quality level to which it belongs. A MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but it does not prevent normal conversation. A MOS score below 3.0 is called synthetic speech quality: the speech is highly intelligible but its naturalness is poor.

Ten testers were asked to give MOS scores for the above synthesis results; the results are shown in Table 4.

5.3. Synthesis of Beijing Opera. Beijing Opera consists mainly of words and melody, and the melody is determined by pitch, tone, sound intensity, duration, and other factors. In this experiment each word is first segmented using its distinctive characteristics, such as zero-crossing rate and energy. The tone control model and the melody control model are then applied: the important parameters such as fundamental frequency, spectrum, and duration are extracted and converted using MFCC, DTW, GMM, and other tools, finally yielding the synthesized opera fragments.

Compared with other algorithms, the straight algorithm performs better in terms of naturalness of the synthesis and the range over which parameters can be modified, so it is also chosen for the final synthesis of the Beijing Opera.

The same 10 testers were again asked to give MOS scores for the final synthesis; the results are shown in Table 5.

According to the test results, the subjective scores for the synthetic fragments reach an average of 3.7 points ((4+4+4+3+3+4+3+4+4+4)/10 = 3.7 in Table 5), indicating that the design basically accomplishes Beijing Opera synthesis. Although the output of the synthesis system clearly tends toward real Beijing Opera, it still differs from it acoustically.

6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric for the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.

Table 5: Rating results (MOS scores).

                            student1   student2   student3   student4   student5
Source Opera fragment       5          5          5          5          5
Synthetic Opera fragment    4          4          4          3          3

                            student6   student7   student8   student9   student10
Source Opera fragment       5          5          5          5          5
Synthetic Opera fragment    4          3          4          4          4

Figure 14: Target speech spectrogram (frequency in Hz versus time).

Figure 15: Converted speech spectrogram (frequency in Hz versus time).

Figure 16: Melody control model — feature extraction from the voice yields the syllable fundamental frequency, spectrum envelope, and syllable duration, while MIDI provides the note fundamental frequency and note length; these drive the F0 control model, the spectrum control model, and the time-length control model, whose outputs are synthesized into Beijing Opera.

Data Availability

The .wav data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm used here is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html; its code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.

[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.

[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.

[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.

[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.

[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.

[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.

[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.

[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.

[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.

[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.

[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.

[15] X. Chen et al., "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets."

[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.

[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 3: Beijing Opera Synthesis Based on Straight Algorithm and ... · AdvancesinMultimedia Target voice (A tone A content) New target voice (A tone B content) Beijing Opera fragment Source

Advances in Multimedia 3

Table 1 The correlations of subjective and objective amount of speech

Objective amount Subjective amountpitch volume tone duration

fundamental frequency +++ + ++ +amplitude + +++ + +spectral envelope ++ + +++ +time + + + +++

Relevance is positively related to the number of lsquo+rsquo s

Tone the frequency performance of different soundsalways has distinctive characteristics in waveforms Forexample different Beijing Opera characters sing the samepassage according to the difference between the two timbres

By combining the subjective amount of speech with theobjective amount we have analyzed the correlations can beobtained in Table 1

Acoustic characteristics of speech signal are an indis-pensable research object for speech analysis and speechtransformation It mainly displays prosody and spectrumProsody perceives performance as pitch duration and vol-ume Acoustically the rhythm corresponds to the fundamen-tal frequency duration and amplitudeThe spectral envelopeis perceived as a tonal characteristic

222 MFCC Feature Extraction MFCC is an acronym forMel Frequency Cepstrum Coefficient (MFCC) which isbased on human auditory properties and is nonlinearlyrelated to Hz frequency The Mel Frequency Cepstral Coef-ficients (MFCCs) use this relationship between them tocalculate the resulting Hz spectral signature Its extractionprinciple is as follows

(1) Pre-Emphasis Pre-emphasis processing is to pass thespeech signal through a high-pass filter as

119867(z) = 1 minus 120583zminus1 (1)

The value of 120583 is between 09 and 10 we usually take 097The purpose of pre-emphasis is to raise the high-frequencypart flatten the spectrum of the signal and keep it in thewhole low-to-high frequency band with the same signal-to-noise ratio At the same time it is also to eliminate the vocalcords and lips in the process of occurrence to compensate forthe voice signal suppressed by the high frequency part of thesystem but also to highlight the high-frequency formant

(2) Framing. The first N sampling points are grouped into one observation unit, known as a frame. Under normal circumstances the value of N is 256 or 512, covering about 20-30 ms. In order to avoid excessive change between two adjacent frames, there is an overlapping area between them; the overlap contains M sampling points, and the value of M is usually about 1/2 or 1/3 of N. The sampling frequency of speech signals used in speech recognition [5] is usually 8 kHz or 16 kHz. For 8 kHz, if the frame length is 256 samples, the corresponding time length is (256/8000) × 1000 = 32 ms.

(3) Windowing (Hamming Window). Each frame is multiplied by a Hamming window to increase continuity at the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame size, the windowed signal is S'(n) = S(n) × W(n). The form of W(n) is

W(n, a) = (1 - a) - a \cos\left[\frac{2\pi n}{N - 1}\right], \quad 0 \le n \le N - 1   (2)

Different values of 'a' produce different Hamming windows; in general, 'a' is taken as 0.46, giving

s'_n = \left[0.54 - 0.46 \cos\left(\frac{2\pi (n - 1)}{N - 1}\right)\right] \cdot s_n   (3)
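A minimal sketch of the framing and Hamming-window steps, assuming a 256-sample frame and a 128-sample hop (an overlap of 1/2 of N, within the range quoted above); the input is assumed to be at least one frame long.

```python
import numpy as np

def frame_and_window(signal, frame_len=256, hop=128):
    """Split the signal into overlapping frames and apply a Hamming window (a = 0.46)."""
    n_frames = 1 + (len(signal) - frame_len) // hop          # assumes len(signal) >= frame_len
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window                                    # each row is one windowed frame
```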

(4) Fast Fourier Transform. Since the characteristics of a signal are usually difficult to see in the time domain, the signal is usually converted to the frequency domain to observe its energy distribution; different energy distributions represent different voice characteristics. Therefore, after multiplication by the Hamming window, each frame must also undergo a fast Fourier transform to obtain its spectrum, and the magnitude of the spectrum is then squared to obtain the power spectrum of the speech signal. The DFT of the voice signal is

X_a(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N   (4)

where x(n) is the input speech signal and N is the number of points in the Fourier transform.

(5) Triangular Bandpass Filter. The energy spectrum is passed through a set of Mel-scale triangular filter banks; a filter bank with M filters is defined (the number of filters is similar to the number of critical bands). The filters are triangular, and M usually takes 22-26. The spacing between the center frequencies f(m) decreases as 'm' decreases and broadens as m increases, as shown in Figure 3.

The frequency response of the triangular filter is defined as


H_m(k) =
\begin{cases}
0 & k < f(m-1) \\
\dfrac{2\,(k - f(m-1))}{[f(m+1) - f(m-1)]\,[f(m) - f(m-1)]} & f(m-1) \le k \le f(m) \\
\dfrac{2\,(f(m+1) - k)}{[f(m+1) - f(m-1)]\,[f(m+1) - f(m)]} & f(m) \le k \le f(m+1) \\
0 & k \ge f(m+1)
\end{cases}   (5)

(6) Logarithmic Filter-Bank Energies. The logarithmic energy output of each filter bank is calculated as

s(m) = \ln\left[\sum_{k=0}^{N-1} |X_a(k)|^{2}\, H_m(k)\right], \quad 0 \le m \le M   (6)

(7) Discrete Cosine Transform. The MFCC coefficients are obtained by the discrete cosine transform (DCT):

C(n) = \sum_{m=0}^{M-1} s(m) \cos\left[\frac{\pi n (m - 0.5)}{M}\right], \quad n = 1, 2, \dots, L   (7)

The logarithmic energies above are passed through the DCT to obtain the L-order Mel-scale cepstral parameters; the order L of the MFCC coefficients is usually 12-16. Here M is the number of triangular filters.
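To make steps (4)-(7) concrete, the following sketch computes the power spectrum of the windowed frames, passes it through a triangular mel filter bank, takes the logarithm, and applies the DCT. The filter count M = 26 and order L = 13 are example values within the ranges quoted above.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_frames(frames, fs=8000, n_fft=256, n_filters=26, n_ceps=13):
    """frames: windowed frames (n_frames x frame_len). Returns n_frames x n_ceps MFCCs."""
    # (4) power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # (5) triangular mel filter bank between 0 Hz and fs/2
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[m - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # (6) logarithmic filter-bank energies
    log_e = np.log(np.maximum(power @ fbank.T, 1e-10))
    # (7) DCT, keeping the first n_ceps coefficients
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]
```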

(8) Logarithmic Energy. In addition, the volume (i.e., energy) of a frame is also an important feature of speech and is very easy to calculate. Therefore, the logarithmic energy of each frame is usually added, so that the basic speech features of each frame have one more dimension, comprising one logarithmic energy value and the remaining cepstral parameters.

(9) Dynamic Parameter Extraction (First-Order and Second-Order Differences). The standard cepstral parameters (MFCC) only reflect the static characteristics of the speech; the dynamic characteristics can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features can effectively improve the system's recognition performance. The difference parameters can be calculated with the following formula:

d_t =
\begin{cases}
C_{t+1} - C_t & t < K \\
\dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{2\sum_{k=1}^{K} k^{2}} & \text{otherwise} \\
C_t - C_{t-1} & t \ge Q - K
\end{cases}   (8)

where d_t is the t-th first-order difference, C_t is the t-th cepstral coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be 1 or 2. Substituting the result back into the above equation yields the second-order difference parameters.
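A sketch of the first-order difference in (8) for K = 2, interpreting the edge conditions in terms of the frame index: simple differences at the two ends and the weighted regression elsewhere. Second-order deltas are obtained by applying the same function to its own output.

```python
import numpy as np

def delta(ceps, K=2):
    """First-order difference of cepstral features (frames x coefficients), per (8)."""
    T = len(ceps)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(ceps)
    for t in range(T):
        if t < K:                       # leading edge: simple forward difference
            d[t] = ceps[t + 1] - ceps[t]
        elif t >= T - K:                # trailing edge: simple backward difference
            d[t] = ceps[t] - ceps[t - 1]
        else:                           # interior: weighted regression over +/- K frames
            d[t] = sum(k * (ceps[t + k] - ceps[t - k]) for k in range(1, K + 1)) / denom
    return d
```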

2.3. Signal Characteristics Analysis. According to previous research on speech signal processing technology, signal analysis mainly focuses on two methods: analysis in the time domain and analysis in the frequency domain.

2.3.1. Time-Domain Analysis. In the time domain, the horizontal axis is time and the vertical axis is amplitude. By observing the waveform in the time domain, we can obtain some important features of the speech signal, such as the duration, the starting and ending positions of the syllables, the sound intensity (energy), and the vowels (see Figure 4).

2.3.2. Frequency-Domain Analysis. This includes the spectrum, power spectrum, cepstrum, spectral envelope, and so on of the voice signal. It is generally considered that the frequency spectrum of the speech signal is the product of the frequency response of the vocal tract (channel) system and the spectrum of the excitation source, while both the frequency response of the channel system and the excitation source are time-varying. Therefore, frequency-domain analysis of speech signals is often performed using the short-time Fourier transform (STFT), defined as

X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} x(m)\, w(n - m)\, e^{-j\omega m}   (9)

The study of the Chinese song synthesis algorithm is based on parameter modification. From (9) we can see that the short-time Fourier transform has two independent variables (n and ω), so it is both a discrete function of time n and a continuous function of angular frequency. In the formula, w(n) is a window function; different values of n select different short speech segments, which is where the subscript n differs from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics:

(1) High frequency resolution: the main lobe is narrow and sharp.

(2) Large side-lobe attenuation, so that the spectrum leakage caused by other frequency components is small. These two conditions are in fact contradictory and cannot be satisfied at the same time; therefore a compromise is usually adopted, and a Hamming window is often chosen.
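A short sketch of (9) using SciPy's STFT with a Hamming window, matching the compromise window choice discussed above; the sampling rate, segment length, and test signal are example values only.

```python
import numpy as np
from scipy.signal import stft

fs = 8000                                    # example sampling rate
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)              # placeholder signal instead of real speech

# Short-time Fourier transform with a 256-point Hamming window and 50% overlap
f, times, Zxx = stft(x, fs=fs, window='hamming', nperseg=256, noverlap=128)
print(Zxx.shape)                             # (frequency bins, time frames)
```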

However, both time-domain analysis and frequency-domain analysis have their own limitations: time-domain analysis does not give an intuitive view of the frequency characteristics of the speech signal, while frequency-domain analysis does not show how the speech signal varies over time. As a result, the Beijing Opera synthesis experiment analyzed the speech signal using the improved method described next, spectrum (spectrogram) analysis.

Figure 3: Mel frequency filter bank.

Figure 4: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" time-domain diagram (amplitude versus time in seconds).

2.3.3. Spectrum Analysis. The Fourier analysis display of the speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional spectrum that represents the frequency spectrum of a voice over time, with the vertical axis as frequency and the horizontal axis as time. The intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the speech sentence. It combines the characteristics of spectra and time-domain waveforms to show clearly how the speech spectrum changes over time; in other words, it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.

2.4. Straight Algorithm Introduction. Straight is an acronym for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram". It is a relatively accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The straight algorithm builds on the source-filter model: the source comes from the vibration of the vocal cords, and the filter refers to the vocal tract (channel) transfer function. The algorithm adaptively interpolates and smooths the short-time spectrum of speech in the time domain and the frequency domain, so that the spectral envelope can be extracted more accurately and the speech duration, fundamental frequency, and spectral parameters can be adjusted over a wide range without affecting the quality of the synthesized speech. The straight analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral parameter estimation, and speech synthesis. The first two are described in detail below; only the synthesis process is described here, as shown in Figure 6.

Figure 5: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" spectrogram (frequency in Hz versus time).

First, the speech signal is input; the fundamental frequency F0 and the spectral envelope are extracted by the straight algorithm, and the parameters are modulated to generate a new sound source and a time-varying filter. According to the source-filter model, we use (10) to synthesize the voice:

y(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, v_{t_i}(t - T(t_i))   (10)

where v_{t_i}(t) and T(t_i) are given by

v_{t_i}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} V(\omega, t_i)\, \varphi(\omega)\, e^{j\omega t}\, d\omega   (11)

T(t_i) = \sum_{t_k \in Q,\ k < i} \frac{1}{\sqrt{G(f_0(t_k))}}   (12)

In these formulas, Q represents the positions of a group of samples in the synthesis excitation and G represents the pitch modulation. The F0 after modulation can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. V(ω, t_i) is the Fourier transform of the corresponding minimum-phase pulse, as in (13); A{S[u(ω), r(t)], u(ω), r(t)} is calculated from the modulated amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15):

V(\omega, t) = \exp\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} h_t(q)\, e^{j\omega q}\, dq\right]   (13)

h_t(q) =
\begin{cases}
0 & (q < 0) \\
c_t(0) & (q = 0) \\
2 c_t(q) & (q > 0)
\end{cases}   (14)

c_t(q) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-j\omega q}\, \lg A\{S[u(\omega), r(t)],\, u(\omega),\, r(t)\}\, d\omega   (15)

Here q is the frequency variable. Straight audiometric experiments show that, even with high-sensitivity headphones, the synthesized speech signal is almost indistinguishable from the original signal.

Figure 6: Straight synthesis system (the fundamental frequency F0 and the spectral envelope are extracted from the input voice, the parameters are adjusted, and the resulting sound source drives a time-varying filter that outputs the synthetic speech).
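The STRAIGHT toolkit itself is distributed as MATLAB code (see Data Availability). As a rough Python analogue of the same analysis-modification-synthesis workflow, the sketch below uses the WORLD vocoder via the pyworld package, which likewise decomposes speech into F0, a spectral envelope, and an excitation component; this is an illustrative substitute, not the algorithm used in this paper, and the file names are hypothetical.

```python
import numpy as np
import soundfile as sf     # assumed available for reading/writing wav files
import pyworld as pw       # WORLD vocoder, used here as a stand-in for STRAIGHT

x, fs = sf.read('input.wav')            # hypothetical mono input file
x = x.astype(np.float64)

# Analysis: fundamental frequency, spectral envelope, and aperiodicity
f0, t = pw.dio(x, fs)                   # coarse F0 estimation
f0 = pw.stonemask(x, f0, t, fs)         # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)        # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)               # aperiodicity (excitation information)

# Parameter adjustment, e.g. raising the pitch by four semitones
f0_mod = f0 * 2 ** (4 / 12)

# Synthesis with the modified parameters
y = pw.synthesize(f0_mod, sp, ap, fs)
sf.write('output.wav', y, fs)
```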

3. Tone Control Model

Voice tone (timbre) conversion refers to voice signal processing that keeps the semantic content unchanged while changing only the timbre, so that the voice signal of one person (the source voice), after conversion, sounds like the voice of another person (the target voice). This chapter introduces the extraction, with the straight algorithm, of the parameters that are closely related to the timbre; a GMM is then trained on the extracted parameters to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are resynthesized with straight to achieve voice conversion. As can be seen from Section 2, the timbre characteristics of speech mainly correspond to the parameters "fundamental frequency F0" and "channel spectrum".

3.1. Fundamental Frequency and Channel Spectrum Extraction

3.1.1. Extraction of the Fundamental Frequency. The straight algorithm provides good time-domain resolution of the fundamental frequency trajectory. It is based on wavelet analysis: the fundamental component is first located in the extracted audio band, and its instantaneous frequency is then calculated as the fundamental frequency.

The extraction of the fundamental frequency can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning of F0 refers to applying a wavelet transform to the voice signal to obtain the wavelet coefficients, which are then converted into a set of instantaneous frequencies from which F0 is selected for each frame. F0 trajectory smoothing selects, based on the calculated high-frequency energy ratio, the instantaneous frequency with the minimum equivalent noise energy as the most likely F0, thus constructing a smooth pitch trajectory. F0 fine positioning fine-tunes the current F0 through an FFT. The process is as follows.

The input signal is s(t) and the output composite signal is D(t, τ_c), where g_AG(t) is the analyzing wavelet, obtained by passing the input signal through a Gabor filter, and τ_c is the analysis period of the analyzing wavelet:

D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(\mu)\, g_{AG}\!\left(\frac{t - \mu}{\tau_c}\right) d\mu   (16)

g_AG(t) is given by (17), with g(t) as in (18):

g_{AG}(t) = g\left(t - \frac{1}{4}\right) - g\left(t + \frac{1}{4}\right)   (17)

g(t) = e^{-\pi (t/\eta)^{2}}\, e^{-j 2\pi t}   (18)

Here η is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.

Through this calculation, the variable "fundamentalness" is introduced, denoted by M(t, τ_0):

M = -\log\left[\int_{\Omega} \left(\frac{d|D|}{du}\right)^{2} du\right] + \log\left[\int_{\Omega} |D|^{2}\, du\right] - \log\left[\int_{\Omega} \left(\frac{d\,\arg(D)}{du}\right)^{2} du\right] + 2\log\tau_0 + \log\Omega(\tau_0)   (19)

The first term is the amplitude modulation (AM) value; the second term is the total energy, used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency, used to normalize the FM value; the fifth term is the normalization factor of the time-domain integration interval. From the formula it follows that when the AM and FM terms take their minimum values, M takes its maximum value, which identifies the fundamental component.


However, in practice F0 always changes rapidly, so in order to reduce the impact on M, the formula is adjusted as in (20), (21), and (22):

M = -\log\left[\int_{\Omega} \left(\frac{d|D|}{du} - \mu_{AM}\right)^{2} du\right] + \log\left[\int_{\Omega} |D|^{2}\, du\right] - \log\left[\int_{\Omega} \left(\frac{d\,\arg(D)}{du} - \mu_{FM}\right)^{2} du\right] + 2\log\tau_0 + \log\Omega(\tau_0)   (20)

\mu_{AM} = \frac{1}{\Omega} \int_{\Omega} \frac{d|D|}{du}\, du   (21)

\mu_{FM} = \frac{1}{\Omega} \int_{\Omega} \frac{d^{2}\arg(D)}{du^{2}}\, du   (22)

Finally, τ_0 is used to calculate the instantaneous frequency ω(t), and the fundamental frequency F0 is obtained by (23), (24), and (25):

f_0 = \frac{\omega_0(t)}{2\pi}   (23)

\omega(t) = 2 f_s \arcsin\frac{|y_d(t)|}{2}   (24)

y_d(t) = \frac{D(t + \Delta t/2,\ \tau_0)}{|D(t + \Delta t/2,\ \tau_0)|} - \frac{D(t - \Delta t/2,\ \tau_0)}{|D(t - \Delta t/2,\ \tau_0)|}   (25)

3.1.2. Channel Spectral Parameter Extraction. The previous approach was to extract the sound source information and the channel spectrum information from the voice and then adjust them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which degrades the final result.

The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is

s(t) = p(t) * v(t)   (26)

Since it is difficult to find v(t) directly, the straight algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated by (27) and (28):

s_w(t, t') = s(t)\, w(t, t')   (27)

S_W(\omega, t') = \mathrm{FFT}[s_w(t, t')] = S(\omega, t')\, W(\omega, t')   (28)

The short-time spectrum shows periodicity related to the fundamental frequency in the time domain and in the frequency domain, respectively. The short-time spectrum window functions used are (29) and (30):

w(t) = \frac{1}{f_0}\, e^{-\pi (t f_0)^{2}}   (29)

W(\omega) = f_0 \sqrt{2\pi}\, e^{-\pi (\omega/\omega_0)^{2}}   (30)

However, since both the channel spectrum and the sound source spectrum are still related to the fundamental frequency at this point, they cannot yet be considered separated. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.

Removing the periodicity in the time domain requires designing a pitch-synchronous smoothing window and a compensation window, as in (31), (32), and (33):

w_p(t) = e^{-\pi (t/\tau_0)^{2}} * h\!\left(\frac{t}{\tau_0}\right)   (31)

h(t) =
\begin{cases}
1 - |t| & (|t| < 1) \\
0 & (\text{otherwise})
\end{cases}   (32)

w_c(t) = w_p(t)\, \sin\!\left(\frac{\pi t}{\tau_0}\right)   (33)

Then the short-time amplitude spectra |S_p(ω, t')| and |S_c(ω, t')| are obtained with the two windows, and finally the short-time amplitude spectrum with the periodicity removed is obtained as

|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^{2} + \xi\, |S_c(\omega, t')|^{2}}   (34)

Here ξ is the mixing factor; when ξ = 0.13655, the optimal solution is obtained.

Similarly, the frequency domain also requires a smoothing window V(ω) and a compensation window U(ω) to remove the periodicity of the short-time spectrum S_W(ω), finally yielding the spectral envelope S_S'(ω) with the periodicity removed:

S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega)   (35)

Finally, logarithmic amplitude compression and a frequency-warped discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).

3.2. Parameter Conversion with GMM

3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions:

P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i\, b_i(X)   (36)


where X is an n-dimensional random vector, ω_i is a mixture weight with \sum_{i=1}^{M} \omega_i = 1, and b_i(X) is a component distribution of the GMM; each component is a Gaussian distribution:

b_i(X) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(X - \mu_i)^{T} \Sigma_i^{-1} (X - \mu_i)\right]   (37)

where μ_i is the mean vector and Σ_i is the covariance matrix.

Although the types of phonemes are fixed, each phoneme varies in different situations because of its context. We use the GMM to model the acoustic characteristics of the speaker so as to find the most likely mapping at each time.
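As an illustration of fitting such a mixture to speaker features, here is a minimal sketch with scikit-learn's GaussianMixture; the MFCC matrix and the component count are placeholder values, not the settings used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder feature matrix: 1000 frames of 13-dimensional MFCCs
mfcc_frames = np.random.randn(1000, 13)

# Fit an M-component GMM with full covariance matrices, as in (36)-(37)
gmm = GaussianMixture(n_components=8, covariance_type='full', max_iter=200)
gmm.fit(mfcc_frames)

# Per-frame log-likelihood under the fitted model
log_likelihood = gmm.score_samples(mfcc_frames)
```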

3.2.2. Establishing the Conversion Function. GMM training refers to estimating the probability density distribution of the samples; the estimated model is a weighted sum of several Gaussian components. It maps the feature matrices of the source speech to those of the target speech, thereby increasing the accuracy and robustness of the algorithm and establishing the connection between the two voices.

(1) Conversion of the Fundamental Frequency. Here the single Gaussian model method is used to convert the fundamental frequency. The converted fundamental frequency is obtained from the mean and variance of the target speaker (μ_tgt, σ_tgt) and of the source speaker (μ_src, σ_src):

f_{0,\mathrm{conv}}(t) = \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\, f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\, \mu_{src}   (38)
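A small sketch of (38): the target and source statistics are estimated from the voiced frames (F0 > 0) of example contours, which here are only placeholder values rather than real STRAIGHT F0 tracks.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Single-Gaussian F0 conversion following (38); unvoiced frames (F0 = 0) are kept."""
    ratio = sigma_tgt / sigma_src
    converted = ratio * f0_src + mu_tgt - ratio * mu_src
    return np.where(f0_src > 0, converted, 0.0)

# Placeholder contours; in practice these come from STRAIGHT F0 extraction
f0_source = np.array([180.0, 185.0, 0.0, 190.0, 200.0])
f0_target = np.array([220.0, 230.0, 0.0, 240.0, 250.0])

mu_s, sd_s = f0_source[f0_source > 0].mean(), f0_source[f0_source > 0].std()
mu_t, sd_t = f0_target[f0_target > 0].mean(), f0_target[f0_target > 0].std()
f0_converted = convert_f0(f0_source, mu_s, sd_s, mu_t, sd_t)
```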

(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as

F(X) = E[Y \mid X] = \int Y\, P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X)\left[\mu_i^{Y} + \Sigma_i^{YX}\left(\Sigma_i^{XX}\right)^{-1}\left(X - \mu_i^{X}\right)\right]   (39)

P_i(X) = \frac{\omega_i\, b_i(X_t)}{\sum_{k=1}^{M} \omega_k\, b_k(X_t)}   (40)

\mu_i = \begin{bmatrix} \mu_i^{X} \\ \mu_i^{Y} \end{bmatrix}, \quad
\Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \quad i = 1, \dots, M   (41)

Here μ_i^X and μ_i^Y are the means of the i-th Gaussian component for the source speaker and the target speaker, Σ_i^{XX} is the covariance matrix of the i-th Gaussian component of the source speaker, Σ_i^{YX} is the cross-covariance matrix of the i-th Gaussian component between the target speaker and the source speaker, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" fundamental frequency envelope.
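To make (39)-(41) concrete, the following sketch fits a GMM on stacked source/target MFCC vectors and then applies the regression-style conversion. The feature matrices are placeholders, and fitting a joint-density GMM is one common way to obtain the block means and covariances used above; it is an illustration, not the exact training procedure of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

D = 13                                     # MFCC dimensionality
X = np.random.randn(2000, D)               # placeholder: time-aligned source MFCCs
Y = X + 0.5 * np.random.randn(2000, D)     # placeholder: time-aligned target MFCCs

# Fit a GMM on the joint vectors [X; Y]; its means/covariances contain the blocks of (41)
gmm = GaussianMixture(n_components=8, covariance_type='full').fit(np.hstack([X, Y]))

def convert(x):
    """Map one source frame x to the target space, following (39)-(40)."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    cov_xx = gmm.covariances_[:, :D, :D]
    cov_yx = gmm.covariances_[:, D:, :D]
    # Posterior responsibilities P_i(x), computed from the marginal mixture over X
    resp = np.array([w * multivariate_normal.pdf(x, m, c)
                     for w, m, c in zip(gmm.weights_, mu_x, cov_xx)])
    resp /= resp.sum()
    # Weighted sum of per-component linear regressions, as in (39)
    return sum(resp[i] * (mu_y[i] + cov_yx[i] @ np.linalg.solve(cov_xx[i], x - mu_x[i]))
               for i in range(gmm.n_components))

y_hat = convert(X[0])
```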

4. Melody Control Model

The composition of Beijing Opera has similarities with the synthesis of a general singing voice [7, 8]; that is, through the superimposition of voice and melody, the new pitch of each word is reconstructed. From the analysis in Section 2, the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody: it indicates the vibration frequency of the human vocal folds. The duration, i.e., the length of each word's pronunciation, controls the rhythm of the Beijing Opera and represents the speed of the voice. The energy is positively correlated with the sound intensity and represents the emotion.

4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the prose, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental frequency envelope of a Beijing Opera piece corresponds to the melody, which includes tone, pitch, and tremolo [9]. In the score, however, the pitch of a note is a constant; the comparison is shown in Figure 7.

From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato also need to be considered. Therefore, the control design of the fundamental frequency [10] is as shown in Figure 8.

4.2. Time Control Model. Each word in Chinese usually has several syllables, and the initials and finals (vowels) in each syllable also play different roles. The initials, whether in normal speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word and define the rules for the vowel length shown in Table 2.

Figure 8: The control design of the fundamental frequency (the fundamental frequency extracted from MIDI is combined with vibrato processing and high-pass-filtered Gaussian white noise to output the basic frequency curve).

Table 2: Duration parameters.

Before modification   After modification
dur_a                 k × dur_a
dur_b                 dur_b
dur_c                 dur_t − (k × dur_a) − dur_b

dur_a: initial part duration; dur_b: duration of the transition from the initial part to the vowel; dur_c: final (vowel) part duration; dur_t: target total duration.

The duration of the initial part is modified in proportion [11] (k in the table); k is obtained from a large number of comparison experiments between speech and singing. The duration of the transition region from the initial to the vowel remains unchanged. The length of the vowel section varies so that the total duration of the syllable matches the duration of the corresponding note in the score, as illustrated in the sketch below.
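A small sketch of the duration rule in Table 2; the scaling factor k and the measured durations are placeholder values, not the empirically tuned ones.

```python
def adjust_durations(dur_a, dur_b, dur_t, k=0.8):
    """Apply the Table 2 rule: scale the initial, keep the transition, stretch the vowel."""
    new_a = k * dur_a                      # initial part, scaled by the empirical factor k
    new_b = dur_b                          # initial-to-vowel transition, unchanged
    new_c = dur_t - new_a - new_b          # vowel part absorbs the remaining note duration
    return new_a, new_b, new_c

# Example: a syllable stretched to fill a 1.2 s note (values are hypothetical)
print(adjust_durations(dur_a=0.10, dur_b=0.05, dur_t=1.20))
```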

The method of dividing the vowel boundaries is introduced in Section 2 and will not be repeated here.

4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that a good singing voice exhibits a special resonance peak in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music with high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.

4.4. GAN Model

4.4.1. Introduction of the GAN Network. Generative adversarial networks, abbreviated as GAN [12-17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process inputs random noise, obtains pseudo data from the generator, takes a portion of the real data, mixes the two, and sends them to the discriminator, which gives a true-or-false determination; the loss is returned according to this result. The purpose of the GAN is to estimate the potential distribution of the data samples and to generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize the music that accompanies the Beijing Opera.

4.4.2. Selection of the Test Dataset. The Beijing Opera score dataset used in this study is a recorded collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.

First of all, because some instruments sometimes have only a few notes in a piece of music, the data can become too sparse, which affects the training process. Therefore, this data imbalance problem is solved by merging the sound tracks of similar instruments. Each of the multitrack Beijing Opera scores is merged into five musical instruments: huqins, flutes, suonas, drums, and cymbals. These five types of instruments are the most commonly used musical instruments in Beijing Opera music.

Then the merged datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the parts of the scores without lyrics are not what we need, so only the soundtracks that carry Beijing Opera lyrics are selected.

Finally, in order to obtain meaningful music segments for training the temporal model, it is necessary to divide the Beijing Opera scores into corresponding music segments. Four bars are treated as one phrase, and longer passages are cut to this length. Because pitches that are too high or too low are uncommon, notes lower than C1 or higher than C8 are discarded, so the target output tensor is 4 (bars) × 96 (time steps) × 84 (pitches) × 5 (tracks). This completes the preparation and preprocessing of the dataset; a sketch of building such a tensor is given below.
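A sketch of assembling one training phrase into the 4 × 96 × 84 × 5 piano-roll tensor described above. The note list, the track indexing, and the choice of C1 (MIDI 24) as the lowest kept pitch are illustrative assumptions; a MIDI parsing library such as pretty_midi would normally supply the note data.

```python
import numpy as np

BARS, STEPS, PITCHES, TRACKS = 4, 96, 84, 5     # assumed range: 84 pitches from C1 upward
PITCH_OFFSET = 24                               # assumed MIDI number of the lowest kept pitch (C1)

def phrase_tensor(notes):
    """notes: list of (track, bar, step, duration_in_steps, midi_pitch) tuples."""
    roll = np.zeros((BARS, STEPS, PITCHES, TRACKS), dtype=np.float32)
    for track, bar, step, dur, pitch in notes:
        p = pitch - PITCH_OFFSET
        if 0 <= p < PITCHES:                    # drop notes outside the kept pitch range
            roll[bar, step:step + dur, p, track] = 1.0
    return roll

# Example: a single huqin note (track 0) lasting 12 time steps in bar 0
example = phrase_tensor([(0, 0, 0, 12, 60)])
print(example.shape)   # (4, 96, 84, 5)
```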

4.4.3. GAN Structure and Dataset Training and Testing. The GAN structure diagram used in this study is shown in Figure 10.

The basic framework of the GAN includes a pair of models: a generative model and a discriminative model. The main purpose is for the generator G, aided by the discriminator D, to generate pseudo data consistent with the true data distribution. The input of the model is a random Gaussian white noise signal z; the noise signal is mapped to a new data space by the generator G to obtain the generated data G(z). Next, the discriminator D outputs a probability value based on its input, either the real data x or the generated data G(z), indicating D's confidence that the input is real data rather than generated fake data. In this way, the quality of the data generated by G is judged. When D can no longer distinguish between the real data x and the generated data G(z), the generator G is considered optimal.

Figure 9: Illustration of the dataset preparation and data preprocessing procedure (the Beijing Opera soundtracks are merged into five tracks, huqin, flute, suona, drum, and cymbal, then screened for the best matching confidence and for soundtracks carrying lyrics, and finally cleaned to form the training dataset).

Figure 10: GAN structure diagram (random noise z ~ p(z) passes through the generator G to produce fake data G(z), 4-bar phrases of 5 tracks, which together with the real data x are judged real or fake by the discriminator/critic, WGAN-GP).

The goal of D is to distinguish between real data and fake data, making D(x) as large as possible, D(G(z)) as small as possible, and the gap between the two as large as possible. Conversely, the goal of G is to make the score D(G(z)) of its generated data consistent with the score D(x) of the real data, so that D cannot distinguish between generated data and real data. Therefore, the optimization process is one of mutual competition and confrontation. The performance of G and D is continuously improved during repeated iterations until D(G(z)) finally matches the performance D(x) on the real data, at which point neither G nor D can be further improved.

The training process can be modeled as a simple MinMax problem:

\min_G \max_D\ \left[D(x) - D(G(z))\right]   (42)

The MinMax optimization formula is defined as follows:

\min_G \max_D V(D, G) = \min_G \max_D\ E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (43)

The GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but samples from it directly. Theoretically, it can approximate the real data distribution arbitrarily well, which is the biggest advantage of the GAN. A minimal sketch of the adversarial training loop is given below.
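The following PyTorch sketch shows the adversarial loop behind (42): the critic is trained to enlarge D(x) - D(G(z)) and the generator to enlarge D(G(z)). The network sizes, optimizer settings, and random "real" batch are illustrative only, and the gradient-penalty term of the WGAN-GP critic mentioned in Figure 10 is omitted for brevity; in practice a convolutional generator over the 4-D phrase tensor would be used.

```python
import torch
import torch.nn as nn

PHRASE_DIM = 4 * 96 * 84 * 5       # flattened 4-bar, 5-track piano-roll phrase
NOISE_DIM = 128

G = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, PHRASE_DIM), nn.Sigmoid())
D = nn.Sequential(nn.Linear(PHRASE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real_batch = torch.rand(16, PHRASE_DIM)          # placeholder for real score phrases

for step in range(100):
    # Critic step: push D(x) up and D(G(z)) down, as in (42)
    z = torch.randn(16, NOISE_DIM)
    d_loss = -(D(real_batch).mean() - D(G(z).detach()).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: push D(G(z)) up so the critic cannot tell fake from real
    z = torch.randn(16, NOISE_DIM)
    g_loss = -D(G(z)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```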

The training and testing process on the GAN-generated music dataset is shown in Figure 11.

The generator-produced chord section data, the specific music style data, the multitrack chord section data, and the multitrack groove data are sent to the GAN for training, so as to generate music with specific styles and corresponding grooves.

5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in a fully anechoic room; with comprehensive consideration of the factors discussed above, it meets the practical needs of the voice conversion system. The voice library was recorded by a woman with a standard Mandarin accent and contains numbers, technical nouns, everyday words, and so on, as the source speech. Another person then recorded a small number of sentences as the voice to be converted. Figure 12 shows the tone conversion process.

5.1.2. Experimental Results. Figures 13, 14, and 15 correspond to the speech of the source speaker, the speech of the target speaker, and the speech converted with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and each utterance is set to 5 s during the experiment.

They show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs, it can be seen directly that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.


Figure 11: Training and testing process of the GAN (random noise z together with chord, style, and groove inputs is passed through the bar generator G to produce the generated phrases G(z)).

Figure 12: Tone control model (training phase: STRAIGHT analysis extracts the fundamental frequency F0 and spectral envelope of the source and target voices; after time alignment with DTW, GMM training establishes the mapping rules, and the single Gaussian model yields the mean and variance used for F0 conversion; conversion phase: the voice to be converted is analyzed with STRAIGHT, its MFCC parameters and fundamental frequency F0 are converted, and STRAIGHT synthesis produces the tone-converted voice).

5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing, followed by conversions using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Operas produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis, dura_F0 uses the fundamental frequency control model together with the duration control model, dura_SP uses the duration control model together with the spectrum control model, and all_models uses all three control models simultaneously. 'Real' is the source Beijing Opera.

The melody control model can thus be summarized as in Figure 16.

5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based on human-oriented auditory evaluation.


Table 3: MOS grading.

Score   Evaluation
1       Uncomfortable and unbearable
2       There is a sense of discomfort, but it can be endured
3       Distortion can be detected and feels uncomfortable
4       Slightly perceptible distortion, but no discomfort
5       Good sound quality, no distortion

Table 4: Experimental results (MOS scores).

Method        Beijing Opera 1   Beijing Opera 2   Beijing Opera 3
Only_dura     1.25              1.29              1.02
dura_F0       1.85              1.97              1.74
dura_SP       1.78              2.90              2.44
all_models    3.27              3.69              3.28
real          5                 5                 5

Figure 13: Source speech spectrogram (frequency in Hz versus time in seconds).

In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and, according to these 5 levels, assigns the score of the quality level to which the speech belongs. A MOS score of about 3.5 is called communication quality; at this level the quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality; at this level the speech has high intelligibility but poor naturalness.

Ten testers were asked to give MOS scores for the above synthesis results; the results are shown in Table 4.

5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is segmented using distinctive word-level features such as the zero-crossing rate and the energy. Then the tone control model and the melody control model are applied: the important parameters such as the fundamental frequency, spectrum, and duration are extracted and converted using MFCC, DTW, GMM, and other tools, and finally the opera fragments are synthesized.

Compared with other algorithms, the straight algorithm performs better in terms of the naturalness of the synthesized speech and the range over which the parameters can be modified, so the straight algorithm is also chosen for the synthesis of the Beijing Opera.

The above-mentioned 10 testers again performed MOS scoring on the synthesis results; the scores are shown in Table 5.

According to the test results, the subjective scores reach an average of 3.7 points, indicating that the design basically accomplishes Beijing Opera synthesis. Although the Beijing Opera produced by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real Beijing Opera.

Table 5: Rating results (MOS scores).

Fragment                  student1   student2   student3   student4   student5
Source Opera fragment     5          5          5          5          5
Synthetic Opera fragment  4          4          4          3          3

Fragment                  student6   student7   student8   student9   student10
Source Opera fragment     5          5          5          5          5
Synthetic Opera fragment  4          3          4          4          4

Figure 14: Target speech spectrogram (frequency in Hz versus time in seconds).

Figure 15: Converted speech spectrogram (frequency in Hz versus time in seconds).

Figure 16: Melody control model (the syllable fundamental frequency, spectrum envelope, and syllable duration are extracted from the voice, combined with the note fundamental frequency and note length from the MIDI score, and fed through the F0 control model, the time length control model, and the spectrum control model to synthesize the Beijing Opera).

6. Conclusion

In this work, we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric for the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.

Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.

[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.

[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.

[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.

[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.

[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.

[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.

[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "A singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.

[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.

[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.

[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.

[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Proceedings of Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.

[15] "Interpretable representation learning by information maximizing generative adversarial nets."

[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.

[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: Beijing Opera Synthesis Based on Straight Algorithm and ... · AdvancesinMultimedia Target voice (A tone A content) New target voice (A tone B content) Beijing Opera fragment Source

4 Advances in Multimedia

119867m (119896) =

0 119896 lt 119891 (119898 minus 1)2 (119896 minus 119891 (119898 minus 1))[119891 (119898 + 1) minus 119891 (119898 minus 1)] [119891 (119898) minus 119891 (119898 minus 1)] 119891 (119898 minus 1) le 119896 le 119891 (119898)2 (119891 (119898 + 1) minus 119896)[119891 (119898 + 1) minus 119891 (119898 minus 1)] [119891 (119898) minus 119891 (119898 minus 1)] 119891 (119898) le 119896 le 119891 (119898 + 1)

0 119896 ge 119891 (119898 + 1)(5)

(6)Calculate logarithmic energy output from each filter bankas

119904 (119898) = ln[119873minus1sum119896=0

10038161003816100381610038161003816119883119886 (119896)210038161003816100381610038161003816 119867119898 (119896)] 0 le 119898 le 119872 (6)

(7) The MFCC coefficients are obtained by discrete cosinetransform (DCT) as

119862 (119899) = 119873minus1sum119898=0

119904 (119898) cos [120587119899 (119898 minus 05)119872 ] 119899 = 1 2 119871

(7)

The above logarithmic energy is taken intoDCT to obtainthe L-order Mel-scale Cepstrum parameter The L-ordermeans the MFCC coefficient order usually 12-16 Here M isthe number of triangular filters

(8) Logarithmic Energy In addition the volume (ie energy)of a frame is also an important feature of speech and is veryeasy to calculate Therefore the logarithmic energy of oneframe is usually added so that the basic speech features of eachframe have one more dimension including one logarithmicenergy and the remaining cepstrum parameters

(9) Dynamic Segmentation Parameters Extraction (includingFirst-Order Difference and Second-Order Difference) Thestandard cepstrum parameter MFCC only reflects the staticcharacteristics of the speech parameters The dynamic char-acteristics of the speech can be described by the differencespectrum of these static characteristics Experiments showthat combining dynamic and static features can effectivelyimprove the systemrsquos recognition performance The calcula-tion of the difference parameter can use the following formulaas

dt =

119862119905+1 minus 119862119905 119905 lt 119870sum119870119896=1 119896 (119862119905+119896 minus 119862119905minus119896)2sum119870

119896=1 1198962 others

119862119905 minus 119862119905minus1 119905 ge 119876 minus 119896(8)

where dt is the t-th first-order difference Ct is thet-th cepstrum coefficient Q is the order of the cepstralcoefficients and K is the time difference of the first derivativewhich can be 1 or 2 Substituting the result in the aboveequation yields the second-order difference parameter

23 Signal Characteristics Analysis According to the previ-ous research on speech signal processing technology peoplemainly focus on the signal analysis in the time domain andfrequency domain of these two methods

231 Time-Domain Analysis In the time domain the hori-zontal axis is the time and the vertical axis is the amplitudeBy observing thewaveform in the time domain we can obtainsome important features of the speech signal such as theduration the starting and ending positions of the syllablesthe sound intensity (energy) and vowels (see Figure 4)

232 Frequency Domain Analysis The voice signal spec-trum power spectrum cepstrum spectral envelope andso on are included It is generally considered that thefrequency spectrum of the speech signal is the product of thefrequency response of the channel system and the spectrumof the excitation source while the frequency response of thechannel system and the excitation source are time-varyingTherefore frequency domain analysis of speech signals isoften performed using short-time Fourier transform (STFT)It is defined as

119883n (119890119895120596) = +infinsum119898=minusinfin

119909 (119898)119908 (119899 minus 119898) 119890minus119895120596119898 (9)

The study of Chinese song synthesis algorithm is basedon parameter modification where we can see that short-time Fourier transform has two independent variables (nand w) so it is both a discrete function about time n and acontinuous function about angular frequency In the formulaw (n) is a window function and n takes different values andremoves different voice short segments where the subscriptn is different from the standard Fourier transform Sincethe shape of the window has an influence on the short-timespectrum the window function should have the followingcharacteristics

(1) High frequency resolution the main lobe is narrowand sharp

(2) Side lobe attenuation is large and spectrum leakagecaused by other frequency components is small These twoconditions are in fact contradictory to each other and cannotbe satisfied at the same time Therefore we often adopt acompromise approach and often choose aHammingwindow

However both time-domain analysis and frequencydomain analysis have their own limitations time-domainanalysis does not have an intuitive visualization of the

Advances in Multimedia 5

f(1) f(2) f(3) f(4) f(5) f(6) f(7)f(0)

(1(E)(2(E)

(3(E)(4(E)

(5(E)(6(E)

Figure 3 Mel frequency filter bank

0 1 2 3 4 5 6

060402

0minus02minus04

T I M E (s)

Figure 4 ldquoJiao Zhang Sheng yin cang zai qi pan zhi xiardquo time-domain diagram

frequency characteristics of speech signals and frequencydomain analysis lacks the variation of speech signals overtime As a result the experiment of the Beijing Operasynthesis analyzed the speech signal using the later improvedmethod of analyzing the spectrum

233 Spectrum Analysis The Fourier analysis display ofthe speech signal is called a sonogram or spectrogram Aspectrogram is a three-dimensional spectrum that representsa graph of the frequency spectrum of a voice over time withthe vertical axis as the frequency and the horizontal axis asthe time The intensity of any given frequency componentat a given moment is expressed in terms of the grayness orhue of the corresponding point The spectrum shows a greatdeal of information related to the characteristics of the speechsentence It combines the characteristics of spectrogramsand time-domain waveforms to clearly show how the speechspectrum changes over time or is a dynamic spectrum Fromthe spectrum we can get formant fundamental frequencyand other parameters in Figure 5

24 Straight Algorithm Introduction Straight is an acronymfor ldquoSpeech Transformation and Representation based onAdaptive Interpolation of weighted spectrogramrdquo It is amore accurate method of speech analysis and synthesisproposed by Japanese scholar Kawara Eiji in 1997The straightalgorithm builds on the sourcefilter model Among themthe source comes from the vocal cords vibration and thefilter refers to the channel transfer function It can adaptivelyinterpolate and smooth the speech short-duration spectrumin the time domain and the frequency domain so as to extractthe spectral envelope more accurately and adjust the speechduration fundamental frequency and spectral parameters toa great extent without affecting the quality of the synthesizedspeech The straight analysis synthesis algorithm consistsof three steps fundamental frequency extraction spectral

0100020003000400050006000

Spec

trogr

amfre

quen

cy (H

Z)

Figure 5 ldquoJiao Zhang Sheng yin cang zai qi pan zhi xiardquo spectro-gram

parameter estimation and speech synthesis The first two ofthem are described in detail below and only the synthesisprocess will be described in Figure 6

First of all the speech signal is input the speech fun-damental frequency F0 and spectral envelope are extractedby straight algorithm and the parameters are modulatedto generate a new sound source and time-varying filterAccording to the original filter model we use (10) to synthvoice

y (t) = sum119905119894isin119876

1radic119866 (119891119863 (119905119894))V119905119894 (119905 minus 119879 (119905119894)) (10)

vt119894 119879(t119894) is shown as

V119905119894 (119905) = 1radic2120587 int+infin

minusinfin119881 (120596 119905119894) 120593 (120596) 119890119895120596(119905)119889120596 (11)

119879 (119905119894) = sum119905119896isin119876119896lt119894

1radic119866 (1198910 (119905119896)) (12)

In the formula Q represents the position of a group ofsamples in the synthesis excitation and G represents thepitch modulation The F0 after modulation can be matchedwith any F0 of the original language arbitrarily All-passfilter is used to control the time structure of fine pitchand original signal such as a frequency-proportional linearphase shift used to control the fine structure of F0 119881(120596 119905119894)is the corresponding Fourier transform of the minimumphase pulse as in (12)119860[119878(119906(120596) 119903(119905)) 119906(120596) 119903(119905)] is calculatedfrom the modulation amplitude spectrum where A u and rrepresent the modulation of amplitude frequency and timerespectively as (13) (14) and (15)

119881 (120596 119905) = 119890(1radic2120587) intinfin0 ℎ119905(119902)119890119895120603119902119889119902 (13)

ℎ119905 (119902) = 0 (119902 lt 0)119888119905 (0) (119902 = 0)2119888119905 (119902) (119902 gt 0)

(14)

119888119905 (119902) = 1radic2120587sdot int+infin

minusinfin119890minus119895120596119902lg119860 119878 [119906 (120596) 119903 (119905)] 119906 (120596) 119903 (119905) 119889120596 (15)

q is the frequency Straight audiometry experiments showthat even in the case of high-sensitivity headphones the

6 Advances in Multimedia

F0 Fundamental extraction Sound source

Parameter

adjustmentSpectral envelope extraction

Output synthetic speech

Voice parameters

input voice Time-varying filter

Figure 6 Straight synthesis system

synthesized speech signal is almost indistinguishable fromthe original signal

3 Tone Control Model

Voice tonal conversion refers to the voice signal processingtechnology to deal with the voice to maintain the samesemantic content but only change the tone so that a per-sonrsquos voice signal (source voice) after the sound conversionprocessing sounds like another person voice (target voice)This chapter introduces the extraction of the parametersthat are closely related to the timbre by using the straightalgorithm and then the training model of the extractedparameters by using GMM to get the corresponding rela-tionship between the source voice and the target voiceFinally the new parameters are straight synthesis in order toachieve voice conversion It can be seen from Section 2 thatthe tone characteristics in speech mainly correspond to theparameters ldquofundamental F0rdquo and ldquochannel spectrumrdquo

31 The Fundamental Frequency andChannel Spectrum Extraction

311 Extraction of the Fundamental Frequency Straight algo-rithm has a good time-domain resolution and fundamentalfrequency trajectory which is based on wavelet transformto analyze first found from the extracted audio frequencyto find the base frequency and then calculated the instanta-neous frequency as the fundamental frequency

Fundamentals of the extraction can be divided into threeparts F0 coarse positioning F0 track smooth and F0 finepositioningThe coarse positioning of F0 refers to the wavelettransform of the voice signal to obtain the wavelet coeffi-cients then the wavelet coefficients are transformed into aset of instantaneous frequencies for selecting F0 for eachframe F0 trajectory smoothing is based on the calculatedhigh-frequency energy ratio the minimum noise energyequivalent in the instantaneous frequency selected the mostlikely F0 and thus constitutes a smooth pitch trajectory F0fine positioning through FFT transforms the current F0 fine-tuning The process is as follows

Input signal is s(t) the output composite signal is119863(119905 120591119888)where gAG(t) is an analysis of the wavelet which is gotten by

the input signal through Gabor filter and 120591119888 is analysis cycleof the analyzed wavelet as

119863(119905 120591119888) = 10038161003816100381610038161205911198881003816100381610038161003816minus12 int+infin

minusinfin119904 (119905) 119892119860119866119905 minus 120583120591119888 119889120583 (16)

gAG(t) is (17) and shown as (18)

g119860119866 (119905) = 119892 (119905 minus 14) minus 119892(119905 + 14) (17)

119892 (119905) = 119890minus120587(119905120578)2119890minus1198952120587119905 (18)

Among them 120578 is the frequency resolution of the Gaborfilter which is usually larger than 1 according to the charac-teristics of the filter

Through calculation, the variable "fundamentalness" is introduced and denoted by $M(t, \tau_0)$, as in (19):

$$M = -\log\left[\int_{\Omega}\left(\frac{d|D|}{du}\right)^{2} du\right] + \log\left[\int_{\Omega}|D|^{2}\, du\right] - \log\left[\int_{\Omega}\left(\frac{d \arg(D)}{du}\right)^{2} du\right] + 2\log\tau_0 + \log\Omega(\tau_0) \quad (19)$$

The first term is the amplitude modulation (AM) value; the second term is the total energy, used to normalize the value of AM; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency, used to normalize the value of FM; the fifth term is the normalization factor of the time-domain integration interval. From the formula it can be seen that when AM and FM take their minimum, M takes its maximum, which identifies the fundamental component.


However, in practice F0 always changes rapidly, so in order to reduce the impact of this on M, the formula is adjusted as in (20), (21), and (22):

$$M = -\log\left[\int_{\Omega}\left(\frac{d|D|}{du} - \mu_{AM}\right)^{2} du\right] + \log\left[\int_{\Omega}|D|^{2}\, du\right] - \log\left[\int_{\Omega}\left(\frac{d \arg(D)}{du} - \mu_{FM}\right)^{2} du\right] + 2\log\tau_0 + \log\Omega(\tau_0) \quad (20)$$

$$\mu_{AM} = \frac{1}{\Omega}\int_{\Omega}\left(\frac{d|D|}{du}\right) du \quad (21)$$

$$\mu_{FM} = \frac{1}{\Omega}\int_{\Omega}\left(\frac{d^{2} \arg(D)}{du^{2}}\right) du \quad (22)$$

Finally, $\tau_0$ is used to calculate the instantaneous frequency $\omega(t)$, and the fundamental frequency F0 is obtained by (23), (24), and (25):

$$f_0 = \frac{\omega_0(t)}{2\pi} \quad (23)$$

$$\omega(t) = 2 f_s \arcsin\frac{\left|y_d(t)\right|}{2} \quad (24)$$

$$y_d(t) = \frac{D(t + \Delta t/2, \tau_0)}{\left|D(t + \Delta t/2, \tau_0)\right|} - \frac{D(t - \Delta t/2, \tau_0)}{\left|D(t - \Delta t/2, \tau_0)\right|} \quad (25)$$
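The instantaneous-frequency step of (23)-(25) reduces to simple array arithmetic. The sketch below is an assumption-laden illustration, not the authors' implementation: `D_plus` and `D_minus` stand for the complex wavelet outputs $D(t \pm \Delta t/2, \tau_0)$, here taken one sample apart so that $\Delta t = 1/f_s$.

```python
import numpy as np

def instantaneous_f0(D_plus, D_minus, fs):
    """Eqs. (23)-(25): D_plus/D_minus are complex wavelet outputs D(t +/- dt/2, tau0)
    sampled at rate fs; returns the instantaneous fundamental frequency f0(t)."""
    y_d = D_plus / np.abs(D_plus) - D_minus / np.abs(D_minus)   # Eq. (25)
    omega = 2.0 * fs * np.arcsin(np.abs(y_d) / 2.0)             # Eq. (24)
    return omega / (2.0 * np.pi)                                 # Eq. (23)

# Toy usage with a synthetic 200 Hz response (illustration only)
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
D = np.exp(1j * 2 * np.pi * 200 * t)
f0 = instantaneous_f0(D[1:], D[:-1], fs)
print(f0.mean())   # approximately 200 Hz
```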

3.1.2. Channel Spectral Parameter Extraction. The previous approach extracted the sound source information and the channel spectrum information of the voice and then adjusted them to achieve voice modification. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.

The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is as in (26):

$$s(t) = p(t) * v(t) \quad (26)$$

Since it is difficult to find v(t) directly, the straight algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated by (27) and (28):

$$s_w(t, t') = s(t)\, w(t, t') \quad (27)$$

$$S_W(\omega, t') = \mathrm{FFT}\left[s_w(t, t')\right] = S(\omega, t')\, W(\omega, t') \quad (28)$$

The short-time spectrum shows the periodicity related to the fundamental frequency in the time domain and the frequency domain, respectively. The short-time spectrum window functions used are (29) and (30):

$$w(t) = \frac{1}{f_0}\, e^{-\pi (t f_0)^{2}} \quad (29)$$

$$W(\omega) = \frac{f_0}{\sqrt{2\pi}}\, e^{-\pi (\omega/\omega_0)^{2}} \quad (30)$$

However, since both the channel spectrum and the sound source spectrum are related to the fundamental frequency, they cannot yet be considered separated at this point. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.

Periodicity removal in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, respectively, as in (31), (32), and (33):

$$w_p(t) = e^{-\pi (t/\tau_0)^{2}} * h\!\left(\frac{t}{\tau_0}\right) \quad (31)$$

$$h(t) = \begin{cases} 1 - |t| & (|t| < 1) \\ 0 & (\text{otherwise}) \end{cases} \quad (32)$$

$$w_c(t) = w_p(t)\, \sin\!\left(\pi \times \frac{t}{\tau_0}\right) \quad (33)$$

Then the short-time amplitude spectra $|S_p(\omega, t')|$ and $|S_c(\omega, t')|$ are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is obtained as in (34):

$$\left|S_r(\omega, t')\right| = \sqrt{\left|S_p(\omega, t')\right|^{2} + \xi \left|S_c(\omega, t')\right|^{2}} \quad (34)$$

Among them, $\xi$ is the mixing factor; when $\xi$ takes the value 0.13655, the optimal solution is obtained.
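A minimal sketch of (34), assuming `S_p` and `S_c` are the short-time spectra obtained with the pitch-synchronous smoothing window and the compensation window; the mixing factor 0.13655 is the value quoted above, and the array shapes are placeholders.

```python
import numpy as np

XI = 0.13655  # mixing factor reported as optimal in the text

def deperiodized_amplitude(S_p, S_c, xi=XI):
    """Eq. (34): combine the amplitude spectra from the pitch-synchronous
    smoothing window (S_p) and the compensation window (S_c)."""
    return np.sqrt(np.abs(S_p) ** 2 + xi * np.abs(S_c) ** 2)

# Toy usage with random spectra of matching shape (illustration only)
rng = np.random.default_rng(0)
S_p = rng.normal(size=513) + 1j * rng.normal(size=513)
S_c = rng.normal(size=513) + 1j * rng.normal(size=513)
print(deperiodized_amplitude(S_p, S_c).shape)
```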

Similarly, the frequency domain also needs a smoothing window $V(\omega)$ and a compensation window $U(\omega)$ to remove the periodicity in the short-time spectrum $S_W(\omega)$; the spectral envelope with the periodicity removed, $S_{S'}(\omega)$, is finally obtained as in (35):

$$S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega) \quad (35)$$

Finally, logarithmic amplitude compression and a frequency-warped discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).
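The final parameterization step can be illustrated as log-amplitude compression followed by a DCT; this is only a schematic stand-in for the mel-warped MFCC extraction described in Section 2, and the envelope length and number of coefficients below are placeholders.

```python
import numpy as np
from scipy.fft import dct

def spectrum_to_cepstrum(envelope, n_coeffs=24):
    """Sketch of the final step of Section 3.1.2: logarithmic amplitude
    compression followed by a discrete cosine transform, yielding MFCC-like
    cepstral parameters from a (mel-warped) spectral envelope."""
    log_env = np.log(np.maximum(envelope, 1e-10))   # amplitude compression
    return dct(log_env, type=2, norm='ortho')[:n_coeffs]

# Toy usage on a random positive envelope (illustration only)
env = np.abs(np.random.randn(80)) + 1.0
print(spectrum_to_cepstrum(env).shape)   # (24,)
```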

3.2. Parameter Conversion with GMM

3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions, as in (36):

$$P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i\, b_i(X) \quad (36)$$


where X is a random vector of n dimensions, $\omega_i$ is a mixture weight with $\sum_{i=1}^{M} \omega_i = 1$, and $b_i(X)$ is a subdistribution of the GMM; each subdistribution is a Gaussian distribution, as in (37):

$$b_i(X) = \frac{1}{(2\pi)^{n/2} \left|\Sigma_i\right|^{1/2}}\, e^{-\frac{1}{2}(X - \mu_i)^{T} \Sigma_i^{-1} (X - \mu_i)} \quad (37)$$

where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix.

Although the set of phoneme types is fixed, each phoneme varies in different situations due to its context. We use a GMM to model the acoustic characteristics of the speaker so as to find the most likely mapping at each time.

3.2.2. Establishing the Conversion Function. GMM training refers to the estimation of the probability density distribution of the samples, and the estimated model (training model) is a weighted sum of several Gaussian models. It maps feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.

(1) The Conversion of the Fundamental Frequency. Here, the single-Gaussian model method is used to convert the fundamental frequency; the converted fundamental frequency is obtained through the mean and variance of the target speaker ($\mu_{tgt}$, $\sigma_{tgt}$) and of the source speaker ($\mu_{src}$, $\sigma_{src}$), as in (38):

$$f_{0,\mathrm{conv}}(t) = \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\; f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\; \mu_{src} \quad (38)$$
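Equation (38) is a one-line transformation. The sketch below applies it to a linear-frequency F0 contour with the speaker statistics passed in as plain numbers; whether the statistics are computed on linear or log F0 is not specified here and is left as an assumption.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Eq. (38): map the source F0 contour onto the target speaker's
    mean/variance statistics with a single-Gaussian transformation."""
    ratio = sigma_tgt / sigma_src
    return ratio * f0_src + mu_tgt - ratio * mu_src

# Toy usage: a flat 200 Hz contour mapped onto a higher-pitched target
f0 = np.full(100, 200.0)
print(convert_f0(f0, mu_src=190.0, sigma_src=30.0, mu_tgt=260.0, sigma_tgt=45.0)[0])
```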

(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function; the purpose is to predict the required output data from the input data. The spectrum conversion function is defined as in (39), (40), and (41):

$$F(X) = E\left[Y \mid X\right] = \int Y \cdot P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X) \left[\mu_i^{Y} + \Sigma_i^{YX} \left(\Sigma_i^{XX}\right)^{-1} \left(X - \mu_i^{X}\right)\right] \quad (39)$$

$$P_i(X) = \frac{\omega_i\, b_i(X_t)}{\sum_{k=1}^{M} \omega_k\, b_k(X_t)} \quad (40)$$

$$\mu_i = \begin{bmatrix} \mu_i^{X} \\ \mu_i^{Y} \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \qquad i = 1, \ldots, M \quad (41)$$

Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" envelope (fundamental frequency envelope over 0-7 s).

$\mu_i^{X}$ and $\mu_i^{Y}$ are the means of the i-th Gaussian component for the source speaker and the target speaker, $\Sigma_i^{XX}$ is the variance matrix of the i-th Gaussian component of the source speaker, $\Sigma_i^{XY}$ is the covariance matrix of the i-th Gaussian component between the source speaker and the target speaker, and $P_i(X)$ is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.
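A hedged sketch of the conversion function (39)-(41): a joint GMM is fitted on time-aligned source/target feature pairs (here with scikit-learn and SciPy, which are our choices rather than the paper's tooling), and the minimum mean-square-error mapping is evaluated component by component. All array sizes and component counts are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=8, seed=0):
    """Fit a GMM on time-aligned joint vectors Z = [X; Y] (source/target features)."""
    return GaussianMixture(n_components=n_components, covariance_type='full',
                           random_state=seed).fit(np.hstack([X, Y]))

def convert_spectrum(gmm, X, dim_x):
    """Eqs. (39)-(41): F(X) = sum_i P_i(X) [mu_y_i + S_yx_i S_xx_i^{-1} (X - mu_x_i)]."""
    M = gmm.n_components
    mu_x, mu_y = gmm.means_[:, :dim_x], gmm.means_[:, dim_x:]
    cov_xx = gmm.covariances_[:, :dim_x, :dim_x]
    cov_yx = gmm.covariances_[:, dim_x:, :dim_x]

    # Component posteriors P_i(X) from Eq. (40)
    dens = np.stack([gmm.weights_[i] *
                     multivariate_normal.pdf(X, mu_x[i], cov_xx[i])
                     for i in range(M)], axis=1)              # (T, M)
    post = dens / dens.sum(axis=1, keepdims=True)

    # Mixture of per-component conditional means, Eq. (39)
    Y_hat = np.zeros((X.shape[0], mu_y.shape[1]))
    for i in range(M):
        cond = mu_y[i] + (X - mu_x[i]) @ np.linalg.solve(cov_xx[i], cov_yx[i].T)
        Y_hat += post[:, [i]] * cond
    return Y_hat

# Toy usage with random, already-aligned feature matrices (illustration only)
rng = np.random.default_rng(1)
X_src, Y_tgt = rng.normal(size=(500, 13)), rng.normal(size=(500, 13))
gmm = fit_joint_gmm(X_src, Y_tgt, n_components=4)
print(convert_spectrum(gmm, X_src, dim_x=13).shape)   # (500, 13)
```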

4. Melody Control Model

The composition of Beijing Opera has similarities with the synthesis of a general singing voice [7, 8]; that is, through the superimposition of voice and melody, the new pitch of each word is reconstructed. Through the analysis in Section 2, it is found that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody, since it indicates the frequency of vocal-fold vibration. The duration, that is, the pronunciation length of each word, controls the rhythm of Beijing Opera and represents the speed of the voice. The energy is positively correlated with the sound intensity and represents the emotion.

4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the prose while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody are in the fundamental frequency. The fundamental frequency envelope of a Beijing Opera piece corresponds to the melody, which includes tone, pitch, and tremolo [9]. By contrast, the pitch of a note in a score is a constant; their comparison is shown in Figure 7.

From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. Therefore, the control design of the fundamental frequency [10] is as in Figure 8.

4.2. Time Control Model. Each word in Chinese usually has different syllables, and the initials and vowels in each syllable


Figure 8: The control design of the fundamental frequency (the fundamental frequency extracted from MIDI undergoes vibrato processing, Gaussian white noise is passed through a high-pass filter, and the two are combined to output the fundamental frequency curve).

Table 2: Duration parameters.

Before modification        After modification
dur_a                      k * dur_a
dur_b                      dur_b
dur_c                      dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: initial-to-vowel transition part duration; dur_c: final (vowel) part duration; dur_t: target total duration.

also play different roles. The initials, whether in normal speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of Beijing Opera, we use the note duration to control the length of each word and make the rules for the vowel length shown in Table 2.

The duration of the initial part is modified in accordance with the proportion k in the table [11]; k is obtained from a large number of speech and song comparison experiments. The duration of the initial-to-vowel transition area remains unchanged. The length of the vowel section is varied so that the total duration of the syllable corresponds to the duration of each note in the score, as in the sketch below.
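The rules of Table 2 amount to a small amount of arithmetic per syllable. In the sketch below the scaling factor k = 0.8 is a placeholder, since the paper derives k empirically from speech/song comparisons.

```python
def modify_durations(dur_a, dur_b, dur_c, dur_t, k=0.8):
    """Duration rules of Table 2: scale the initial part by k, keep the
    initial-to-vowel transition, and stretch the vowel so that the syllable
    matches the target note duration dur_t. The original vowel duration
    dur_c is kept in the signature only for reference; it is replaced."""
    new_a = k * dur_a
    new_b = dur_b
    new_c = dur_t - new_a - new_b
    if new_c < 0:
        raise ValueError("target note duration too short for this syllable")
    return new_a, new_b, new_c

# Toy usage: a 0.35 s spoken syllable stretched to fit a 0.9 s note
print(modify_durations(dur_a=0.08, dur_b=0.05, dur_c=0.22, dur_t=0.9))
```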

The method of dividing the vowel boundaries is introduced in Section 2 and will not be repeated here.

4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that, for well-trained singing voices, the spectrum has a special resonance (the singing formant) in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.

4.4. GAN Model

4.4.1. Introduction of the GAN. Generative adversarial networks, abbreviated as GAN [12–17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: random noise is input and pseudo data are obtained by the generator; a part of the real data is taken from the true data; the two are mixed and sent to the discriminator, which gives a true-or-false decision and returns the loss according to this result. The purpose of a GAN is to estimate the potential distribution of the data samples and to generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing and have huge application prospects. This study uses a GAN to synthesize the music that accompanies Beijing Opera.

4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5,000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.

First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, it is necessary to solve this data imbalance problem by merging the sound tracks of similar instruments. Each of the multitrack Beijing Opera scores is merged into five musical instruments: huqins, flutes, suonas, drums, and cymbals. These five types of instruments are the most commonly used musical instruments in Beijing Opera music.

Then the merged-track datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the score segments without lyrics are not what we need; only the soundtracks that accompany Beijing Opera lyrics are selected.

Finally, in order to obtain meaningful music segments to train the temporal model, it is necessary to divide the Beijing Opera score and obtain the corresponding music segments. Four bars are treated as one phrase, and longer passages are cut into this length. Because pitches that are too high or too low are not common, pitches lower than C1 or higher than C8 are discarded, and the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track), as sketched below. This completes the preparation and preprocessing of the dataset.
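A minimal sketch of the resulting data structure, assuming the common piano-roll convention in which the 84 retained pitches start at MIDI note 24 (C1); the note-setting helper and note length are purely illustrative.

```python
import numpy as np

# Target training tensor from Section 4.4.2: 4 bars x 96 time steps per bar
# x 84 pitches x 5 merged tracks (huqin, flute, suona, drum, cymbal).
N_BARS, STEPS_PER_BAR, N_PITCHES, N_TRACKS = 4, 96, 84, 5

def empty_phrase():
    """One binary piano-roll phrase with the shape used to train the GAN."""
    return np.zeros((N_BARS, STEPS_PER_BAR, N_PITCHES, N_TRACKS), dtype=np.uint8)

def set_note(phrase, bar, step, midi_pitch, track, length=6):
    """Switch on a note; pitches outside the retained 84-pitch range are dropped."""
    idx = midi_pitch - 24          # assumes index 0 corresponds to C1 (MIDI 24)
    if 0 <= idx < N_PITCHES:
        phrase[bar, step:step + length, idx, track] = 1

phrase = empty_phrase()
set_note(phrase, bar=0, step=0, midi_pitch=60, track=0)   # middle C on the huqin track
print(phrase.shape, phrase.sum())
```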

4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is shown in Figure 10.

The basic framework of the GAN includes a pair of models: a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution, with the discriminator D assisting the generator G. The input of the model is a random Gaussian white noise signal z; the noise signal is mapped to a new data space via the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the input of the true data x and the generated data G(z), respectively, indicating the confidence with which D judges whether the input is real data or generated false data. In this way, it is judged whether the data generated by G are good or bad. When D can no longer distinguish between the real data x and the generated data G(z), the generator G is considered to be optimal.

The goal of D is to distinguish between real data and false data, so that D(x) is as large as possible while D(G(z)) is as small as possible, and the difference between the two is as large as possible. Conversely, the goal of G is to make the performance D(G(z)) of its own generated data on D consistent with the performance D(x) of the real data, so that D cannot distinguish between generated data and real data. Therefore, the optimization process of the module is a process of mutual competition and confrontation. The performance of G and D is continuously improved during repeated iterations until the final D(G(z)) is consistent with the performance D(x) of the real data and neither G nor D can be further optimized.

Figure 9: Illustration of the dataset preparation and data preprocessing procedure (Beijing Opera soundtracks; merging of audio tracks into five instruments: huqins, flutes, suonas, drums, and cymbals; screening of music with the best matching confidence and of the soundtracks of Beijing Opera lyrics; data cleaning; training datasets).

Figure 10: GAN structure diagram (random noise z ~ p(z) enters the generator G to give fake data G(z); real data x and G(z) are judged real or fake by the discriminator/critic (WGAN-GP) over 4-bar phrases of 5 tracks).

The training process can be modeled as a simple MinMax problem, as in (42):

$$\min_{G}\;\max_{D}\; D(x) - D(G(z)) \quad (42)$$

The MinMax optimization formula is defined as follows:

$$\min_{G}\;\max_{D}\; V(D, G) = \min_{G}\;\max_{D}\; E_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + E_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \quad (43)$$

The GAN does not require a preset data distribution; that is, it does not need to formulate a description of p(x) in advance but learns it directly. Theoretically, it can approximate the real data distribution arbitrarily closely. This is the biggest advantage of the GAN; a minimal adversarial update in this spirit is sketched below.
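As a hedged illustration of (42) only, the sketch below runs one critic step and one generator step on toy data with PyTorch (our choice of framework, not necessarily the paper's). It omits the gradient penalty of the WGAN-GP critic mentioned in Figure 10, and the flattened toy dimensionality is much smaller than the real 4 × 96 × 84 × 5 phrases.

```python
import torch
import torch.nn as nn

# Toy dimensionality; the real phrases flatten to 4 * 96 * 84 * 5 values.
latent_dim, data_dim = 64, 1024

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(real_batch):
    # Critic step: increase D(x) - D(G(z)), i.e. Eq. (42) from D's point of view
    z = torch.randn(real_batch.size(0), latent_dim)
    fake = G(z).detach()
    loss_d = -(D(real_batch).mean() - D(fake).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: make D(G(z)) as large as possible
    z = torch.randn(real_batch.size(0), latent_dim)
    loss_g = -D(G(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# One toy update on random "real" phrases (illustration only)
print(train_step(torch.rand(8, data_dim)))
```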

The training and testing process on the GAN-generated music dataset is shown in Figure 11.

The generator-produced chord section data together with specific music style data, and the generator-produced multitrack chord section data together with multitrack music groove data, are sent to the GAN for training, so that music with specific styles and the corresponding grooves can be generated.

5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in a fully anechoic room; with comprehensive consideration of the aforementioned factors, it can better meet the actual needs of the speech conversion system. The voice library was recorded by a woman with a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., as the source speech. Another person then recorded a small number of sentences as the voices to be converted. Figure 12 shows the tone conversion process.

5.1.2. Experimental Results. Figures 13, 14, and 15 show, respectively, the speech of the source speaker, the speech of the target speaker, and the converted speech obtained with STRAIGHT and the GMM model, in the form of spectrograms. All voices are sampled at 16 kHz and quantized with 16 bits, and each voice is set to 5 s during the experiment.

They show the three-dimensional MFCC maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs, it can be seen directly that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.


Figure 11: Training and testing process of the GAN (random noise vectors z are fed to bar generators conditioned on chords, style, and groove to produce the generated phrases G(z)).

Figure 12: Tone control model (training phase: STRAIGHT analysis of the source and target voices yields the fundamental frequency F0 and the spectral envelope, which are time-aligned with DTW and used for GMM training to establish the mapping rules, while the single-Gaussian model method calculates the F0 mean and variance; conversion phase: the voices to be converted undergo STRAIGHT analysis, MFCC parameter conversion, and F0 conversion, followed by STRAIGHT synthesis).

5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing, followed by conversions using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Opera pieces produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental frequency control model and the duration control model; dura_SP uses the duration control model and the spectrum control model; all_models uses the three control models simultaneously. 'Real' is the source Beijing Opera.

So the melody control model can be summarized inFigure 16

5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based


Table 3: MOS grading.

Score    Evaluation
1        Uncomfortable and unbearable
2        There is a sense of discomfort, but it can be endured
3        Distortion can be detected and feels uncomfortable
4        Slightly perceptible distortion, but no discomfort
5        Good sound quality, no distortion

Table 4: Experimental results (MOS scores).

Method        Beijing Opera 1    Beijing Opera 2    Beijing Opera 3
Only_dura     1.25               1.29               1.02
dura_F0       1.85               1.97               1.74
dura_SP       1.78               2.90               2.44
all_models    3.27               3.69               3.28
real          5                  5                  5

Figure 13: Source speech spectrogram (frequency 0-6000 Hz vs. time in seconds).

on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating the voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides the speech quality into 5 levels; see Table 3. The tester listens to the converted speech and, according to these 5 levels, gives the score of the quality level to which the measured speech belongs. A MOS score of about 3.5 is called communication quality; at this level, the quality of the auditorily reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality; at this level, the speech has high intelligibility, but the naturalness is poor.

Ten testers were recruited to give MOS scores for the above synthesis results. The results are shown in Table 4.

5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is segmented by its unique characteristics, such as zero-crossing rate and energy. Then the tone control model and the melody control model are designed; important parameters such as the fundamental frequency, spectrum, and duration are extracted; and MFCC, DTW, GMM, and other tools are used to analyze and convert the extracted features, finally yielding the synthesized opera fragments.

Compared with other algorithms, the straight algorithm has better performance in terms of the naturalness of the synthesis and the range of parameter modification, so the straight algorithm is also selected for the synthesis of Beijing Opera.

The above-mentioned 10 testers again performed MOS scoring on the synthesis result. The result is shown in Table 5.

According to the test results, the subjective scores reached an average of 3.7 points, indicating that the design basically accomplishes the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real Beijing Opera.

6. Conclusion

In this work, we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are


Table 5: Rating results (MOS scores).

                              student1  student2  student3  student4  student5
Source Opera fragment         5         5         5         5         5
Synthetic Opera fragment      4         4         4         3         3

                              student6  student7  student8  student9  student10
Source Opera fragment         5         5         5         5         5
Synthetic Opera fragment      4         3         4         4         4

Figure 14: Target speech spectrogram (frequency 0-6000 Hz vs. time in seconds).

Figure 15: Converted speech spectrogram (frequency 0-6000 Hz vs. time in seconds).

Figure 16: Melody control model (feature extraction of the voice yields the syllable fundamental frequency, the spectral envelope, and the syllable duration; the note fundamental frequency and note length from MIDI drive the F0 control model, the time length control model, and the spectrum control model, whose outputs are combined to synthesize Beijing Opera).

truly able to compose Beijing Opera singing artworks of higher quality.

Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [Zenodo] repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92–104, 2007.

[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63–80, 2013.

[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.

[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.

[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.

[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.

[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67–79, 2007.

[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435–438, April 1997.

[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288–3293, Kunming, China, July 2008.

[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.

[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, China, 2009.

[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Proceedings of Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.

[15] "Interpretable representation learning by information maximizing generative adversarial nets."

[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.

[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.



In the formula Q represents the position of a group ofsamples in the synthesis excitation and G represents thepitch modulation The F0 after modulation can be matchedwith any F0 of the original language arbitrarily All-passfilter is used to control the time structure of fine pitchand original signal such as a frequency-proportional linearphase shift used to control the fine structure of F0 119881(120596 119905119894)is the corresponding Fourier transform of the minimumphase pulse as in (12)119860[119878(119906(120596) 119903(119905)) 119906(120596) 119903(119905)] is calculatedfrom the modulation amplitude spectrum where A u and rrepresent the modulation of amplitude frequency and timerespectively as (13) (14) and (15)

119881 (120596 119905) = 119890(1radic2120587) intinfin0 ℎ119905(119902)119890119895120603119902119889119902 (13)

ℎ119905 (119902) = 0 (119902 lt 0)119888119905 (0) (119902 = 0)2119888119905 (119902) (119902 gt 0)

(14)

119888119905 (119902) = 1radic2120587sdot int+infin

minusinfin119890minus119895120596119902lg119860 119878 [119906 (120596) 119903 (119905)] 119906 (120596) 119903 (119905) 119889120596 (15)

q is the frequency Straight audiometry experiments showthat even in the case of high-sensitivity headphones the

6 Advances in Multimedia

F0 Fundamental extraction Sound source

Parameter

adjustmentSpectral envelope extraction

Output synthetic speech

Voice parameters

input voice Time-varying filter

Figure 6 Straight synthesis system

synthesized speech signal is almost indistinguishable fromthe original signal

3 Tone Control Model

Voice tonal conversion refers to the voice signal processingtechnology to deal with the voice to maintain the samesemantic content but only change the tone so that a per-sonrsquos voice signal (source voice) after the sound conversionprocessing sounds like another person voice (target voice)This chapter introduces the extraction of the parametersthat are closely related to the timbre by using the straightalgorithm and then the training model of the extractedparameters by using GMM to get the corresponding rela-tionship between the source voice and the target voiceFinally the new parameters are straight synthesis in order toachieve voice conversion It can be seen from Section 2 thatthe tone characteristics in speech mainly correspond to theparameters ldquofundamental F0rdquo and ldquochannel spectrumrdquo

31 The Fundamental Frequency andChannel Spectrum Extraction

311 Extraction of the Fundamental Frequency Straight algo-rithm has a good time-domain resolution and fundamentalfrequency trajectory which is based on wavelet transformto analyze first found from the extracted audio frequencyto find the base frequency and then calculated the instanta-neous frequency as the fundamental frequency

Fundamentals of the extraction can be divided into threeparts F0 coarse positioning F0 track smooth and F0 finepositioningThe coarse positioning of F0 refers to the wavelettransform of the voice signal to obtain the wavelet coeffi-cients then the wavelet coefficients are transformed into aset of instantaneous frequencies for selecting F0 for eachframe F0 trajectory smoothing is based on the calculatedhigh-frequency energy ratio the minimum noise energyequivalent in the instantaneous frequency selected the mostlikely F0 and thus constitutes a smooth pitch trajectory F0fine positioning through FFT transforms the current F0 fine-tuning The process is as follows

Input signal is s(t) the output composite signal is119863(119905 120591119888)where gAG(t) is an analysis of the wavelet which is gotten by

the input signal through Gabor filter and 120591119888 is analysis cycleof the analyzed wavelet as

119863(119905 120591119888) = 10038161003816100381610038161205911198881003816100381610038161003816minus12 int+infin

minusinfin119904 (119905) 119892119860119866119905 minus 120583120591119888 119889120583 (16)

gAG(t) is (17) and shown as (18)

g119860119866 (119905) = 119892 (119905 minus 14) minus 119892(119905 + 14) (17)

119892 (119905) = 119890minus120587(119905120578)2119890minus1198952120587119905 (18)

Among them 120578 is the frequency resolution of the Gaborfilter which is usually larger than 1 according to the charac-teristics of the filter

Through calculation the variable ldquofundamentalnessrdquo isintroduced and denoted by 119872(119905 1205910) as

119872 = minus log [intΩ(119889 |119863|119889119906 )119889119906] + log [int

Ω|119863|2 119889119906]

minus log[intΩ(119889 arg |119863|119889119906 )2 119889119906] + 2 log 1205910

+ logΩ(1205910)(19)

The first term is the amplitude modulation (AM) valuethe second term is the total energy used to normalize thevalue of AM the third term is the frequency modulation(FM) value the fourth term is the square of the fundamentalfrequency used to normalize the value of FM the fifth is thenormalization factor of the time-domain integration intervalBy the formula the following can be drawn when AM FMtake the minimum M takes the maximum namely gettingthe fundamental part

Advances in Multimedia 7

However in practice F0 always changes rapidly so inorder to reduce the impact on M the formula makes someadjustments as (20) (21) and (22)

119872 = minus log[intΩ(119889 |119863|119889119906 minus 120583119865119872)2 119889119906]

+ log [intΩ|119863|2 119889119906]

minus log[intΩ(119889arg |119863|119889119906 minus 120583119865119872)2 119889119906]

+ 2 log 1205910 + logΩ(1205910)

(20)

120583119860119872 = 1Ω intΩ(119889 |119863|119889119906 ) (21)

120583119865119872 = 1Ω intΩ(1198892 arg (D)1198891199062 ) (22)

Finally use 1205910 to calculate the instantaneous frequency120596(119905) and get the fundamental frequency F0 by (23) (24) and(25)

1198910 = 1205960 (119905)2120587 (23)

120596 (119905) = 2119891119904 arcsin 1003816100381610038161003816119910119889 (119905)10038161003816100381610038162 (24)

y119889 (119905) = 119863 (119905 + Δ1199052 1205910)1003816100381610038161003816119863 (119905 + Δ1199052 1205910)1003816100381610038161003816 minus 119863 (119905 minus Δ1199052 1205910)1003816100381610038161003816119863 (119905 minus Δ1199052 1205910)1003816100381610038161003816 (25)

312 Channel Spectral Parameter Extraction Thevoice of thesound source information and channel spectrum informa-tion extracted and then make adjustments to achieve voiceadjustment which is the previousmethodHowever since thetwo are often highly correlated they cannot be independentlymodified thus affecting the final result

The relationship among the voice signal s(t) the channelparameter v(t) and the sound source parameter p(t) is as

119904 (119905) = 119901 (119905) lowast V (119905) (26)

Since it is difficult to find v(t) directly the straightalgorithm calculates the frequency domain expression ofv(t) by short-time spectral analysis of s(t) The method tocalculate the short-term spectrum is (27) and (28)

119904119908 (119905 1199051015840) = 119904 (119905) 119908 (119905 1199051015840) (27)

119878119882(120596 1199051015840) = 119865119865119879 [119904119908 (119905 1199051015840)] = 119878 (120596 1199051015840)119882(120596 1199051015840) (28)

The short-term spectrum shows the periodicity relatedto the fundamental frequency in the time domain and the

frequency domain respectively The short-time spectrumwindow function used is (29) and (30)

119908 (119905) = 11198910 119890minus120587(1199051198910)2

(29)

119882(120596) = 1198910radic2120587119890minus120587(1205961205960)2 (30)

However since both the channel spectrum and the soundsource spectrum are related to the fundamental frequencyat this time it cannot be considered that they have beenseparated Instead they need to be further cyclically removedin the time domain and the frequency domain to achieve theseparation

Periodic removal of the time domain requires the designof pitch-sync smoothing windows and compensation win-dows respectively as (31) (32) and (33)

119908119901 (119905) = 119890minus120587(1199051205910) lowast ℎ( 1199051205910) (31)

ℎ (119905) = 1 minus |119905| (119905 lt 1)0 (119900119905ℎ119890119903119908119894119904119890) (32)

119908c (119905) = 119908119901 (119905) sin(120587 times 1199051205910) (33)

Then the short-time amplitude spectrum |119878p(120596 1199051015840)| and|119878p(120596 1199051015840)| respectively is obtained by the two windows andfinally we get the short-term amplitude spectrum with theperiodicity removed as

10038161003816100381610038161003816119878r (120596 1199051015840)10038161003816100381610038161003816 = radic1003816100381610038161003816119878r (120596 1199051015840)10038161003816100381610038162 + 120585 1003816100381610038161003816119878r (120596 1199051015840)10038161003816100381610038162 (34)

Among them 120585 is the mixing factor when 120585 is taking013655 there is the optimal solution

Similarly the frequency domain also needs smoothingwindows 119881(120596) and compensation windows 119880(120596) to removethe periodicity in the short-time spectral 119878119882(120596) domain andfinally remove the periodic spectral envelope 1198781198781015840(120596) as

1198781198781015840 (120596) = 119878119882 (120596) lowast 119881 (120596) lowast 119880 (120596) (35)

Finally the logarithmic amplitude compression and dis-tortion frequency discrete cosine transform the channel spec-tral parameters into MFCC parameters for the subsequentuse(MFCC is described in detail in Section 2)

32 GMM Achieve Parameter Conversion

321 GMMProfile TheGaussianMixtureModel (GMM) [6]can be expressed as a linear combination of differentGaussianprobability functions in

119875(119883120582 ) = 119872sum119894=1

120596119894119887119894 (119883) (36)

8 Advances in Multimedia

where X is a random vector of n dimensions 120596i is amixture weight sum119872

i=1 120596119894 = 1 119887119894(119883) is a subdistribution ofGMM and each subdistribution is a Gaussian distribution as

119887119894 (119883) = 1(2120587)1198992 1003816100381610038161003816sum119894100381610038161003816100381612 119890

minus(12)(119883minus120583119894)119879summinus1119894 (119883minus120583119894) (37)

where 120583i is the mean vector and sumi is the covariancematrix

Although the types of phonemes are definite eachphoneme varies in different situations due to the contextWe use GMM to construct the acoustic characteristics of thespeaker to find the most likely mapping at each time

322 Conversion Function to Establish GMM refers to theestimation of the probability density distribution of thesample and the estimated model (training model) is theweighted sum of several Gaussian models It maps matricesof source speech and target speech thereby increasing theaccuracy and robustness of the algorithm and completing theconnection between the two phonetics

(1)The Conversion of Fundamental Frequency Here the singleGaussian model method is used to convert the fundamen-tal frequency and the converted fundamental frequency isobtained through the mean and variance of the target person(120583119905119892119905 120590119905119892119905) and the speaker (120583src 120590src) is (38)

1198910119888119900119899V (119905) = radic 12059021199051198921199051205902119904119903119888 times 1198910119904119903119888 (119905) + 120583119905119892119905 minus radic 12059021199051198921199051205902119904119903119888 times 120583119904119903119888 (38)

(2) Channel Spectrum Conversion The modelrsquos mapping ruleis a linear regression function the purpose is to predictthe required output data by inputting data The spectrumconversion function is defined as

119865 (119883) = 119864119884119883 = int119884 lowast 119875(119884119883)119889119884= 119872sum119894=1

119875119894 (119883)[[120583119884119894 + 119884119883sum119894(119883119883sum

119894)minus1 (119883 minus 120583119883119894 )]]

(39)

119875i (119883) = 120596119894119887119894 (119883119905)sum119872119896=1 120596119896119887119896 (119883119905) (40)

120583i = [120583119883i120583119884i ]

sum119894

= [[[[[[

119883119883sum119894

119883119884sum119894

119884119883sum119894

119884119884sum119894

]]]]]]

119894 = 1 119872

(41)

120583119883i and 120583119884i are the mean of the i-th Gaussian componentof the source speaker and the target speaker sum119883119884

119894 is the

jiao Zhang Sheng yin cang zai qi pan zhi xia

0 2 3 4 5 6 71

045

040

035

030

025

020

015

010

005

0

minus005

Figure 7 ldquoJiao Zhang Sheng yin cang zai qi pan zhi xiardquo envelope

variancematrix of the i-th Gaussian component of the sourcespeaker sum119883119884

119894 is the covariance matrix of the ith Gaussiancomponent of the source speaker and the target speakercovariance matrix 119875i(119883) is the feature vector probability ofX belonging to the i-th Gaussian components of the GMM

4 Melody Control Model

The composition of Beijing Opera has similarities with thesynthesis of general singing voice [7 8] That is throughthe superimposition of voice and melody the new pitchof each word is reconstructed Through the analysis of thesecond chapter it is found that the major factors affecting themelody are the fundamental frequency duration and energyAmong them the fundamental frequency of melody has thegreatest impact it can indicate the frequency of human vocalvibration and duration and pronunciation of each word ofeach word length you can control the rhythm of BeijingOpera which represents the speed of human voice Energyand sound intensity were positively correlated representingthe emotions

41 The Fundamental Frequency Conversion Model Al-though both speech and Beijing Opera are issued throughthe same human organs the speech pays more attention tothe prose while the Beijing Opera emphasizes the emotionalexpression of the melody Most of the features in the melodyare in the fundamental frequencyThe fundamental envelopeof a Beijing Opera corresponds to themelody which includestone pitch and tremolo [9] However the pitch of a note in anote is a constant and their comparison is as in Figure 7

From this we can see that we can use the fundamentalfrequency to control the melody of a Beijing Opera butthe acoustic effects such as vibrato need to be consideredTherefore the control design of the fundamental frequency[10] is as in Figure 8

42 Time Control Model Each word in Chinese usually hasdifferent syllables and the initials and vowels in each syllable

Advances in Multimedia 9

The fundamental frequency extractedfrom MIDI

Vibratoprocessing

Gaussian white noise High pass filter

Output of basic frequency curve

Figure 8 The control design of the fundamental frequency

Table 2 Duration parameters

Duration parametersBefore modification After modification

dur a klowastdur adur b dur bdur c dur t ndash (klowastdur a) ndash dur b

dur a initial part duration dur b initial part to vowel transition partduration dur c final part duration and dur t target total duration

also play different roles The initials whether normal orBeijing Opera usually play a supporting role while vowelscarry pitch and most of the pitch information In order toensure the naturalness of Beijing Opera we use the noteduration to control the length of each word and make therules for the vowel length shown in Table 2

Initial part of the length of time is in accordance withthe proportion [11] (k in the table) to be modified k is alot of voice and song comparison experiments are obtainedThe duration of the area with the initials to vowels transitionremains unchangedThe length of the vowel section varies sothat the total duration of the syllable can correspond to theduration of each note in the score

The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here

43 Spectrum Control Model The vocal tract is a resonantcavity and the spectral envelope reflects its resonant proper-ties Studies have found good vibes singing the spectrum inthe vicinity of 25-3 kHz has a special resonance farm andsinging spectrum changes will directly affect the people ofthe partyrsquos results In order to synthesize music of high nat-uralness the spectral envelope of the speech signal is usuallycorrected according to the unique spectral characteristics ofthe singing voice

44 GAN Model

441 Introduction of GAN Network Generative adversarialnetworks abbreviated as GAN [12ndash17] are currently a hotresearch direction in artificial intelligenceThe GAN consistsof generators and discriminators The training process isinputting random noise obtaining pseudo data by the gener-ator taking a part of the real data from the true data mixingthe two and sending the data to the discriminator givinga true or false determination result and according to thisresult the return loss The purpose of GAN is to estimate thepotential distribution of data samples and generate new datasamples It is being extensively studied in the fields of image

and visual computing speech and language processing andhas a huge application prospect This study uses GAN tosynthesize music to compose Beijing Opera music

442 Selection of Test Datasets The Beijing Opera scoredataset that needs to be used in this study is the recording andcollection of 5000 Beijing Opera background music tracksThe dataset is processed as shown in the Figure 9 datasetpreparation and data preprocessing

First of all because sometimes some instruments haveonly a few notes in a piece of music this situation makes thedata too sparse and affects the training process Therefore itis necessary to solve this data imbalance problem by mergingthe sound tracks of similar instruments Each of the multi-track Beijing Opera scores is incorporated into five musicalinstruments huqins flutes suonas drums and cymbalsThese five types of instruments are the most commonly usedmusical instruments in Beijing Opera music

Then we will filter the datasets after the merged tracksto select the music with the best matching confidence Inaddition because the Beijing Opera arias need to be synthe-sized the scores in the part of the BeijingOperawithout lyricsare not what we need Also select the soundtracks of BeijingOpera lyrics

Finally in order to obtain a meaningful music segmentto train the time model it is necessary to divide the PekingOpera score and obtain the corresponding music segmentThink of the 4 bars as a passage and cut the longer passageinto the appropriate length Because pitches that are too highor too low are not common and are therefore less than C1 orhigher than C8 the target output tensor is 4 (bar) times 96 (timestep) times 84 (pitch) times 5 (track) This completes the preparationand preprocessing of the dataset

443 Training and Testing of GAN Structure and DatasetsThe GAN structure diagram used in this study is as inFigure 10

The basic framework of the GAN includes a pair ofmodels a generative model and a discriminative model Themain purpose is to generate pseudo data consistent withthe true data distribution by the discriminator D auxiliarygenerator G The input of the model is a random Gaussianwhite noise signal z the noise signal is mapped to a new dataspace via the generator G to generate the generated data G(z)Next a discriminator D outputs a probability value basedon the input of the true data x and the generated data G(z)respectively indicating that the D judges whether the inputis real data or the confidence of generating false data In thisway it is judged whether the performance of the G-generateddata is good or bad When the final D cannot distinguishbetween the real data x and the generated data G(z) thegenerator G is considered to be optimal

The goal of D is to distinguish between real data andfalse data so that D(x) is as large as possible while D(G(z))is as small as possible and the difference between thetwo is as large as possible whereas Grsquos goal is to makethe data it produces in D The goal of G is to make theperformance lsquoD(G(z))rsquo of its own data on D consistent with

10 Advances in Multimedia

Mergingaudio tracks

datasets

Beijing Operasoundtrack

Merge 5 tracksbull huqinsbull flutesbull suonasbull drumsbull cymbals

Beijing Operasoundtrack

Combineddatasets

Screeningmusic

Only select thefollowing sectionbull Best matching

confidencebull The soundtracks

of Beijing operalyrics

Datacleaning

Only select thefollowing sectionbull Best matching

confidencebull The soundtracks

of Beijing operalyrics

Training datasets

Figure 9 Illustration of the dataset preparation and data preprocessing procedure

z~p(z) G )

D realfake

G(z)

X

fake data

real data

random noise Generator

Discriminatorcritic

(wgan-gp)

4-bar phrases of 5 tracks

Figure 10 GAN structure diagram

the performance lsquoD(x)rsquo of the real data so that D cannotdistinguish between generated data and real data Thereforethe optimization process of the module is a process of mutualcompetition and confrontationThe performance of G and Dis continuously improved during repeated iteration until thefinal D(G(z)) is consistent with the performance D(x) of thereal data And both G and D cannot be further optimized

The training process can be modeled as a simpleMinMaxproblem in

minG

max119863

119863 (119909) minus 119863 (119866 (119911)) (42)

The MinMax optimization formula is defined as follows

minqG

max119902119863

119881 (119863119866) = min119866

max119863

119864119909119901119866 [log119863(119909)]+ 119864119909119901119866 [log (1 minus 119863 (119866 (119911)))] (43)

The GAN does not require a pre-set data distributionthat is it does not need to formulate a description ofp(x) but directly adopts it Theoretically it can completelyapproximate real data This is the biggest advantage of theGAN

The training and testing process of the GAN generatedmusic dataset is as in Figure 11

The generator-generated chord section data and specificmusic style data generator-generated multiple track chordsection data andmultiple tracks ofmusic groove data are sentto the GAN for training Reach music that generates specificstyles and corresponding grooves

5 Experiment

51 Tone Control Model

511 Experimental Process The voice library used in theexperiment simulation of this article is recorded by theformer in the environment of the entire anechoic roomand comprehensive consideration of the previous factorscan better meet the actual needs of the speech conversionsystemThe voice library is recorded by a woman in standardMandarin accent and contains numbers professional nounseveryday words etc as source speech Then find anotherperson to record a small number of statements as the voiceto be converted and Figure 12 is the tone conversion process

512 Experimental Results Figures 11 12 and 13 speech ofthe source speaker respectively and the target speaker isbased on the speech spectrogram STRAIGHT and convertspeechGMMmodel obtained All voices are sampled at 16khzand quantized with 16 bits Set the voice to 5s during theexperiment Their MFCCs are in Figures 13 14 and 15

They show the MFCC three-dimensional map of thesource speech the target speech and the converted speechThe horizontal axis represents the audio duration the verticalaxis represents the frequency and the color represents thecorresponding energy From the comparison of the graphsit can be directly seen that the vocalogram shape of theconverted MFCC parameters is closer to the target speechindicating that the converted speech features tend to be thetarget speech features


Figure 11: Training and testing process of the GAN (bar generators map noise vectors z to generated bars G(z), organized into chord, style, and groove tracks).

Figure 12: Tone control model (training phase: STRAIGHT analysis of the source and target voices into fundamental frequency F0 and spectral envelope, DTW time alignment, and GMM training to establish the mapping rules, with the single Gaussian model method computing the mean and variance for F0 conversion; conversion phase: STRAIGHT analysis of the voice to be converted, MFCC parameter conversion, conversion of the fundamental frequency F0, and STRAIGHT synthesis).
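As a concrete illustration of the "single Gaussian model method" block of Figure 12, the following sketch applies mean-variance F0 conversion; the statistics in the example are illustrative values, not measurements from the experiment.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Map source-speaker F0 values onto the target speaker's F0 statistics."""
    # Scale by the ratio of standard deviations, then shift to the target mean.
    return (sigma_tgt / sigma_src) * (np.asarray(f0_src) - mu_src) + mu_tgt

# Example with illustrative statistics in Hz; 0.0 marks an unvoiced frame.
f0_track = np.array([210.0, 230.0, 250.0, 0.0])
voiced = f0_track > 0
f0_track[voiced] = convert_f0(f0_track[voiced], mu_src=220.0, sigma_src=25.0,
                              mu_tgt=120.0, sigma_tgt=15.0)
print(f0_track)
```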

5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing, followed by conversions using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Operas produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis, dura_F0 uses the fundamental frequency control model together with the duration control model, dura_SP uses the duration control model together with the spectrum control model, and all_models uses the three control models simultaneously. "Real" is the source Beijing Opera.

The melody control model can thus be summarized as in Figure 16.

5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person; therefore, evaluating the performance of the speech conversion system is also based on human-oriented auditory evaluation.


Table 3: MOS grading.

Score   Evaluation
1       Uncomfortable and unbearable
2       There is a sense of discomfort, but it can be endured
3       Distortion can be detected and feels uncomfortable
4       Slight distortion perceived, but no discomfort
5       Good sound quality, no distortion

Table 4: Experimental results (MOS scores).

Method        Beijing Opera 1   Beijing Opera 2   Beijing Opera 3
Only_dura     1.25              1.29              1.02
dura_F0       1.85              1.97              1.74
dura_SP       1.78              2.90              2.44
all_models    3.27              3.69              3.28
Real          5                 5                 5

Figure 13: Source speech spectrogram (spectrogram frequency in Hz versus time in seconds).

In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides speech quality into 5 levels (see Table 3). The tester listens to the converted speech and, according to these 5 levels, gives the score of the quality level to which the measured speech belongs. A MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality: the speech has high intelligibility, but the naturalness is poor.

Ten testers were recruited to give MOS scores for the above synthesis results. The results are shown in Table 4.

5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies, where the melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is segmented by its characteristic features, such as the zero-crossing rate and short-time energy. The tone control model and the melody control model are then applied: the important parameters such as the fundamental frequency, spectrum, and duration are extracted and converted using MFCC, DTW, GMM, and related tools, and the converted features are finally synthesized into the opera fragment.
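A minimal sketch of the two frame-level features used here to locate word boundaries, the zero-crossing rate and the short-time energy, is given below; the frame length, hop size, and threshold are illustrative assumptions.

```python
# Sketch: frame-level zero-crossing rate and short-time energy for word segmentation.
import numpy as np

def frame_features(signal, frame_len=400, hop=160):   # 25 ms / 10 ms frames at 16 kHz
    zcr, energy = [], []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        # Zero-crossing rate: fraction of consecutive samples that change sign.
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        # Short-time energy: mean squared amplitude of the frame.
        energy.append(np.mean(frame ** 2))
    return np.array(zcr), np.array(energy)

# Example on a synthetic signal: one second of silence followed by a 200 Hz tone.
sr = 16000
sig = np.concatenate([np.zeros(sr), np.sin(2 * np.pi * 200 * np.arange(sr) / sr)])
zcr, energy = frame_features(sig)
voiced = energy > 0.1 * energy.max()   # crude energy threshold for segment onsets
print("first voiced frame:", int(np.argmax(voiced)))
```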

Compared with other algorithms, the straight algorithm performs better in terms of the naturalness of the synthesis and the range of parameter modification, so the straight algorithm is also selected for the synthesis of the Beijing Opera.

The above-mentioned 10 testers were again asked to give MOS scores for the synthesized result. The results are shown in Table 5.

According to the test results, the subjective scores reached an average of 3.7 points, indicating that the design basically accomplishes the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system approaches the original Beijing Opera, it is still acoustically different from the real Beijing Opera.
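The 3.7-point figure is the mean of the synthetic-fragment scores in Table 5:

$$\frac{4+4+4+3+3+4+3+4+4+4}{10} = \frac{37}{10} = 3.7$$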

6. Conclusion

In this work, we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.


Table 5: Rating results (MOS scores).

                          student1  student2  student3  student4  student5
Source Opera fragment     5         5         5         5         5
Synthetic Opera fragment  4         4         4         3         3

                          student6  student7  student8  student9  student10
Source Opera fragment     5         5         5         5         5
Synthetic Opera fragment  4         3         4         4         4

Figure 14: Target speech spectrogram (spectrogram frequency in Hz versus time in seconds).

Figure 15: Converted speech spectrogram (spectrogram frequency in Hz versus time in seconds).

Figure 16: Melody control model (feature extraction decomposes the voice into syllable fundamental frequency, spectrum envelope, and syllable duration; the note fundamental frequency and note length from MIDI drive the F0 control model, the time length control model, and the spectrum control model, which together synthesize the Beijing Opera).


Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [Zenodo] repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92–104, 2007.

[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63–80, 2013.

[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.

[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.

[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.

[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.

[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67–79, 2007.

[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435–438, April 1997.

[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288–3293, Kunming, China, July 2008.

[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.

[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, China, 2009.

[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.

[15] "Interpretable representation learning by information maximizing generative adversarial nets."

[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.

[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.




Page 7: Beijing Opera Synthesis Based on Straight Algorithm and ... · AdvancesinMultimedia Target voice (A tone A content) New target voice (A tone B content) Beijing Opera fragment Source

Advances in Multimedia 7

However in practice F0 always changes rapidly so inorder to reduce the impact on M the formula makes someadjustments as (20) (21) and (22)

119872 = minus log[intΩ(119889 |119863|119889119906 minus 120583119865119872)2 119889119906]

+ log [intΩ|119863|2 119889119906]

minus log[intΩ(119889arg |119863|119889119906 minus 120583119865119872)2 119889119906]

+ 2 log 1205910 + logΩ(1205910)

(20)

120583119860119872 = 1Ω intΩ(119889 |119863|119889119906 ) (21)

120583119865119872 = 1Ω intΩ(1198892 arg (D)1198891199062 ) (22)

Finally use 1205910 to calculate the instantaneous frequency120596(119905) and get the fundamental frequency F0 by (23) (24) and(25)

1198910 = 1205960 (119905)2120587 (23)

120596 (119905) = 2119891119904 arcsin 1003816100381610038161003816119910119889 (119905)10038161003816100381610038162 (24)

y119889 (119905) = 119863 (119905 + Δ1199052 1205910)1003816100381610038161003816119863 (119905 + Δ1199052 1205910)1003816100381610038161003816 minus 119863 (119905 minus Δ1199052 1205910)1003816100381610038161003816119863 (119905 minus Δ1199052 1205910)1003816100381610038161003816 (25)

312 Channel Spectral Parameter Extraction Thevoice of thesound source information and channel spectrum informa-tion extracted and then make adjustments to achieve voiceadjustment which is the previousmethodHowever since thetwo are often highly correlated they cannot be independentlymodified thus affecting the final result

The relationship among the voice signal s(t) the channelparameter v(t) and the sound source parameter p(t) is as

119904 (119905) = 119901 (119905) lowast V (119905) (26)

Since it is difficult to find v(t) directly the straightalgorithm calculates the frequency domain expression ofv(t) by short-time spectral analysis of s(t) The method tocalculate the short-term spectrum is (27) and (28)

119904119908 (119905 1199051015840) = 119904 (119905) 119908 (119905 1199051015840) (27)

119878119882(120596 1199051015840) = 119865119865119879 [119904119908 (119905 1199051015840)] = 119878 (120596 1199051015840)119882(120596 1199051015840) (28)

The short-term spectrum shows the periodicity relatedto the fundamental frequency in the time domain and the

frequency domain respectively The short-time spectrumwindow function used is (29) and (30)

119908 (119905) = 11198910 119890minus120587(1199051198910)2

(29)

119882(120596) = 1198910radic2120587119890minus120587(1205961205960)2 (30)

However since both the channel spectrum and the soundsource spectrum are related to the fundamental frequencyat this time it cannot be considered that they have beenseparated Instead they need to be further cyclically removedin the time domain and the frequency domain to achieve theseparation

Periodic removal of the time domain requires the designof pitch-sync smoothing windows and compensation win-dows respectively as (31) (32) and (33)

119908119901 (119905) = 119890minus120587(1199051205910) lowast ℎ( 1199051205910) (31)

ℎ (119905) = 1 minus |119905| (119905 lt 1)0 (119900119905ℎ119890119903119908119894119904119890) (32)

119908c (119905) = 119908119901 (119905) sin(120587 times 1199051205910) (33)

Then the short-time amplitude spectrum |119878p(120596 1199051015840)| and|119878p(120596 1199051015840)| respectively is obtained by the two windows andfinally we get the short-term amplitude spectrum with theperiodicity removed as

10038161003816100381610038161003816119878r (120596 1199051015840)10038161003816100381610038161003816 = radic1003816100381610038161003816119878r (120596 1199051015840)10038161003816100381610038162 + 120585 1003816100381610038161003816119878r (120596 1199051015840)10038161003816100381610038162 (34)

Among them 120585 is the mixing factor when 120585 is taking013655 there is the optimal solution

Similarly the frequency domain also needs smoothingwindows 119881(120596) and compensation windows 119880(120596) to removethe periodicity in the short-time spectral 119878119882(120596) domain andfinally remove the periodic spectral envelope 1198781198781015840(120596) as

1198781198781015840 (120596) = 119878119882 (120596) lowast 119881 (120596) lowast 119880 (120596) (35)

Finally the logarithmic amplitude compression and dis-tortion frequency discrete cosine transform the channel spec-tral parameters into MFCC parameters for the subsequentuse(MFCC is described in detail in Section 2)

32 GMM Achieve Parameter Conversion

321 GMMProfile TheGaussianMixtureModel (GMM) [6]can be expressed as a linear combination of differentGaussianprobability functions in

119875(119883120582 ) = 119872sum119894=1

120596119894119887119894 (119883) (36)

8 Advances in Multimedia

where X is a random vector of n dimensions 120596i is amixture weight sum119872

i=1 120596119894 = 1 119887119894(119883) is a subdistribution ofGMM and each subdistribution is a Gaussian distribution as

119887119894 (119883) = 1(2120587)1198992 1003816100381610038161003816sum119894100381610038161003816100381612 119890

minus(12)(119883minus120583119894)119879summinus1119894 (119883minus120583119894) (37)

where 120583i is the mean vector and sumi is the covariancematrix

Although the types of phonemes are definite eachphoneme varies in different situations due to the contextWe use GMM to construct the acoustic characteristics of thespeaker to find the most likely mapping at each time

322 Conversion Function to Establish GMM refers to theestimation of the probability density distribution of thesample and the estimated model (training model) is theweighted sum of several Gaussian models It maps matricesof source speech and target speech thereby increasing theaccuracy and robustness of the algorithm and completing theconnection between the two phonetics

(1)The Conversion of Fundamental Frequency Here the singleGaussian model method is used to convert the fundamen-tal frequency and the converted fundamental frequency isobtained through the mean and variance of the target person(120583119905119892119905 120590119905119892119905) and the speaker (120583src 120590src) is (38)

1198910119888119900119899V (119905) = radic 12059021199051198921199051205902119904119903119888 times 1198910119904119903119888 (119905) + 120583119905119892119905 minus radic 12059021199051198921199051205902119904119903119888 times 120583119904119903119888 (38)

(2) Channel Spectrum Conversion The modelrsquos mapping ruleis a linear regression function the purpose is to predictthe required output data by inputting data The spectrumconversion function is defined as

119865 (119883) = 119864119884119883 = int119884 lowast 119875(119884119883)119889119884= 119872sum119894=1

119875119894 (119883)[[120583119884119894 + 119884119883sum119894(119883119883sum

119894)minus1 (119883 minus 120583119883119894 )]]

(39)

119875i (119883) = 120596119894119887119894 (119883119905)sum119872119896=1 120596119896119887119896 (119883119905) (40)

120583i = [120583119883i120583119884i ]

sum119894

= [[[[[[

119883119883sum119894

119883119884sum119894

119884119883sum119894

119884119884sum119894

]]]]]]

119894 = 1 119872

(41)

120583119883i and 120583119884i are the mean of the i-th Gaussian componentof the source speaker and the target speaker sum119883119884

119894 is the

jiao Zhang Sheng yin cang zai qi pan zhi xia

0 2 3 4 5 6 71

045

040

035

030

025

020

015

010

005

0

minus005

Figure 7 ldquoJiao Zhang Sheng yin cang zai qi pan zhi xiardquo envelope

variancematrix of the i-th Gaussian component of the sourcespeaker sum119883119884

119894 is the covariance matrix of the ith Gaussiancomponent of the source speaker and the target speakercovariance matrix 119875i(119883) is the feature vector probability ofX belonging to the i-th Gaussian components of the GMM

4 Melody Control Model

The composition of Beijing Opera has similarities with thesynthesis of general singing voice [7 8] That is throughthe superimposition of voice and melody the new pitchof each word is reconstructed Through the analysis of thesecond chapter it is found that the major factors affecting themelody are the fundamental frequency duration and energyAmong them the fundamental frequency of melody has thegreatest impact it can indicate the frequency of human vocalvibration and duration and pronunciation of each word ofeach word length you can control the rhythm of BeijingOpera which represents the speed of human voice Energyand sound intensity were positively correlated representingthe emotions

41 The Fundamental Frequency Conversion Model Al-though both speech and Beijing Opera are issued throughthe same human organs the speech pays more attention tothe prose while the Beijing Opera emphasizes the emotionalexpression of the melody Most of the features in the melodyare in the fundamental frequencyThe fundamental envelopeof a Beijing Opera corresponds to themelody which includestone pitch and tremolo [9] However the pitch of a note in anote is a constant and their comparison is as in Figure 7

From this we can see that we can use the fundamentalfrequency to control the melody of a Beijing Opera butthe acoustic effects such as vibrato need to be consideredTherefore the control design of the fundamental frequency[10] is as in Figure 8

42 Time Control Model Each word in Chinese usually hasdifferent syllables and the initials and vowels in each syllable

Advances in Multimedia 9

The fundamental frequency extractedfrom MIDI

Vibratoprocessing

Gaussian white noise High pass filter

Output of basic frequency curve

Figure 8 The control design of the fundamental frequency

Table 2 Duration parameters

Duration parametersBefore modification After modification

dur a klowastdur adur b dur bdur c dur t ndash (klowastdur a) ndash dur b

dur a initial part duration dur b initial part to vowel transition partduration dur c final part duration and dur t target total duration

also play different roles The initials whether normal orBeijing Opera usually play a supporting role while vowelscarry pitch and most of the pitch information In order toensure the naturalness of Beijing Opera we use the noteduration to control the length of each word and make therules for the vowel length shown in Table 2

Initial part of the length of time is in accordance withthe proportion [11] (k in the table) to be modified k is alot of voice and song comparison experiments are obtainedThe duration of the area with the initials to vowels transitionremains unchangedThe length of the vowel section varies sothat the total duration of the syllable can correspond to theduration of each note in the score

The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here

43 Spectrum Control Model The vocal tract is a resonantcavity and the spectral envelope reflects its resonant proper-ties Studies have found good vibes singing the spectrum inthe vicinity of 25-3 kHz has a special resonance farm andsinging spectrum changes will directly affect the people ofthe partyrsquos results In order to synthesize music of high nat-uralness the spectral envelope of the speech signal is usuallycorrected according to the unique spectral characteristics ofthe singing voice

44 GAN Model

441 Introduction of GAN Network Generative adversarialnetworks abbreviated as GAN [12ndash17] are currently a hotresearch direction in artificial intelligenceThe GAN consistsof generators and discriminators The training process isinputting random noise obtaining pseudo data by the gener-ator taking a part of the real data from the true data mixingthe two and sending the data to the discriminator givinga true or false determination result and according to thisresult the return loss The purpose of GAN is to estimate thepotential distribution of data samples and generate new datasamples It is being extensively studied in the fields of image

and visual computing speech and language processing andhas a huge application prospect This study uses GAN tosynthesize music to compose Beijing Opera music

442 Selection of Test Datasets The Beijing Opera scoredataset that needs to be used in this study is the recording andcollection of 5000 Beijing Opera background music tracksThe dataset is processed as shown in the Figure 9 datasetpreparation and data preprocessing

First of all because sometimes some instruments haveonly a few notes in a piece of music this situation makes thedata too sparse and affects the training process Therefore itis necessary to solve this data imbalance problem by mergingthe sound tracks of similar instruments Each of the multi-track Beijing Opera scores is incorporated into five musicalinstruments huqins flutes suonas drums and cymbalsThese five types of instruments are the most commonly usedmusical instruments in Beijing Opera music

Then the datasets with the merged tracks are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the parts of the Beijing Opera scores without lyrics are not what we need, so only the soundtracks that carry Beijing Opera lyrics are selected.

Finally, in order to obtain meaningful music segments to train the time model, it is necessary to divide the Beijing Opera scores and obtain the corresponding music segments. Four bars are treated as one passage, and longer passages are cut into this length. Because pitches that are too high or too low are not common, notes lower than C1 or higher than C8 are discarded, and the target output tensor is 4 (bar) x 96 (time step) x 84 (pitch) x 5 (track). This completes the preparation and preprocessing of the dataset.
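A sketch of this final segmentation step is given below. It assumes each score has already been rendered as a binary piano-roll of shape (time_steps, 128 MIDI pitches, 5 tracks) with 96 time steps per bar; these conventions, and the MIDI numbers used for C1 and C8, are our assumptions rather than details stated in the paper:

```python
import numpy as np

STEPS_PER_BAR, BARS_PER_PHRASE, TRACKS = 96, 4, 5
C1, C8 = 24, 108                    # keep 108 - 24 = 84 pitch rows

def to_phrases(pianoroll):
    """Cut a (T, 128, 5) piano-roll into 4-bar phrases shaped
    (n_phrases, 4, 96, 84, 5), matching the 4 x 96 x 84 x 5 target tensor."""
    steps = BARS_PER_PHRASE * STEPS_PER_BAR
    usable = (pianoroll.shape[0] // steps) * steps
    clipped = pianoroll[:usable, C1:C8, :]              # drop rare pitches
    return clipped.reshape(-1, BARS_PER_PHRASE, STEPS_PER_BAR, 84, TRACKS)

# Example with random data standing in for one score.
demo = np.random.rand(1000, 128, 5) > 0.97
print(to_phrases(demo).shape)       # (2, 4, 96, 84, 5)
```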

4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is as in Figure 10.

The basic framework of the GAN includes a pair of models: a generative model and a discriminative model. The main purpose is for the generator G, guided by the discriminator D, to produce pseudo data consistent with the true data distribution. The input of the model is a random Gaussian white noise signal z; the noise signal is mapped to a new data space via the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the inputs of the true data x and the generated data G(z), respectively, indicating D's confidence that the input is real data rather than generated false data. In this way it is judged whether the data generated by G are good or bad. When D finally cannot distinguish between the real data x and the generated data G(z), the generator G is considered optimal.

The goal of D is to distinguish between real data and false data, so that D(x) is as large as possible while D(G(z)) is as small as possible, making the difference between the two as large as possible.


Figure 9: Illustration of the dataset preparation and data preprocessing procedure. The Beijing Opera soundtracks are first merged into five tracks (huqins, flutes, suonas, drums, and cymbals) to form the combined datasets; music screening and data cleaning then keep only the sections with the best matching confidence and the soundtracks that carry Beijing Opera lyrics, yielding the training datasets.

Figure 10: GAN structure diagram. Random noise z ~ p(z) is fed to the generator G, which outputs fake data G(z) in the form of 4-bar phrases of 5 tracks; together with real data x, these are passed to the discriminator/critic (WGAN-GP), which outputs a real/fake judgement.

The goal of G, conversely, is to make the performance D(G(z)) of its own data on D consistent with the performance D(x) of the real data, so that D cannot distinguish between generated data and real data. Therefore, the optimization process is one of mutual competition and confrontation. The performance of G and D is continuously improved through repeated iterations until D(G(z)) is finally consistent with the performance D(x) of the real data, and neither G nor D can be further optimized.

The training process can be modeled as the simple MinMax problem in (42):

\min_G \max_D \; [D(x) - D(G(z))] \qquad (42)

The MinMax optimization formula is defined as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (43)

The GAN does not require a preset data distribution; that is, it does not need to formulate an explicit description of p(x) but learns it directly from samples. In theory it can approximate the real data distribution arbitrarily well, which is the biggest advantage of the GAN.
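To make the MinMax objective of (42) concrete, here is a minimal PyTorch sketch of one discriminator/generator update in the WGAN style suggested by Figure 10. The network shapes, learning rates, and the omission of the gradient penalty are our simplifications, not the settings used in the paper:

```python
import torch
import torch.nn as nn

Z_DIM, DATA_DIM = 64, 96 * 84      # one 96 x 84 bar, flattened for brevity

G = nn.Sequential(nn.Linear(Z_DIM, 512), nn.ReLU(), nn.Linear(512, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM, 512), nn.ReLU(), nn.Linear(512, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(real_batch):
    """One update of each player: D maximizes D(x) - D(G(z)) as in (42),
    while G tries to raise D(G(z)).
    real_batch: float tensor of shape (batch, 96 * 84)."""
    z = torch.randn(real_batch.size(0), Z_DIM)

    # Discriminator/critic step: maximize D(x) - D(G(z)).
    d_loss = -(D(real_batch).mean() - D(G(z).detach()).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: maximize D(G(z)).
    g_loss = -D(G(z)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```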

The training and testing process of the GAN-generated music dataset is as in Figure 11.

The generator-generated chord section data and specific music-style data, together with generator-generated multitrack chord section data and multitrack music groove data, are sent to the GAN for training, so that the network learns to generate music with specific styles and the corresponding grooves.

5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in an anechoic room; with comprehensive consideration of the factors discussed above, it can better meet the actual needs of the speech conversion system. The voice library is recorded by a woman with a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., as the source speech. Another person then records a small number of statements as the voice to be converted. Figure 12 shows the tone conversion process.
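The process in Figure 12 can be outlined as the skeleton below. It is only a hedged sketch: straight_analysis and straight_synthesis stand in for a STRAIGHT-like vocoder interface that we assume to exist, and the mapping step is reduced to fitting a joint GMM on DTW-aligned MFCC frames with scikit-learn; the real system applies the full conversion function described in Section 3:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def align_and_fit_gmm(src_mfcc, tgt_mfcc, n_mix=8):
    """Training phase: DTW-align source/target MFCC frames (features are
    column vectors) and fit a joint GMM on the stacked [x; y] pairs."""
    _, wp = librosa.sequence.dtw(src_mfcc, tgt_mfcc)     # warping path (i, j)
    joint = np.array([np.concatenate([src_mfcc[:, i], tgt_mfcc[:, j]])
                      for i, j in wp])
    return GaussianMixture(n_components=n_mix, covariance_type="full").fit(joint)

def convert_f0(f0_src, mu_src, sd_src, mu_tgt, sd_tgt):
    """Single-Gaussian F0 conversion: shift and scale voiced frames so the
    converted contour has the target speaker's mean and variance."""
    out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    out[voiced] = (sd_tgt / sd_src) * (f0_src[voiced] - mu_src) + mu_tgt
    return out

# Conversion phase (pseudostructure; the vocoder calls are assumed names):
#   f0, sp = straight_analysis(wav)
#   f0_c   = convert_f0(f0, mu_s, sd_s, mu_t, sd_t)
#   sp_c   = apply_gmm_mapping(gmm, sp)     # regression from the joint GMM
#   wav_c  = straight_synthesis(f0_c, sp_c)
```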

5.1.2. Experimental Results. Figures 13, 14, and 15 show, as spectrograms, the speech of the source speaker, the speech of the target speaker, and the speech converted with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and each voice is set to 5 s during the experiment. Their MFCCs are shown in Figures 13, 14, and 15.

They show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs it can be seen directly that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
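For reference, plots of this kind can be produced with librosa; the file name below is a placeholder, since the recordings themselves are not distributed with the text:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("source_speech.wav", sr=16000, duration=5.0)  # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

fig, ax = plt.subplots()
img = librosa.display.specshow(mfcc, x_axis="time", sr=sr, ax=ax)
fig.colorbar(img, ax=ax)                    # color encodes coefficient energy
ax.set(title="MFCCs of the source speech")
plt.show()
```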


Figure 11: Training and testing process of the GAN. Noise vectors z for chords, style, and groove are mapped by the bar generator G to per-bar, per-track outputs G(z).

Figure 12: Tone control model. In the training phase, the source voice and the target voice are analyzed with STRAIGHT to obtain the fundamental frequency F0 and the spectral envelope; the MFCC parameters are time-aligned with DTW, GMM training establishes the mapping rules, and the single Gaussian model method calculates the mean and variance of F0. In the conversion phase, the voice to be converted is analyzed with STRAIGHT, its MFCC parameters and fundamental frequency are converted, and STRAIGHT synthesis produces the tone-converted voice.

5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and then converted using Only_dura, dura_F0, dura_SP, and all_models, and the Beijing Opera produced by the four synthesis methods was compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis, dura_F0 uses the fundamental frequency control model and the duration control model, dura_SP uses the duration control model and the spectrum control model, and all_models uses the three control models simultaneously. 'Real' is the source Beijing Opera.

The melody control model can thus be summarized in Figure 16.

5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person.


Table 3: MOS grading.

Score    Evaluation
1        Uncomfortable and unbearable
2        There is a sense of discomfort, but it can be endured
3        Distortion can be detected and feels uncomfortable
4        Distortion is slightly perceived, but there is no discomfort
5        Good sound quality, no distortion

Table 4: Experimental results (MOS scores).

Method        Beijing Opera 1    Beijing Opera 2    Beijing Opera 3
Only_dura     1.25               1.29               1.02
dura_F0       1.85               1.97               1.74
dura_SP       1.78               2.90               2.44
all_models    3.27               3.69               3.28
real          5                  5                  5

Figure 13: Source speech spectrogram (spectrogram frequency in Hz versus time in s).

Therefore, evaluating the performance of the speech conversion system is also based on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and assigns the score of the quality level to which the measured speech belongs according to these 5 levels. A MOS score of about 3.5 is called communication quality: the quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality: the speech has high intelligibility, but the naturalness is poor.

Ten testers were recruited to give MOS scores for the above synthesis results. The results are shown in Table 4.

5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies; the melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is distinguished by its unique characteristics, such as the zero-crossing rate and energy. Then the tone control model and the melody control model are designed, the important parameters such as the fundamental frequency, spectrum, and duration are extracted, tools such as MFCC, DTW, and GMM are used to analyze and convert the extracted features, and finally the synthetic opera fragments are obtained.

Compared with other algorithms, the straight algorithm performs better in terms of the naturalness of the synthesis and the range of parameter modification, so the straight algorithm is also selected for the synthesis of the Beijing Opera.

The above-mentioned 10 testers were again asked to give MOS scores for this synthesis result. The result is shown in Table 5.

According to the test results, the subjective scores reached an average of 3.7 points, indicating that the design basically completes the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the source Beijing Opera, it is still acoustically different from the real Beijing Opera.

6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric of the quality of a piece.


Table 5: Rating results (MOS scores).

                            student1  student2  student3  student4  student5
Source Opera fragment       5         5         5         5         5
Synthetic Opera fragment    4         4         4         3         3

                            student6  student7  student8  student9  student10
Source Opera fragment       5         5         5         5         5
Synthetic Opera fragment    4         3         4         4         4
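The 3.7-point average quoted in Section 5.3 follows directly from the ten synthetic-fragment scores in Table 5:

```python
synthetic_scores = [4, 4, 4, 3, 3, 4, 3, 4, 4, 4]     # Table 5, students 1-10
print(sum(synthetic_scores) / len(synthetic_scores))  # 3.7
```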

Figure 14: Target speech spectrogram (spectrogram frequency in Hz versus time in s).

Figure 15: Converted speech spectrogram (spectrogram frequency in Hz versus time in s).

Figure 16: Melody control model. Feature extraction on the voice yields the syllable fundamental frequency, the spectral envelope, and the syllable duration, while MIDI provides the note fundamental frequency and the note length; these feed the F0 control model, the time length control model, and the spectrum control model, which together synthesize the Beijing Opera.

Only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.

Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [Zenodo] repository [http://doi.org/10.5281/zenodo.344932]. The previously reported straight algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "A singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen et al., "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.





Page 9: Beijing Opera Synthesis Based on Straight Algorithm and ... · AdvancesinMultimedia Target voice (A tone A content) New target voice (A tone B content) Beijing Opera fragment Source

Advances in Multimedia 9

The fundamental frequency extractedfrom MIDI

Vibratoprocessing

Gaussian white noise High pass filter

Output of basic frequency curve

Figure 8 The control design of the fundamental frequency

Table 2 Duration parameters

Duration parametersBefore modification After modification

dur a klowastdur adur b dur bdur c dur t ndash (klowastdur a) ndash dur b

dur a initial part duration dur b initial part to vowel transition partduration dur c final part duration and dur t target total duration

also play different roles The initials whether normal orBeijing Opera usually play a supporting role while vowelscarry pitch and most of the pitch information In order toensure the naturalness of Beijing Opera we use the noteduration to control the length of each word and make therules for the vowel length shown in Table 2

Initial part of the length of time is in accordance withthe proportion [11] (k in the table) to be modified k is alot of voice and song comparison experiments are obtainedThe duration of the area with the initials to vowels transitionremains unchangedThe length of the vowel section varies sothat the total duration of the syllable can correspond to theduration of each note in the score

The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here

43 Spectrum Control Model The vocal tract is a resonantcavity and the spectral envelope reflects its resonant proper-ties Studies have found good vibes singing the spectrum inthe vicinity of 25-3 kHz has a special resonance farm andsinging spectrum changes will directly affect the people ofthe partyrsquos results In order to synthesize music of high nat-uralness the spectral envelope of the speech signal is usuallycorrected according to the unique spectral characteristics ofthe singing voice

44 GAN Model

441 Introduction of GAN Network Generative adversarialnetworks abbreviated as GAN [12ndash17] are currently a hotresearch direction in artificial intelligenceThe GAN consistsof generators and discriminators The training process isinputting random noise obtaining pseudo data by the gener-ator taking a part of the real data from the true data mixingthe two and sending the data to the discriminator givinga true or false determination result and according to thisresult the return loss The purpose of GAN is to estimate thepotential distribution of data samples and generate new datasamples It is being extensively studied in the fields of image

and visual computing speech and language processing andhas a huge application prospect This study uses GAN tosynthesize music to compose Beijing Opera music

442 Selection of Test Datasets The Beijing Opera scoredataset that needs to be used in this study is the recording andcollection of 5000 Beijing Opera background music tracksThe dataset is processed as shown in the Figure 9 datasetpreparation and data preprocessing

First of all because sometimes some instruments haveonly a few notes in a piece of music this situation makes thedata too sparse and affects the training process Therefore itis necessary to solve this data imbalance problem by mergingthe sound tracks of similar instruments Each of the multi-track Beijing Opera scores is incorporated into five musicalinstruments huqins flutes suonas drums and cymbalsThese five types of instruments are the most commonly usedmusical instruments in Beijing Opera music

Then we will filter the datasets after the merged tracksto select the music with the best matching confidence Inaddition because the Beijing Opera arias need to be synthe-sized the scores in the part of the BeijingOperawithout lyricsare not what we need Also select the soundtracks of BeijingOpera lyrics

Finally in order to obtain a meaningful music segmentto train the time model it is necessary to divide the PekingOpera score and obtain the corresponding music segmentThink of the 4 bars as a passage and cut the longer passageinto the appropriate length Because pitches that are too highor too low are not common and are therefore less than C1 orhigher than C8 the target output tensor is 4 (bar) times 96 (timestep) times 84 (pitch) times 5 (track) This completes the preparationand preprocessing of the dataset

443 Training and Testing of GAN Structure and DatasetsThe GAN structure diagram used in this study is as inFigure 10

The basic framework of the GAN includes a pair ofmodels a generative model and a discriminative model Themain purpose is to generate pseudo data consistent withthe true data distribution by the discriminator D auxiliarygenerator G The input of the model is a random Gaussianwhite noise signal z the noise signal is mapped to a new dataspace via the generator G to generate the generated data G(z)Next a discriminator D outputs a probability value basedon the input of the true data x and the generated data G(z)respectively indicating that the D judges whether the inputis real data or the confidence of generating false data In thisway it is judged whether the performance of the G-generateddata is good or bad When the final D cannot distinguishbetween the real data x and the generated data G(z) thegenerator G is considered to be optimal

The goal of D is to distinguish between real data andfalse data so that D(x) is as large as possible while D(G(z))is as small as possible and the difference between thetwo is as large as possible whereas Grsquos goal is to makethe data it produces in D The goal of G is to make theperformance lsquoD(G(z))rsquo of its own data on D consistent with

10 Advances in Multimedia

Mergingaudio tracks

datasets

Beijing Operasoundtrack

Merge 5 tracksbull huqinsbull flutesbull suonasbull drumsbull cymbals

Beijing Operasoundtrack

Combineddatasets

Screeningmusic

Only select thefollowing sectionbull Best matching

confidencebull The soundtracks

of Beijing operalyrics

Datacleaning

Only select thefollowing sectionbull Best matching

confidencebull The soundtracks

of Beijing operalyrics

Training datasets

Figure 9 Illustration of the dataset preparation and data preprocessing procedure

z~p(z) G )

D realfake

G(z)

X

fake data

real data

random noise Generator

Discriminatorcritic

(wgan-gp)

4-bar phrases of 5 tracks

Figure 10 GAN structure diagram

the performance lsquoD(x)rsquo of the real data so that D cannotdistinguish between generated data and real data Thereforethe optimization process of the module is a process of mutualcompetition and confrontationThe performance of G and Dis continuously improved during repeated iteration until thefinal D(G(z)) is consistent with the performance D(x) of thereal data And both G and D cannot be further optimized

The training process can be modeled as a simpleMinMaxproblem in

minG

max119863

119863 (119909) minus 119863 (119866 (119911)) (42)

The MinMax optimization formula is defined as follows

minqG

max119902119863

119881 (119863119866) = min119866

max119863

119864119909119901119866 [log119863(119909)]+ 119864119909119901119866 [log (1 minus 119863 (119866 (119911)))] (43)

The GAN does not require a pre-set data distributionthat is it does not need to formulate a description ofp(x) but directly adopts it Theoretically it can completelyapproximate real data This is the biggest advantage of theGAN

The training and testing process of the GAN generatedmusic dataset is as in Figure 11

The generator-generated chord section data and specificmusic style data generator-generated multiple track chordsection data andmultiple tracks ofmusic groove data are sentto the GAN for training Reach music that generates specificstyles and corresponding grooves

5 Experiment

51 Tone Control Model

511 Experimental Process The voice library used in theexperiment simulation of this article is recorded by theformer in the environment of the entire anechoic roomand comprehensive consideration of the previous factorscan better meet the actual needs of the speech conversionsystemThe voice library is recorded by a woman in standardMandarin accent and contains numbers professional nounseveryday words etc as source speech Then find anotherperson to record a small number of statements as the voiceto be converted and Figure 12 is the tone conversion process

512 Experimental Results Figures 11 12 and 13 speech ofthe source speaker respectively and the target speaker isbased on the speech spectrogram STRAIGHT and convertspeechGMMmodel obtained All voices are sampled at 16khzand quantized with 16 bits Set the voice to 5s during theexperiment Their MFCCs are in Figures 13 14 and 15

They show the MFCC three-dimensional map of thesource speech the target speech and the converted speechThe horizontal axis represents the audio duration the verticalaxis represents the frequency and the color represents thecorresponding energy From the comparison of the graphsit can be directly seen that the vocalogram shape of theconverted MFCC parameters is closer to the target speechindicating that the converted speech features tend to be thetarget speech features

Advances in Multimedia 11

z

z

G

Gz GzGz Gz

zzzz

z

z

Gz

z

Gz

z

Gz

z

G

Bar Generator

Chords

Style

Chords

Groove

Figure 11 Raining and testing process of the GAN

fundamentalfrequency F0

Spectralenvelope

Source voice

STRAIGHT analysis

fundamentalfrequency F0

Spectralenvelope

TimeAlign-ment

GMM training to establishmapping rules

Single Gaussianmodel method

Calculate the meanand variance

Tone conversion

STRAIGHT synthesis

ConvertedMFCC

Convertedfundamentalfrequency F0

DTW

Source voice

STRAIGHT analysis

MFCC parameter conversion

To be converted voices

STRAIGHT analysis

MFCC parameter conversion

Conversionphase

Training phase

fundamentalfrequency F0

Spectralenvelope

Figure 12 Tone control model

52 Melody Control Model

521 Experimental Process In order to evaluate the qual-ity of the melody conversion results three Beijing Operapieces were selected for testing followed by conversionsusing Only dura dura F0 dura SP and all models andBeijing operas produced by the four synthesis methods werecompared with the original Beijing Opera Among themOnly dura uses only the duration controlmodel for synthesisdura F0 uses only the base frequency control model and theduration control model for synthesis dura SP uses only the

duration control model and the spectrum control model forsynthesis allmodels use three controlmodels simultaneouslyrsquoRealrsquo is the source Beijing Opera

So the melody control model can be summarized inFigure 16

522 Experimental Results The purpose of speech con-version is to make the converted speech sounds like thespeech of a specific target person Therefore evaluating theperformance of the speech conversion system is also based

12 Advances in Multimedia

Table 3 MOS grading

MOS gradingScore MOS Evaluation1 Uncomfortable and unbearable2 There is a sense of discomfort but it can endure3 Can detect distortion and feel uncomfortable4 Slightly perceived distortion but no discomfort5 Good sound quality no distortion

Table 4 Experimental results

Experimental results

ways MOS fractionBeiJing Opera1 BeiJing Opera2 BeiJing Opera3

Only dura 125 129 102dura F0 185 197 174dura SP 178 290 244all models 327 369 328real 5 5 5

0100020003000400050006000

Spec

trogr

amfre

quen

cy (H

Z)

05 302010 15 25 4035

Figure 13 Source speech spectrogram

on human-oriented auditory evaluation In the existing sub-jective evaluation system the MOS score test is an effectivemethod for evaluating the voice quality and the similaritytest is a test method for judging the conversion effect of thesystem

TheMOS scoring criterion divides the speech quality into5 levels see Table 3The tester listens to the converted speechand gives the score of the quality level to which the measuredspeech belongs according to these 5 levels The MOS scoreis called the communication quality at about 35 minutesAt this time the voice quality of the auditory reconstructedvoice is reduced but it does not prevent people from talkingnormally If the MOS score is lower than 30 it is calledsynthetic speech quality At this time the speech has highintelligibility but the naturalness is poor

Find 10 testers and score MOS for the above compositeresults The results are shown in Table 4

53 Synthesis of Beijing Opera Beijing Opera is mainlycomposed of words and melodies The melody is determinedby the pitch tone sound intensity sound length and otherdecisions In this experiment each word is distinguished bythe unique characteristics of words such as zero-crossing rateand energy Then the tone control model and the melody

control model are designed and do extraction for importantparameters of the fundamental frequency spectrum timeand so on using MFCC DTW GMM and other tools toanalyze the extracted characteristic conversion and finally tothe opera synthetic fragment

Compared with other algorithms the straight algorithmhas better performance in terms of the natural degree ofsynthesis and the range of parameter modification so thestraight algorithm is also selected for the synthesis of theBeijing Opera

Again let the above-mentioned 10 testers perform MOSscoring on the above composite effect The result is shown inTable 5

According to the test results it can be seen that thesubjective test results reached an average of 37 pointsindicating that the design basically completed the BeijingOpera synthesis Although the Beijing Opera obtained by thesynthesis system tends to originate in Beijing Opera it is stillacoustically different from the real Beijing Opera

6 Conclusion

In this work we have presented three novel generativemodelsfor Beijing Opera synthesis under the frame work of thestraight algorithm GMM and GAN The objective metricsand the subjective user study show that the proposed modelscan achieve the synthesis of Beijing Opera Given the recententhusiasm in machine learning inspired art we hope tocontinue our work by introducing more complex models anddata representations that effectively capture the underlyingmelodic structure Furthermorewe feel thatmorework couldbe done in developing a better evaluationmetric of the qualityof a piece only then will we be able to train models that are

Advances in Multimedia 13

Table 5 Rating results

MOS Score

score studentsstudent1 student2 student3 student4 student5

Source Opera fragment 5 5 5 5 5Synthetic Opera fragment 4 4 4 3 3

score studentsstudent6 student7 student8 student9 student10

Source Opera fragment 5 5 5 5 5Synthetic Opera fragment 4 3 4 4 4

0

100020003000400050006000

Spec

trogr

amfre

quen

cy (H

Z)

05 302010 15 25 4035 45

Figure 14 Target speech spectrogram

0

100020003000400050006000

Spec

trogr

amfre

quen

cy (H

Z)

3020 4515 25 403505 10

Figure 15 Converted speech spectrogram

Syllable fundamentalfrequency

Spectrum Envelope

Time length control modelvoice Feature extraction

Syllable duration

Note fundamental frequency

Length of note

Spectrum control model

Time length control model

F0 control model

synthesisBeijing Opera

MIDI

Figure 16 Melody control model

truly able to compose the Beijing Opera singing art workswith higher quality

Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [zenodo] repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work is sponsored by (1) the NSFC Key Funding (no. 61631016); (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" (no. 3132017XNG1750); and (3) the School Project Funding (no. 2018XNG1857).

References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92–104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63–80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] S. Hasim et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67–79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435–438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288–3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Proceedings of Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] "Interpretable representation learning by information maximizing generative adversarial nets."
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.


