Research Article: Beijing Opera Synthesis Based on Straight Algorithm and Deep Learning
XueTing Wang¹, Cong Jin², and Wei Zhao¹

¹College of Science and Technology, Communication University of China, Beijing, China
²Key Laboratory of Media Audio & Video, Communication University of China, Beijing, China

Correspondence should be addressed to Cong Jin; jincong0623@cuc.edu.cn

Received 3 April 2018; Accepted 20 May 2018; Published 17 July 2018

Academic Editor: Yong Luo

Copyright © 2018 XueTing Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Speech synthesis is an important research topic in the field of human-computer interaction and has a wide range of applications; as one of its branches, singing synthesis plays an important role. Beijing Opera is a famous traditional Chinese opera, known as the quintessence of Chinese culture. The singing of Beijing Opera carries some features of speech, but it has its own unique pronunciation rules and rhythms, which differ from ordinary speech and singing. In this paper we propose three models for the synthesis of Beijing Opera. First, the speech signals of the source speaker and the target speaker are analyzed with the straight algorithm to extract their features. Then, by training a GMM, we complete the tone control model, which takes the voice to be converted as input and outputs the converted voice. Finally, by modeling the fundamental frequency, duration, and energy separately, a melody control model is constructed using a GAN to realize the synthesis of Beijing Opera fragments. We connect the fragments and superimpose the background music to achieve the synthesis of Beijing Opera. The experimental results show that the synthesized Beijing Opera has reasonable audibility and can basically complete the composition of Beijing Opera. We also extend our models to human-AI cooperative music generation: given a target voice of a human, we can generate a Beijing Opera piece sung by that new target voice.
1 Introduction
With the development of the times and the continuous innovation of science and technology, the demand on speech synthesis [1] is no longer simply to speak, but to accomplish special vocal forms such as singing and poetry. It is undoubtedly ingenious and novel to apply the method of singing synthesis [2] to Beijing Opera. Known as the quintessence of Chinese culture, Beijing Opera is one of the most famous traditional operas in China, and since its birth at the end of the 18th century it has been favored by the Chinese people and the people of other countries in East Asia. Beijing Opera has a long history and rich cultural connotations. In addition to the exquisite stage performances and vivid story plots, the music and singing of Beijing Opera are of great artistic value. In particular, its unique style of singing shows the extraordinary creativity of the Chinese nation, being the embodiment of the traditional artists' superb skills. It therefore makes sense to use the straight algorithm, GMM, and GAN to synthesize Beijing Opera.
The synthesis of Beijing Opera consists of three steps, as shown in Figure 1. First is voice conversion using the straight algorithm; then the synthesis of Beijing Opera fragments is achieved through the tone control model and the melody control model; finally, we connect the fragments and superimpose the background music to achieve the synthesis of Beijing Opera.
2 Synthesis of Beijing Opera with Straight Algorithm
2.1 Phoneme
Hindawi, Advances in Multimedia, Volume 2018, Article ID 5158164, 14 pages. https://doi.org/10.1155/2018/5158164

[Figure 1: Beijing Opera synthesis. Flow: target voice (A tone, A content) and source voice (B tone, B content) → tone control model → new target voice (A tone, B content) → melody control model → Beijing Opera fragment → fragment connection with superimposed background music → Beijing Opera.]

[Figure 2: Time-domain waveforms, energy graphs, and zero-crossing rate graphs.]

2.1.1 Phoneme Profile. The phoneme is the smallest unit of speech, the smallest piece of speech that constitutes a syllable, and the smallest linear speech unit divided from the perspective of sound quality. From the acoustic point of view, phonemes are the smallest units of speech divided by sound quality; from the physiological point of view, one articulatory movement forms one phoneme. Phonemes are divided into two categories, vowels and consonants; the classification is based on whether the airflow is obstructed by the vocal organs when the sound is produced. The unobstructed sounds are called vowels and the obstructed ones are called consonants.
2.1.2 Phoneme Segmentation. Because the same phonemes have the same characteristics, while different phonemes and their combinations have different characteristics, we can segment each phoneme. The time-domain waveforms, energy graphs, and zero-crossing rate graphs of "jiao Zhang Sheng yin cang zai qi pan zhi xia" from "Matchmaker's Wall", sung in Beijing Opera style, are shown in Figure 2. From them we can see that the waveforms of the initial consonant phonemes are more irregular, while the vowels that follow them have periodic waveforms. The former have a large zero-crossing rate and low energy; for the latter, the energy is mostly larger. In addition, where silence appears, both are small (the red lines mark the beginning and end of a word).
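These energy and zero-crossing-rate cues are straightforward to compute. The following is a minimal numpy sketch (the frame length, hop size, and test signals are illustrative choices, not values from the paper): a low-frequency voiced-like tone shows high energy and low ZCR, while a noise-like signal shows the opposite.

```python
import numpy as np

def short_time_energy_zcr(signal, frame_len=256, hop=128):
    """Per-frame short-time energy and zero-crossing rate of a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len]
        energy[i] = np.sum(frame ** 2)               # short-time energy
        signs = np.sign(frame)
        zcr[i] = np.sum(np.abs(np.diff(signs))) / 2  # zero crossings per frame
    return energy, zcr

# A voiced-like 200 Hz tone vs. a noise-like signal (consonant-like)
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 200 * t)
noise = 0.1 * np.random.default_rng(0).standard_normal(fs)
e_tone, z_tone = short_time_energy_zcr(tone)
e_noise, z_noise = short_time_energy_zcr(noise)
```

With these settings, the noise-like signal has a much higher average ZCR and lower energy than the tone, matching the segmentation cues described above.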
2.2 Selection and Method of Characteristic Parameters
2.2.1 Choice of Personality Characteristics. Whether in Beijing Opera or in ordinary speech, the speaker's personal habits and pronunciation style on the one hand, and the speaker's position (or the roles played by different actors in Beijing Opera) on the other, result in each person handling each phoneme slightly differently. Generally speaking, the parameters that characterize a speaker's personality are the segmental, the suprasegmental, and the linguistic features [3, 4].
Segmental features: they describe the tonal quality of speech. The characteristic parameters mainly include the position of the formants, the bandwidth of the formants, the spectral tilt, the pitch frequency, and the energy. Segmental features are mainly related to the physiological and phonetic features of the vocal organs and also to the speaker's emotional state. The features used in the tone control model in Section 3 are mainly of this kind.
Suprasegmental features: they mainly refer to the way of speaking, such as the duration of phonemes, pitch, and stress; what people perceive is the rate of speech and the changes of pitch and volume. The features used in the melody control model in Section 4 are mainly of this kind.
Linguistic features: for example, idioms, dialects, accent, and so on.
However, Beijing Opera and ordinary speech differ in their purpose of pronunciation and expression. The pitch and duration of each word in Beijing Opera are controlled by the score in addition to its own pronunciation. Ordinary speech is mainly used to express the content of the speech, but Beijing Opera expresses emotion more through melody. Based on the characteristics described above, the main factors considered in the sound-quality mapping of this study are as follows.
Pitch: it is determined by the vibration frequency of the source over a period of time. The higher the vibration frequency, the higher the sound, and conversely the lower. Beijing Opera's pitch depends on the character role: LaoSheng, for example, is relatively low, while Dan is relatively high.

Duration: the length of a sound is determined by the duration of the sound-source vibration. The longer the duration, the longer the sound, and conversely the shorter. The average length per word in Beijing Opera is relatively long, and its variation range is relatively large.

Sound intensity: the strength of a sound depends on the vibration amplitude of the sound source; the greater the amplitude, the stronger the sound, and, conversely, the smaller the amplitude, the weaker the sound. Since the amplitude of Beijing Opera is driven by strong emotion, its range is larger than that of ordinary speech. In general, ordinary speech has only a relatively small, uniformly distributed amplitude range.
Table 1: The correlations of the subjective and objective amounts of speech.

Objective amount       | pitch | volume | tone | duration
fundamental frequency  |  +++  |   +    |  ++  |    +
amplitude              |   +   |  +++   |   +  |    +
spectral envelope      |  ++   |   +    | +++  |    +
time                   |   +   |   +    |   +  |   +++

Relevance is positively related to the number of '+'s.
Tone: the frequency content of different sounds always has distinctive characteristics in the waveform. For example, when different Beijing Opera characters sing the same passage, they can be distinguished by the difference between their timbres.
By combining the subjective amounts of speech with the objective amounts, we analyzed their correlations, which are given in Table 1.
The acoustic characteristics of the speech signal are an indispensable research object for speech analysis and speech transformation. They mainly manifest as prosody and spectrum. Prosody is perceived as pitch, duration, and volume; acoustically, it corresponds to the fundamental frequency, duration, and amplitude. The spectral envelope is perceived as the tonal characteristic.
2.2.2 MFCC Feature Extraction. MFCC is an acronym for Mel Frequency Cepstrum Coefficient. The Mel scale is based on human auditory properties and is nonlinearly related to frequency in Hz; Mel Frequency Cepstral Coefficients (MFCCs) use this relationship to compute the spectral signature. The extraction principle is as follows.
(1) Pre-Emphasis. Pre-emphasis passes the speech signal through a high-pass filter:

H(z) = 1 − μz⁻¹   (1)

The value of μ is between 0.9 and 1.0; we usually take 0.97. The purpose of pre-emphasis is to raise the high-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low to high frequency. At the same time, it compensates for the suppression of the high-frequency part of the voice signal by the vocal cords and lips during production, and it highlights the high-frequency formants.
(2) Framing. The first N sampling points are gathered into a unit of observation known as a frame. Under normal circumstances the value of N is 256 or 512, covering about 20–30 ms. In order to avoid the change between two adjacent frames being too large, there is an overlapping area between them; the overlapping area contains M sampling points, and the value of M is usually about 1/2 or 1/3 of N. In speech recognition [5], the voice signal sampling frequency is usually 8 kHz or 16 kHz. For 8 kHz, if the frame length is 256 samples, the corresponding time length is (256/8000) × 1000 = 32 ms.
(3) Windowing (Hamming Window). Each frame is multiplied by a Hamming window to increase continuity at the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, …, N − 1, where N is the size of the frame, the windowed signal is S′(n) = S(n) × W(n). The form of W(n) is

W(n, a) = (1 − a) − a × cos[2πn/(N − 1)], 0 ≤ n ≤ N − 1   (2)

Different values of a produce different Hamming windows; in general, a is taken as 0.46:

s′_n = [0.54 − 0.46 cos(2π(n − 1)/(N − 1))] × s_n   (3)
(4) Fast Fourier Transform. Since the characteristics of a signal are usually difficult to see in the time domain, it is usually converted to an energy distribution in the frequency domain for observation; different energy distributions can represent the characteristics of different voices. Therefore, after multiplication by the Hamming window, each frame must also undergo a fast Fourier transform to obtain its spectral energy distribution. The windowed signal of each frame is fast-Fourier-transformed to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared magnitude of the spectrum. The DFT of the voice signal is
X_a(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N}, 0 ≤ k ≤ N   (4)

where x(n) is the input speech signal and N is the number of points in the Fourier transform.
(5) Triangular Bandpass Filter. The energy spectrum is passed through a set of Mel-scale triangular filter banks: a filter bank with M filters is defined (the number of filters is similar to the number of critical bands). The filters used are triangular, and M usually takes 22–26. The spacing between adjacent center frequencies f(m) narrows as m decreases and widens as m increases, as shown in Figure 3.
The frequency response of the triangular filter is defined as

H_m(k) =
  0,   k < f(m − 1)
  2(k − f(m − 1)) / {[f(m + 1) − f(m − 1)][f(m) − f(m − 1)]},   f(m − 1) ≤ k ≤ f(m)
  2(f(m + 1) − k) / {[f(m + 1) − f(m − 1)][f(m + 1) − f(m)]},   f(m) ≤ k ≤ f(m + 1)
  0,   k ≥ f(m + 1)   (5)
(6) The logarithmic energy output of each filter bank is calculated as

s(m) = ln[Σ_{k=0}^{N−1} |X_a(k)|² H_m(k)], 0 ≤ m ≤ M   (6)
(7) The MFCC coefficients are obtained by the discrete cosine transform (DCT):

C(n) = Σ_{m=1}^{M} s(m) cos[πn(m − 0.5)/M], n = 1, 2, …, L   (7)

The above logarithmic energies are passed through the DCT to obtain the L-order Mel-scale cepstrum parameters. L is the MFCC coefficient order, usually 12–16; here M is the number of triangular filters.
(8) Logarithmic Energy. In addition, the volume (i.e., energy) of a frame is also an important speech feature and is very easy to calculate. Therefore, the logarithmic energy of each frame is usually added, so that the basic speech features of each frame gain one more dimension: one logarithmic energy plus the remaining cepstral parameters.
(9) Dynamic Difference Parameter Extraction (including First-Order and Second-Order Differences). The standard cepstral parameters (MFCCs) only reflect the static characteristics of the speech; the dynamic characteristics of the speech can be described by the difference spectrum of these static characteristics. Experiments show that combining dynamic and static features can effectively improve the system's recognition performance. The difference parameters can be calculated with the following formula:

d_t =
  C_{t+1} − C_t,   t < K
  Σ_{k=1}^{K} k(C_{t+k} − C_{t−k}) / (2 Σ_{k=1}^{K} k²),   otherwise
  C_t − C_{t−1},   t ≥ Q − K   (8)

where d_t is the t-th first-order difference, C_t is the t-th cepstral coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be 1 or 2. Substituting the result back into the above equation yields the second-order difference parameters.
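Steps (1)–(7) above can be sketched end to end in numpy. This is an illustrative implementation, not the authors' code: the filter count (24) and coefficient order (12) are picked from the ranges quoted above, and mel(f) = 2595·log10(1 + f/700) is the common Mel-scale convention, which the paper does not spell out.

```python
import numpy as np

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=24, n_ceps=12):
    """MFCC extraction following steps (1)-(7): pre-emphasis, framing,
    Hamming window, FFT power spectrum, Mel filter bank, log, and DCT."""
    # (1) pre-emphasis: H(z) = 1 - 0.97 z^-1
    s = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # (2)-(3) overlapping frames, each multiplied by a Hamming window
    n_frames = 1 + (len(s) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = s[idx] * np.hamming(frame_len)
    # (4) power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # (5) Mel-scale triangular filter bank
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = np.floor((frame_len + 1)
                   * imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
                   / fs).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = pts[m - 1], pts[m], pts[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # (6) logarithmic filter-bank energies
    log_e = np.log(power @ fbank.T + 1e-10)
    # (7) DCT over the filter index, keeping the first n_ceps coefficients
    m_idx = np.arange(1, n_filters + 1)
    dct = np.cos(np.pi * np.arange(1, n_ceps + 1)[:, None] * (m_idx - 0.5) / n_filters)
    return log_e @ dct.T

# Example: MFCCs of a one-second 300 Hz tone sampled at 8 kHz
sig = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)
ceps = mfcc(sig)   # one 12-coefficient vector per frame
```

The frame and hop sizes follow the 256-sample / half-overlap figures from step (2); the delta features of step (9) can be added on top of `ceps` with the difference formula in (8).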
2.3 Signal Characteristics Analysis. According to previous research on speech-signal processing technology, signal analysis mainly focuses on two methods: analysis in the time domain and analysis in the frequency domain.
2.3.1 Time-Domain Analysis. In the time domain, the horizontal axis is time and the vertical axis is amplitude. By observing the waveform in the time domain, we can obtain some important features of the speech signal, such as the duration, the starting and ending positions of the syllables, the sound intensity (energy), and the vowels (see Figure 4).
2.3.2 Frequency-Domain Analysis. This includes the voice signal's spectrum, power spectrum, cepstrum, spectral envelope, and so on. It is generally considered that the spectrum of the speech signal is the product of the frequency response of the vocal-tract system and the spectrum of the excitation source, while both the frequency response of the vocal-tract system and the excitation source are time-varying. Therefore, frequency-domain analysis of speech signals is often performed using the short-time Fourier transform (STFT), defined as

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m) w(n − m) e^{−jωm}   (9)
The study of the Chinese song synthesis algorithm is based on parameter modification. From the definition, the short-time Fourier transform has two independent variables (n and ω), so it is both a discrete function of time n and a continuous function of angular frequency ω. In the formula, w(n) is a window function; as n takes different values, different short segments of the voice are extracted, which is where the subscript n differs from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics.
(1) High frequency resolution: the main lobe is narrow and sharp.

(2) Large side-lobe attenuation, so that the spectrum leakage caused by other frequency components is small. These two conditions are in fact contradictory and cannot be satisfied at the same time; therefore, we often adopt a compromise and choose a Hamming window.
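Equation (9) can be evaluated directly for a single time index n. The sketch below (an illustration, not the authors' code) takes w(·) to be a Hamming window of length N and probes a grid of angular frequencies; for a pure 1 kHz tone, the magnitude peaks at the probe frequency nearest 1 kHz.

```python
import numpy as np

N = 256  # window length

def w(k):
    """Hamming window w(k), nonzero for 0 <= k < N (eq. (2) with a = 0.46)."""
    k = np.asarray(k)
    inside = (k >= 0) & (k < N)
    vals = 0.54 - 0.46 * np.cos(2 * np.pi * np.clip(k, 0, N - 1) / (N - 1))
    return np.where(inside, vals, 0.0)

def stft_at(x, n, omegas):
    """X_n(e^{jw}) = sum_m x(m) w(n - m) e^{-jwm}, eq. (9), at one time index n."""
    m = np.arange(len(x))
    win = w(n - m)  # window selects the short segment ending at index n
    return np.array([np.sum(x * win * np.exp(-1j * om * m)) for om in omegas])

fs = 8000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 1000 * t)        # pure 1 kHz tone
freqs = np.arange(0, 4000, 250)         # probe frequencies in Hz
X = stft_at(x, n=1024, omegas=2 * np.pi * freqs / fs)
peak = freqs[np.argmax(np.abs(X))]      # frequency with the largest magnitude
```

Sliding n over the signal and stacking |X| column by column yields exactly the spectrogram discussed in Section 2.3.3.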
[Figure 3: Mel frequency filter bank, with filter boundary frequencies f(0), f(1), …, f(7).]

[Figure 4: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" time-domain diagram.]

However, both time-domain analysis and frequency-domain analysis have their own limitations: time-domain analysis does not provide an intuitive visualization of the frequency characteristics of speech signals, while frequency-domain analysis lacks the variation of speech signals over time. As a result, the Beijing Opera synthesis experiment analyzed the speech signal using the later, improved method of analyzing the spectrum.
2.3.3 Spectrum Analysis. The Fourier-analysis display of a speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional spectrum that represents the frequency spectrum of a voice as it changes over time, with the vertical axis as frequency and the horizontal axis as time; the intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the spoken sentence. It combines the characteristics of the spectrum and the time-domain waveform to clearly show how the speech spectrum changes over time; that is, it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.
[Figure 5: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" spectrogram (frequency in Hz).]

2.4 Straight Algorithm Introduction. Straight is an acronym for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram". It is a more accurate method of speech analysis and synthesis, proposed by the Japanese scholar Hideki Kawahara in 1997. The straight algorithm builds on the source-filter model, in which the source comes from the vocal-cord vibration and the filter refers to the vocal-tract transfer function. It can adaptively interpolate and smooth the short-time speech spectrum in the time domain and the frequency domain, so as to extract the spectral envelope more accurately, and it can adjust the speech duration, fundamental frequency, and spectral parameters to a great extent without affecting the quality of the synthesized speech. The straight analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral parameter estimation, and speech synthesis. The first two are described in detail below; the synthesis process is shown in Figure 6.
First of all, the speech signal is input, the fundamental frequency F0 and the spectral envelope are extracted by the straight algorithm, and the parameters are modulated to generate a new sound source and a time-varying filter. According to the source-filter model, we use (10) to synthesize the voice:
y(t) = Σ_{t_i ∈ Q} (1/√G(f₀(t_i))) v_{t_i}(t − T(t_i))   (10)

v_{t_i}(t) and T(t_i) are given by

v_{t_i}(t) = (1/√(2π)) ∫_{−∞}^{+∞} V(ω, t_i) φ(ω) e^{jωt} dω   (11)

T(t_i) = Σ_{t_k ∈ Q, k < i} 1/√G(f₀(t_k))   (12)
In the formula, Q represents the positions of a group of samples in the synthesis excitation, and G represents the pitch modulation. The F0 after modulation can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. V(ω, t_i) is the Fourier transform of the corresponding minimum-phase pulse, as in (13); A[S(u(ω), r(t)), u(ω), r(t)] is calculated from the modulated amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15):
V(ω, t) = exp[(1/√(2π)) ∫₀^∞ h_t(q) e^{jωq} dq]   (13)

h_t(q) =
  0   (q < 0)
  c_t(0)   (q = 0)
  2c_t(q)   (q > 0)   (14)

c_t(q) = (1/√(2π)) ∫_{−∞}^{+∞} e^{−jωq} lg A{S[u(ω), r(t)], u(ω), r(t)} dω   (15)
q is the frequency. Straight audiometry experiments show that, even with high-sensitivity headphones, the synthesized speech signal is almost indistinguishable from the original signal.

[Figure 6: Straight synthesis system. Input voice → F0 fundamental extraction and spectral envelope extraction → parameter adjustment of the voice parameters → sound source and time-varying filter → output synthetic speech.]
3 Tone Control Model
Voice tone conversion refers to speech-signal processing that keeps the semantic content the same but changes only the timbre, so that one person's voice (the source voice), after conversion, sounds like another person's voice (the target voice). This chapter introduces the extraction of the parameters closely related to timbre by using the straight algorithm, then the training of the extracted parameters with a GMM to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are resynthesized with straight in order to achieve voice conversion. It can be seen from Section 2 that the tone characteristics in speech mainly correspond to the parameters "fundamental frequency F0" and "channel spectrum".
3.1 Fundamental Frequency and Channel Spectrum Extraction
3.1.1 Extraction of the Fundamental Frequency. The straight algorithm has good time-domain resolution of the fundamental frequency trajectory. The extraction is based on wavelet analysis: first the channel containing the fundamental is found among the extracted audio frequencies, and then the instantaneous frequency is calculated as the fundamental frequency.
Extraction of the fundamental can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning of F0 refers to applying the wavelet transform to the voice signal to obtain wavelet coefficients; the wavelet coefficients are then transformed into a set of instantaneous frequencies from which F0 is selected for each frame. F0 trajectory smoothing selects, based on the calculated high-frequency energy ratio and the minimum-noise-energy equivalent, the most likely F0 among the instantaneous frequencies, thus constituting a smooth pitch trajectory. F0 fine positioning fine-tunes the current F0 through an FFT. The process is as follows.
The input signal is s(t) and the output composite signal is D(t, τ_c), where g_AG(t) is the analyzing wavelet, derived from a Gabor filter, and τ_c is the analysis period of the analyzing wavelet:

D(t, τ_c) = |τ_c|^{−1/2} ∫_{−∞}^{+∞} s(μ) g_AG((t − μ)/τ_c) dμ   (16)
g_AG(t) is given by (17) and (18):

g_AG(t) = g(t − 1/4) − g(t + 1/4)   (17)

g(t) = e^{−π(t/η)²} e^{−j2πt}   (18)
Among them, η is the frequency resolution of the Gabor filter; according to the characteristics of the filter, it is usually larger than 1.
Through calculation, the variable "fundamentalness" is introduced and denoted by M(t, τ₀):

M = −log[∫_Ω (d|D|/du)² du] + log[∫_Ω |D|² du] − log[∫_Ω (d arg(D)/du)² du] + 2 log τ₀ + log Ω(τ₀)   (19)

The first term is the amplitude modulation (AM) value; the second term is the total energy, used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency, used to normalize the FM value; the fifth is the normalization factor of the time-domain integration interval. From the formula the following can be drawn: when AM and FM take their minimum, M takes its maximum, namely, at the fundamental component.
However, in practice F0 always changes rapidly, so in order to reduce the impact on M, the formula is adjusted as in (20), (21), and (22):

M = −log[∫_Ω (d|D|/du − μ_AM)² du] + log[∫_Ω |D|² du] − log[∫_Ω (d arg(D)/du − μ_FM)² du] + 2 log τ₀ + log Ω(τ₀)   (20)

μ_AM = (1/Ω) ∫_Ω (d|D|/du) du   (21)

μ_FM = (1/Ω) ∫_Ω (d² arg(D)/du²) du   (22)
Finally, τ₀ is used to calculate the instantaneous frequency ω(t), and the fundamental frequency F0 is obtained by (23), (24), and (25), where f_s is the sampling frequency:

f₀ = ω₀(t)/(2π)   (23)

ω(t) = 2 f_s arcsin(|y_d(t)|/2)   (24)

y_d(t) = D(t + Δt/2, τ₀)/|D(t + Δt/2, τ₀)| − D(t − Δt/2, τ₀)/|D(t − Δt/2, τ₀)|   (25)
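Equations (23)–(25) can be checked numerically. In the sketch below, D(t) stands for the analytic (complex) output of a filter passing a pure tone, which is an assumption made for illustration, and Δt is taken as one sample; the normalized difference y_d then recovers the tone's frequency.

```python
import numpy as np

fs = 8000            # sampling frequency f_s
f0 = 220.0           # true fundamental of the test tone
t = np.arange(1024) / fs
# D(t): assumed analytic output of a filter tuned to the f0 component
D = np.exp(2j * np.pi * f0 * t)

u = D / np.abs(D)                              # unit-magnitude D, as in eq. (25)
y_d = u[1:] - u[:-1]                           # eq. (25) with delta_t = one sample
omega = 2 * fs * np.arcsin(np.abs(y_d) / 2)    # eq. (24)
f_est = omega / (2 * np.pi)                    # eq. (23)
```

For a constant-frequency D the estimate is exact: |y_d| = 2|sin(ω/(2f_s))|, so the arcsin in (24) undoes the difference operation and f_est equals 220 Hz at every sample.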
3.1.2 Channel Spectral Parameter Extraction. The previous method extracted the sound-source information and the channel-spectrum information of the voice and then adjusted them to achieve voice modification. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is

s(t) = p(t) ∗ v(t)   (26)
Since it is difficult to find v(t) directly, the straight algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated as in (27) and (28):

s_w(t, t′) = s(t) w(t, t′)   (27)

S_W(ω, t′) = FFT[s_w(t, t′)] = S(ω, t′) ∗ W(ω, t′)   (28)
The short-time spectrum shows the periodicity related to the fundamental frequency in the time domain and in the frequency domain, respectively. The short-time spectrum window functions used are (29) and (30):

w(t) = (1/f₀) e^{−π(t f₀)²}   (29)

W(ω) = f₀ √(2π) e^{−π(ω/ω₀)²}   (30)
However, since both the channel spectrum and the sound-source spectrum are related to the fundamental frequency, at this point they cannot be considered separated. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.
Periodicity removal in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, respectively, as in (31), (32), and (33):

w_p(t) = e^{−π(t/τ₀)²} ∗ h(t/τ₀)   (31)

h(t) =
  1 − |t|   (|t| < 1)
  0   (otherwise)   (32)

w_c(t) = w_p(t) sin(π t/τ₀)   (33)
Then the short-time amplitude spectra |S_p(ω, t′)| and |S_c(ω, t′)| are obtained with the two windows, respectively, and finally we get the short-time amplitude spectrum with the periodicity removed:

|S_r(ω, t′)| = √(|S_p(ω, t′)|² + ξ |S_c(ω, t′)|²)   (34)

Among them, ξ is the mixing factor; ξ = 0.13655 gives the optimal solution.
Similarly, the frequency domain also needs a smoothing window V(ω) and a compensation window U(ω) to remove the periodicity of the short-time spectrum S_W(ω), finally giving the spectral envelope S_S′(ω) with the periodicity removed:

S_S′(ω) = S_W(ω) ∗ V(ω) ∗ U(ω)   (35)
Finally, logarithmic amplitude compression and a warped-frequency discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).
3.2 Parameter Conversion with GMM
3.2.1 GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions:

P(X|λ) = Σ_{i=1}^{M} ω_i b_i(X)   (36)
where X is an n-dimensional random vector, ω_i is the mixture weight with Σ_{i=1}^{M} ω_i = 1, and b_i(X) is a subdistribution of the GMM; each subdistribution is a Gaussian distribution:

b_i(X) = (1/((2π)^{n/2} |Σ_i|^{1/2})) e^{−(1/2)(X−μ_i)^T Σ_i^{−1} (X−μ_i)}   (37)

where μ_i is the mean vector and Σ_i is the covariance matrix.
Although the types of phonemes are definite, each phoneme varies in different situations due to context. We use a GMM to model the acoustic characteristics of the speaker and find the most likely mapping at each time.
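Equations (36) and (37) can be transcribed directly into numpy. This is a minimal sketch with illustrative component parameters, not trained values.

```python
import numpy as np

def gaussian(X, mu, Sigma):
    """Multivariate Gaussian density b_i(X), eq. (37)."""
    n = len(mu)
    d = X - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm

def gmm_density(X, weights, mus, Sigmas):
    """Mixture density P(X | lambda), eq. (36): weighted sum of components."""
    return sum(w * gaussian(X, mu, S) for w, mu, S in zip(weights, mus, Sigmas))

# Two illustrative components in 2-D
weights = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
p = gmm_density(np.array([0.0, 0.0]), weights, mus, Sigmas)
# At the first mean, p is dominated by the first component: about 0.6 / (2*pi)
```

In practice, the weights, means, and covariances would be estimated from joint source/target feature vectors (e.g. by EM), which the paper leaves to the training stage.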
3.2.2 Establishing the Conversion Function. GMM training estimates the probability density distribution of the samples, and the estimated model is the weighted sum of several Gaussian models. It maps the feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.
(1) Conversion of the Fundamental Frequency. Here a single-Gaussian model is used to convert the fundamental frequency; the converted fundamental frequency is obtained through the means and variances of the target speaker (μ_tgt, σ_tgt) and the source speaker (μ_src, σ_src), as in (38):

f0_conv(t) = √(σ²_tgt / σ²_src) × f0_src(t) + μ_tgt − √(σ²_tgt / σ²_src) × μ_src   (38)
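Equation (38) is a direct rescaling and can be transcribed as follows; here μ and σ are treated as the mean and standard deviation of each speaker's F0, and the numeric values are illustrative.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Single-Gaussian F0 conversion, eq. (38)."""
    r = np.sqrt(sigma_tgt ** 2 / sigma_src ** 2)   # sqrt(var_tgt / var_src)
    return r * f0_src + mu_tgt - r * mu_src

# Illustrative contour around 200 Hz mapped to a higher, wider target range
f0_src = np.array([190.0, 200.0, 210.0])
out = convert_f0(f0_src, mu_src=200.0, sigma_src=10.0, mu_tgt=300.0, sigma_tgt=20.0)
# The source mean (200 Hz) maps exactly onto the target mean (300 Hz),
# and deviations from the mean are scaled by sigma_tgt / sigma_src
```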
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as
F(X) = E[Y \mid X] = \int Y \, P(Y \mid X) \, dY = \sum_{i=1}^{M} P_i(X) \left[ \mu_i^Y + \Sigma_i^{YX} \left( \Sigma_i^{XX} \right)^{-1} \left( X - \mu_i^X \right) \right]   (39)

P_i(X) = \frac{\omega_i \, b_i(X_t)}{\sum_{k=1}^{M} \omega_k \, b_k(X_t)}   (40)

\mu_i = \begin{bmatrix} \mu_i^X \\ \mu_i^Y \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \qquad i = 1, \ldots, M   (41)
Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" envelope.

μ_i^X and μ_i^Y are the means of the i-th Gaussian component for the source speaker and the target speaker, Σ_i^{XX} is the covariance matrix of the i-th Gaussian component of the source speaker, Σ_i^{XY} (and Σ_i^{YX}) is the cross-covariance matrix of the i-th Gaussian component between the source speaker and the target speaker, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.
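The regression in (39)-(40) is a posterior-weighted sum of per-component linear maps. A minimal one-dimensional sketch in pure Python (function names ours; the paper operates on MFCC vectors, so this scalar version is only illustrative):

```python
import math

def gauss(x, mu, var):
    # Univariate Gaussian density, the scalar case of b_i(X) in Eq. (37)
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_convert(x, weights, mu_x, mu_y, var_xx, cov_yx):
    """Scalar version of the conversion function F(X) in Eq. (39):
    posterior P_i(x) (Eq. (40)) weights each component's regression
    mu_y[i] + cov_yx[i]/var_xx[i] * (x - mu_x[i])."""
    dens = [w * gauss(x, mx, vx) for w, mx, vx in zip(weights, mu_x, var_xx)]
    total = sum(dens)
    post = [d / total for d in dens]  # P_i(x)
    return sum(p * (my + cyx / vx * (x - mx))
               for p, my, cyx, vx, mx in zip(post, mu_y, cov_yx, var_xx, mu_x))
```

With a single component this reduces to ordinary linear regression; with several, the posteriors blend the local regressions smoothly.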
4. Melody Control Model
The composition of Beijing Opera is similar to the synthesis of general singing voice [7, 8]: by superimposing voice and melody, the new pitch of each word is reconstructed. The analysis in the second chapter shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody; it indicates the frequency of vocal-fold vibration. The duration, that is, the length of each word's pronunciation, controls the rhythm of Beijing Opera and represents the speed of the voice. Energy is positively correlated with sound intensity and represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the spoken content while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental-frequency envelope of a Beijing Opera piece corresponds to its melody, which includes tone, pitch, and tremolo [9], whereas the pitch of a note in the score is a constant; their comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. Therefore the control design of the fundamental frequency [10] is as in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has different syllables, and the initials and vowels in each syllable
Figure 8: The control design of the fundamental frequency (the fundamental frequency extracted from MIDI undergoes vibrato processing; Gaussian white noise passed through a high-pass filter is added; the basic frequency curve is output).
Table 2: Duration parameters.

Before modification    After modification
dur_a                  k * dur_a
dur_b                  dur_b
dur_c                  dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: initial-to-vowel transition part duration; dur_c: final (vowel) part duration; dur_t: target total duration.
also play different roles. The initials, whether in ordinary speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word and set the rules for the vowel length shown in Table 2.
The duration of the initial part is modified in proportion [11] (k in the table); k is obtained from a large number of speech-song comparison experiments. The duration of the initial-to-vowel transition region remains unchanged. The length of the vowel section varies so that the total duration of the syllable corresponds to the duration of each note in the score.
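The Table 2 rules can be sketched as a small helper (the function name is ours; k would come from the speech-song comparison experiments described in the text):

```python
def modify_durations(dur_a, dur_b, dur_t, k):
    """Apply the Table 2 rules: scale the initial part by k, keep the
    initial-to-vowel transition unchanged, and let the vowel part absorb
    the remaining time so the syllable matches the note duration dur_t."""
    new_a = k * dur_a
    new_c = dur_t - new_a - dur_b
    if new_c < 0:
        raise ValueError("target note too short for this syllable")
    return new_a, dur_b, new_c
```

By construction the three parts always sum to the target note duration, which is what aligns the lyric rhythm to the score.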
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that in good singing the spectrum has a special resonance peak (the singer's formant) in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model

4.4.1. Introduction to GAN. Generative adversarial networks, abbreviated as GAN [12-17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: input random noise; obtain pseudo data from the generator; take a part of the real data from the true data; mix the two and send them to the discriminator, which gives a true-or-false determination; and return the loss according to this result. The purpose of a GAN is to estimate the potential distribution of the data samples and to generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize the music that accompanies Beijing Opera.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, because some instruments sometimes play only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore this data imbalance is addressed by merging the soundtracks of similar instruments. Each multitrack Beijing Opera score is merged into five musical instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.
Then the merged datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the score passages without lyrics are not what we need; only the soundtracks with Beijing Opera lyrics are selected.
Finally, in order to obtain meaningful music segments to train the time model, the Beijing Opera scores must be divided into corresponding music segments. Four bars are treated as one phrase, and longer passages are cut to this length. Because pitches that are too high or too low are rare, notes below C1 or above C8 are discarded; the target output tensor is 4 (bar) x 96 (time step) x 84 (pitch) x 5 (track). This completes the preparation and preprocessing of the dataset.
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is shown in Figure 10.
The basic framework of a GAN comprises a pair of models, a generative model and a discriminative model. The main purpose is for the generator G, with the aid of the discriminator D, to produce pseudo data consistent with the true data distribution. The input of the model is a random Gaussian white noise signal z; the noise signal is mapped to a new data space by the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the inputs of the true data x and the generated data G(z), respectively, indicating D's confidence that the input is real data rather than generated false data. In this way the quality of the data generated by G is judged. When D can no longer distinguish between the real data x and the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish real data from false data, making D(x) as large as possible, D(G(z)) as small as possible, and the gap between the two as large as possible. The goal of G, by contrast, is to make the response D(G(z)) to its own generated data consistent with
Figure 9: Illustration of the dataset preparation and data preprocessing procedure (the Beijing Opera soundtracks are merged into 5 tracks: huqins, flutes, suonas, drums, cymbals; the music is then screened for best matching confidence and for soundtracks with Beijing Opera lyrics; after data cleaning, the training datasets are obtained).

Figure 10: GAN structure diagram (random noise z ~ p(z) is fed to the generator G, producing fake data G(z); the real data x and G(z) are fed to the discriminator/critic (WGAN-GP), which outputs real/fake; the output consists of 4-bar phrases of 5 tracks).
the response D(x) to the real data, so that D cannot distinguish between generated data and real data. The optimization of the module is therefore a process of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) is finally consistent with the response D(x) to the real data, and neither G nor D can be further optimized.
The training process can be modeled as a simple MinMax problem:

\min_G \max_D \; D(x) - D(G(z))   (42)

The MinMax optimization formula is defined as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]   (43)
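The value function in (43) can be estimated by Monte Carlo averaging over real and generated samples. A toy sketch (the discriminator here is a stand-in callable, not a trained network, and the function name is ours):

```python
import math

def gan_value(D, real_samples, fake_samples):
    """Monte Carlo estimate of the GAN value function of Eq. (43):
    V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(g)) for g in fake_samples) / len(fake_samples)
    return real_term + fake_term
```

At the theoretical optimum, D outputs 1/2 everywhere and V(D, G) = -2 log 2, which is the equilibrium described in the text where D can no longer tell real from generated data.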
The GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but approximates it directly. In theory it can approximate the real data completely; this is the biggest advantage of the GAN.
The training and testing process of the GAN-generated music dataset is shown in Figure 11.
The chord-section data and specific music-style data produced by the generator, together with the generator's multitrack chord-section data and multitrack groove data, are sent to the GAN for training, so that the model learns to generate music with a specific style and the corresponding groove.
5. Experiment

5.1. Tone Control Model
5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in an anechoic room; with the factors discussed earlier taken into comprehensive consideration, it can better meet the actual needs of the voice conversion system. The voice library was recorded by a woman with a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., as the source speech. Another person then recorded a small number of statements as the voice to be converted; Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. The speech of the source speaker, the speech of the target speaker, and the speech converted with STRAIGHT and the GMM model are compared through their spectrograms. All voices are sampled at 16 kHz and quantized with 16 bits; each voice is set to 5 s during the experiment. Their MFCC-based maps are shown in Figures 13, 14, and 15.
These show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents time, the vertical axis represents frequency, and the color represents the corresponding energy. From the comparison of the graphs, it can be seen directly that the shape of the converted MFCC map is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN (random noise z is fed to the bar generator G, which produces bars G(z) conditioned on chords, style, and groove).
Figure 12: Tone control model (training phase: STRAIGHT analysis extracts the fundamental frequency F0 and the spectral envelope from the source and target voices; DTW performs time alignment; GMM training establishes the mapping rules for the MFCCs, and a single Gaussian model computes the F0 mean and variance. Conversion phase: STRAIGHT analysis of the voice to be converted, MFCC parameter conversion and F0 conversion, then STRAIGHT synthesis).
5.2. Melody Control Model
5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and converted using "Only dura", "dura F0", "dura SP", and "all models"; the Beijing Operas produced by the four synthesis methods were compared with the original Beijing Opera. Among them, "Only dura" uses only the duration control model for synthesis; "dura F0" uses the fundamental frequency control model and the duration control model; "dura SP" uses the duration control model and the spectrum control model; "all models" uses all three control models simultaneously. "Real" is the source Beijing Opera.
So the melody control model can be summarized inFigure 16
5.2.2. Experimental Results. The purpose of voice conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of a voice conversion system is also based
Table 3: MOS grading.

Score    Evaluation
1        Uncomfortable and unbearable
2        There is a sense of discomfort, but it can be endured
3        Distortion can be detected and feels uncomfortable
4        Slightly perceptible distortion, but no discomfort
5        Good sound quality, no distortion
Table 4: Experimental results (MOS scores).

ways          BeiJing Opera1   BeiJing Opera2   BeiJing Opera3
Only dura     1.25             1.29             1.02
dura F0       1.85             1.97             1.74
dura SP       1.78             2.90             2.44
all models    3.27             3.69             3.28
real          5                5                5
Figure 13: Source speech spectrogram.
on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.
The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and assigns the quality level to which the measured speech belongs according to these 5 levels. An MOS score of about 3.5 is called communication quality: the quality of the auditorily reconstructed voice is reduced, but this does not prevent people from talking normally. An MOS score lower than 3.0 is called synthetic speech quality: the speech then has high intelligibility, but its naturalness is poor.
Ten testers were asked to give MOS scores for the above synthesis results; the results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, timbre, sound intensity, duration, and other factors. In this experiment, each word is segmented using distinctive features such as the zero-crossing rate and the energy. Then the tone control model and the melody control model are designed; important parameters such as the fundamental frequency, the spectrum, and the duration are extracted, and the extracted features are converted using MFCC, DTW, GMM, and other tools; finally the opera fragments are synthesized.
Compared with other algorithms, the STRAIGHT algorithm performs better in the naturalness of the synthesis and the range of parameter modification, so the STRAIGHT algorithm is also selected for the synthesis of the Beijing Opera.
The above-mentioned 10 testers were again asked to give MOS scores for the synthesis result; the outcome is shown in Table 5.
According to the test results, the subjective scores reached an average of 3.7 points, indicating that the design basically accomplishes Beijing Opera synthesis. Although the Beijing Opera produced by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real one.
6. Conclusion
In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are
Table 5: Rating results (MOS scores).

                          student1  student2  student3  student4  student5
Source Opera fragment     5         5         5         5         5
Synthetic Opera fragment  4         4         4         3         3

                          student6  student7  student8  student9  student10
Source Opera fragment     5         5         5         5         5
Synthetic Opera fragment  4         3         4         4         4
Figure 14: Target speech spectrogram.
Figure 15: Converted speech spectrogram.
Figure 16: Melody control model (feature extraction from the voice yields the syllable fundamental frequency, the spectrum envelope, and the syllable duration; the MIDI score yields the note fundamental frequency and the note length; these feed the F0 control model, the spectrum control model, and the time length control model, whose outputs are combined to synthesize the Beijing Opera).
truly able to compose the Beijing Opera singing art workswith higher quality
Data Availability
The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [zenodo] repository [http://doi.org/10.5281/zenodo.344932]. The previously reported STRAIGHT algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR", no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
Figure 1: Beijing Opera synthesis (the tone control model converts the source voice (B tone, B content), using the target voice (A tone, A content), into a new target voice (A tone, B content); the melody control model turns it into Beijing Opera fragments, which are connected and superimposed with background music to form the Beijing Opera).
Figure 2: Time-domain waveforms, energy graphs, and zero-crossing rate graphs.
syllable and is the smallest linear speech unit divided from the perspective of sound quality; that is, acoustically, phonemes are the smallest units of speech. From a physiological point of view, one phonetic movement forms a phoneme. Phonemes fall into two categories, vowels and consonants, classified by whether the airflow is obstructed by the vocal organs when the sound is produced: the unobstructed ones are vowels, and the obstructed ones are consonants.
2.1.2. Phoneme Segmentation. Because identical phonemes share the same characteristics while different phonemes and their combinations have different characteristics, each phoneme can be segmented. The time-domain waveforms, energy graphs, and zero-crossing rate graphs of "jiao Zhang Sheng yin cang zai qi pan zhi xia" from "Matchmaker's Wall", sung in Beijing Opera style, are shown in Figure 2. From these we can see that the consonant phonemes of the initials are more irregular, whereas the finals formed with them have periodic waveforms. The former have a large zero-crossing rate and low energy; the latter mostly have larger energy. In addition, when silence occurs both are small (the red lines mark the beginning and end of a word).
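The two cues used above, short-time energy and zero-crossing rate, can be computed per frame with a simple sketch (the function name and the frame parameters are ours, chosen for illustration):

```python
def frame_features(signal, frame_len, hop):
    """Short-time energy and zero-crossing rate per frame: initials tend
    to show high ZCR and low energy, finals higher energy and periodicity."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(s * s for s in frame)
        # count sign changes between consecutive samples
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, zcr))
    return feats
```

Thresholding these two sequences jointly is a common way to locate the initial/final boundaries described in the text.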
2.2. Selection and Method of Characteristic Parameters

2.2.1. Choice of Personality Characteristics. Whether in Beijing Opera or in ordinary voice, each speaker's personal habits and pronunciation style on the one hand, and the speaker's role on the other (or the roles played by different actors in Beijing Opera), result in slight differences in each person's realization of each phoneme. Generally speaking, the parameters that characterize a speaker's personality are segmental, suprasegmental, and linguistic features [3, 4].
Segmental features: they describe the tonal characteristics of speech. The characteristic parameters mainly include the position of the formants, the bandwidth of the formants, the spectral tilt, the pitch frequency, and the energy. Segmental features are mainly related to the physiological and phonetic features of the vocal organs and also to the speaker's emotional state. The features used in the tone control model in Section 3 are mainly of this kind.

Suprasegmental characteristics: they mainly refer to the manner of speaking, such as phoneme duration, pitch, and stress; what people perceive are the speech rate and the changes in pitch and volume. The features used in the melody control model in Section 4 are mainly of this kind.

Language features: for example, idioms, dialects, accent, and so on.
However, Beijing Opera and ordinary voice differ in their purpose of pronunciation and expression. The pitch and duration of each word in Beijing Opera are controlled by the score in addition to the word's own pronunciation. Ordinary speech is mainly used to express the content of the speech, but Beijing Opera expresses emotion more through melody. Based on the characteristics described above, the main factors considered in the voice-quality mapping of this study are as follows.
Pitch: it is determined by the vibration frequency of the source over a period of time. The higher the vibration frequency, the higher the sound, and vice versa. Beijing Opera's pitch depends on the character role: LaoSheng, for example, is relatively low, while Dan is relatively high.

Duration: the length of a sound is determined by how long the sound source vibrates. The longer the duration, the longer the sound, and vice versa. The average length per word in Beijing Opera is relatively long, and its range of variation is relatively large.

Sound intensity: the strength of a sound depends on the vibration amplitude of the sound source; the greater the amplitude, the stronger the sound, and the smaller the amplitude, the weaker the sound. Since the amplitude in Beijing Opera is driven by strong emotion, its range is larger than that of ordinary speech, which generally varies only within a relatively small, uniformly distributed amplitude range.
Table 1: The correlations of subjective and objective amounts of speech.

Objective amount         pitch   volume   tone   duration
fundamental frequency    +++     +        ++     +
amplitude                +       +++      +      +
spectral envelope        ++      +        +++    +
time                     +       +        +      +++

Relevance is positively related to the number of '+' signs.
Timbre: the frequency content of different voices always has distinctive characteristics in the waveform; for example, different Beijing Opera characters singing the same passage are told apart by the difference between their timbres.

By combining the subjective amounts of speech with the objective amounts, the correlations in Table 1 can be obtained.

The acoustic characteristics of the speech signal are an indispensable object of study for speech analysis and speech transformation. They mainly comprise prosody and spectrum. Prosody is perceived as pitch, duration, and volume; acoustically, it corresponds to the fundamental frequency, the duration, and the amplitude. The spectral envelope is perceived as the timbre characteristic.
2.2.2. MFCC Feature Extraction. The Mel Frequency Cepstral Coefficient (MFCC) is based on the properties of human hearing: the Mel scale is nonlinearly related to frequency in Hz, and MFCCs use this relationship to compute the cepstral features from the Hz spectrum. The extraction principle is as follows.
(1) Pre-Emphasis. Pre-emphasis passes the speech signal through a high-pass filter:

H(z) = 1 - \mu z^{-1}   (1)

The value of μ is between 0.9 and 1.0; we usually take 0.97. The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter, keeping the same signal-to-noise ratio over the whole band from low to high frequencies. At the same time, it compensates for the high-frequency attenuation of the speech signal caused by the vocal cords and lips during production and highlights the high-frequency formants.
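In the time domain, the filter in (1) is the difference equation y[n] = x[n] - μ x[n-1]. A minimal sketch (function name ours):

```python
def pre_emphasis(x, mu=0.97):
    """High-pass pre-emphasis filter H(z) = 1 - mu*z^-1 of Eq. (1):
    y[n] = x[n] - mu * x[n-1], with the first sample passed through."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]
```

A constant (DC) input is attenuated to 1 - μ = 0.03 of its value after the first sample, which is exactly the low-frequency suppression the text describes.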
(2) Framing. Every N sampling points are grouped into a unit of observation called a frame. Under normal circumstances, N is 256 or 512, covering about 20-30 ms. To avoid excessive change between two adjacent frames, adjacent frames overlap by M sampling points, where M is usually about 1/2 or 1/3 of N. In speech recognition [5], the sampling frequency of the voice signal is usually 8 kHz or 16 kHz; for 8 kHz, if the frame length is 256 samples, the corresponding duration is (256/8000) x 1000 = 32 ms.
(3) Windowing (Hamming Window). Each frame is multiplied by a Hamming window to increase continuity at the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame size, the windowed signal is S'(n) = S(n) x W(n), with W(n) of the form

W(n, a) = (1 - a) - a \cos\left[\frac{2\pi n}{N - 1}\right], \quad 0 \le n \le N - 1   (2)

Different values of a produce different Hamming windows; in general a = 0.46 is used, giving

s'_n = \left[0.54 - 0.46 \cos\left(\frac{2\pi (n-1)}{N - 1}\right)\right] s_n   (3)
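Equations (2) and (3) can be sketched as follows (function names ours):

```python
import math

def hamming(N, a=0.46):
    # W(n, a) = (1 - a) - a * cos(2*pi*n / (N - 1)), Eq. (2)
    return [(1 - a) - a * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_frame(frame):
    # S'(n) = S(n) * W(n): taper the frame ends, Eq. (3)
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The window equals 0.08 at the frame edges and 1.0 at the centre, which is what smooths the discontinuities between adjacent frames.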
(4) Fast Fourier Transform. Since the characteristics of a signal are usually difficult to see in the time domain, it is converted to an energy distribution in the frequency domain; different energy distributions represent different voice characteristics. Therefore, after multiplication by the Hamming window, each frame undergoes a fast Fourier transform to obtain its spectrum, and the magnitude of the spectrum is squared to obtain the power spectrum of the speech signal. The DFT of the voice signal is

X_a(k) = \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N   (4)

where x(n) is the input speech signal and N is the number of points of the Fourier transform.
(5) Triangular Bandpass Filter. The energy spectrum is passed through a bank of M Mel-scale triangular filters (the number of filters is similar to the number of critical bands); M usually takes 22-26. The spacing between the centre frequencies f(m) narrows as m decreases and widens as m increases, as shown in Figure 3.
The frequency response of the triangular filter is definedas
4 Advances in Multimedia
$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\[4pt] \dfrac{2\,(k - f(m-1))}{[f(m+1) - f(m-1)]\,[f(m) - f(m-1)]}, & f(m-1) \le k \le f(m) \\[8pt] \dfrac{2\,(f(m+1) - k)}{[f(m+1) - f(m-1)]\,[f(m+1) - f(m)]}, & f(m) \le k \le f(m+1) \\[8pt] 0, & k \ge f(m+1) \end{cases} \quad (5)$$
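A sketch of the filter bank described by (5), using the common height-1 triangle variant (the sample rate, FFT size, filter count, and frequency range below are assumptions, not values from the paper):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M=26, nfft=512, fs=16000, fmin=0.0, fmax=8000.0):
    # M triangular filters H_m(k) with edges f(m-1), f(m), f(m+1)
    # placed equally on the Mel scale, as in eq. (5).
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), M + 2)
    f = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)  # bin edges f(0..M+1)
    H = np.zeros((M, nfft // 2 + 1))
    for m in range(1, M + 1):
        for k in range(f[m - 1], f[m]):          # rising slope
            H[m - 1, k] = (k - f[m - 1]) / (f[m] - f[m - 1])
        for k in range(f[m], f[m + 1]):          # falling slope (peak = 1 at f(m))
            H[m - 1, k] = (f[m + 1] - k) / (f[m + 1] - f[m])
    return H

H = mel_filterbank()
```

This height-normalized form drops the factor-2 area normalization of (5) but is otherwise the same triangle layout; either convention works as long as it is used consistently.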
(6) Calculate the logarithmic energy output from each filter bank as

$$s(m) = \ln\left[\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right], \quad 1 \le m \le M \quad (6)$$
(7) The MFCC coefficients are obtained by the discrete cosine transform (DCT) as

$$C(n) = \sum_{m=1}^{M} s(m) \cos\left[\frac{\pi n\,(m - 0.5)}{M}\right], \quad n = 1, 2, \ldots, L \quad (7)$$

Taking the logarithmic energies above into the DCT yields the L-order Mel-scale cepstrum parameters, where L is the MFCC coefficient order (usually 12–16) and M is the number of triangular filters.
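Equations (6) and (7) can be sketched together as follows (the DCT is written with a 0-based filter index, which is equivalent to the (m − 0.5) form; the flat test spectrum is a synthetic input):

```python
import numpy as np

def mfcc_from_power(ps: np.ndarray, H: np.ndarray, L: int = 13) -> np.ndarray:
    # eq (6): log filter-bank energies; a small floor avoids log(0)
    s = np.log(H @ ps + 1e-10)
    M = H.shape[0]
    m = np.arange(M)
    # eq (7): C(n) = sum_m s(m) cos(pi*n*(m + 0.5)/M), n = 1..L
    return np.array([np.sum(s * np.cos(np.pi * n * (m + 0.5) / M))
                     for n in range(1, L + 1)])

# A flat power spectrum gives equal band energies, so every DCT
# coefficient with n >= 1 vanishes:
C = mfcc_from_power(np.ones(257), np.ones((26, 257)))
```

The vanishing coefficients for a flat input are a handy sanity check: the DCT basis vectors with n ≥ 1 are orthogonal to a constant.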
(8) Logarithmic Energy. In addition, the volume (i.e., energy) of a frame is an important speech feature and is very easy to calculate. Therefore, the logarithmic energy of each frame is usually added, so that the basic feature vector of each frame gains one more dimension: one logarithmic energy plus the remaining cepstrum parameters.

(9) Dynamic Difference Parameter Extraction (Including First-Order and Second-Order Differences). The standard cepstrum parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features effectively improves recognition performance. The difference parameters are calculated with the following formula:
$$d_t = \begin{cases} C_{t+1} - C_t, & t < K \\[4pt] \dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{2\sum_{k=1}^{K} k^2}, & \text{otherwise} \\[8pt] C_t - C_{t-1}, & t \ge Q - K \end{cases} \quad (8)$$

where $d_t$ is the t-th first-order difference, $C_t$ is the t-th cepstrum coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be 1 or 2. Applying the same formula to the first-order differences yields the second-order difference parameters.
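A sketch of (8); the edge handling below (one-sided differences in the first and last K frames) is one reasonable reading of the piecewise cases, and the input frame matrix is illustrative:

```python
import numpy as np

def delta(C: np.ndarray, K: int = 2) -> np.ndarray:
    # C: (T, L) cepstra over T frames. Regression-style differences in the
    # middle (eq (8)), simple one-sided differences near the sequence edges.
    T = C.shape[0]
    d = np.zeros_like(C)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    for t in range(T):
        if t < K:
            d[t] = C[t + 1] - C[t]
        elif t >= T - K:
            d[t] = C[t] - C[t - 1]
        else:
            d[t] = sum(k * (C[t + k] - C[t - k]) for k in range(1, K + 1)) / denom
    return d

d = delta(np.arange(10.0).reshape(-1, 1))  # a linear ramp: slope 1 everywhere
```

Applying `delta` twice gives the second-order difference parameters mentioned in the text.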
2.3. Signal Characteristics Analysis. In previous research on speech signal processing, analysis has mainly focused on two approaches: analysis in the time domain and analysis in the frequency domain.
2.3.1. Time-Domain Analysis. In the time domain, the horizontal axis is time and the vertical axis is amplitude. By observing the waveform in the time domain, we can obtain some important features of the speech signal, such as the duration, the starting and ending positions of syllables, the sound intensity (energy), and the vowels (see Figure 4).

2.3.2. Frequency-Domain Analysis. This includes the voice signal spectrum, power spectrum, cepstrum, spectral envelope, and so on. The frequency spectrum of a speech signal is generally considered to be the product of the frequency response of the vocal-tract system and the spectrum of the excitation source, and both are time-varying. Therefore, frequency-domain analysis of speech signals is often performed using the short-time Fourier transform (STFT), defined as
$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} x(m)\, w(n-m)\, e^{-j\omega m} \quad (9)$$

Our study of the Chinese song synthesis algorithm is based on parameter modification. The short-time Fourier transform has two independent variables (n and ω), so it is both a discrete function of time n and a continuous function of angular frequency ω. In the formula, w(n) is a window function; different values of n select different short speech segments, which is why the subscript n distinguishes it from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics:
(1) High frequency resolution the main lobe is narrowand sharp
(2) Large side-lobe attenuation, so that spectral leakage caused by other frequency components is small. These two conditions in fact contradict each other and cannot be satisfied at the same time; therefore, a compromise is usually adopted, and the Hamming window is often chosen.
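A practical frame-based sketch of the STFT in (9) (the frame length, hop size, and test tone are assumptions for illustration):

```python
import numpy as np

def stft(x: np.ndarray, frame_len: int = 256, hop: int = 128, nfft: int = 256):
    # Slide a Hamming window along x and take the DFT of each windowed
    # segment: a discrete, hopped form of eq. (9).
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.empty((n_frames, nfft // 2 + 1), dtype=complex)
    for i in range(n_frames):
        seg = x[i * hop: i * hop + frame_len]
        X[i] = np.fft.rfft(seg * w, n=nfft)
    return X

x = np.sin(2 * np.pi * 2000 * np.arange(4000) / 16000)  # 2 kHz tone, fs = 16 kHz
X = stft(x)  # the 2 kHz tone peaks at bin 2000*256/16000 = 32 in every frame
```

Plotting |X| over frames and bins is exactly the spectrogram discussed in the next subsection.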
Figure 3: Mel frequency filter bank.

Figure 4: “Jiao Zhang Sheng yin cang zai qi pan zhi xia” time-domain diagram.

However, both time-domain and frequency-domain analysis have their own limitations: time-domain analysis does not give an intuitive view of the frequency characteristics of the speech signal, while frequency-domain analysis lacks the variation of the signal over time. The Beijing Opera synthesis experiment therefore analyzed the speech signal using the improved method of spectrum analysis described next.
2.3.3. Spectrum Analysis. The Fourier-analysis display of a speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional representation of the frequency spectrum of a voice over time, with the vertical axis as frequency and the horizontal axis as time; the intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the spoken sentence: it combines the characteristics of the spectrum and the time-domain waveform, clearly showing how the speech spectrum changes over time as a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.
2.4. Straight Algorithm Introduction. STRAIGHT is an acronym for “Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram”. It is an accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The STRAIGHT algorithm builds on the source-filter model, in which the source comes from the vocal-cord vibration and the filter refers to the vocal-tract transfer function. It adaptively interpolates and smooths the short-time speech spectrum in the time domain and the frequency domain so as to extract the spectral envelope more accurately, and it can adjust the speech duration, fundamental frequency, and spectral parameters to a great extent without degrading the quality of the synthesized speech. The STRAIGHT analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral
Figure 5: “Jiao Zhang Sheng yin cang zai qi pan zhi xia” spectrogram.
parameter estimation, and speech synthesis. The first two are described in detail below; the synthesis process is outlined in Figure 6.

First, the speech signal is input, and the fundamental frequency F0 and spectral envelope are extracted by the STRAIGHT algorithm; the parameters are then modulated to generate a new sound source and a time-varying filter. Following the source-filter model, we use (10) to synthesize the voice:
$$y(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, v_{t_i}\bigl(t - T(t_i)\bigr) \quad (10)$$

with $v_{t_i}$ and $T(t_i)$ given by

$$v_{t_i}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} V(\omega, t_i)\, \varphi(\omega)\, e^{j\omega t}\, d\omega \quad (11)$$

$$T(t_i) = \sum_{t_k \in Q,\; k < i} \frac{1}{\sqrt{G(f_0(t_k))}} \quad (12)$$
In the formula, Q represents the positions of a group of samples in the synthesis excitation, and G represents the pitch modulation. The modulated F0 can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. $V(\omega, t_i)$ is the Fourier transform of the corresponding minimum-phase pulse; $A\{S[u(\omega), r(t)], u(\omega), r(t)\}$ is calculated from the modulated amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15):
$$V(\omega, t) = e^{(1/\sqrt{2\pi}) \int_0^{\infty} h_t(q)\, e^{j\omega q}\, dq} \quad (13)$$

$$h_t(q) = \begin{cases} 0, & q < 0 \\ c_t(0), & q = 0 \\ 2\,c_t(q), & q > 0 \end{cases} \quad (14)$$

$$c_t(q) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-j\omega q}\, \lg A\{S[u(\omega), r(t)], u(\omega), r(t)\}\, d\omega \quad (15)$$
where q is the frequency. STRAIGHT audiometry experiments show that, even with high-sensitivity headphones, the synthesized speech signal is almost indistinguishable from the original signal.

Figure 6: STRAIGHT synthesis system.
3. Tone Control Model
Voice timbre conversion refers to processing a voice signal so as to keep the semantic content the same while changing only the timbre, so that one person's voice (the source voice) sounds, after conversion, like another person's voice (the target voice). This chapter introduces the extraction, via the STRAIGHT algorithm, of the parameters closely related to timbre; a GMM is then trained on the extracted parameters to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are resynthesized with STRAIGHT to achieve the voice conversion. As seen in Section 2, the timbre characteristics of speech correspond mainly to the parameters “fundamental frequency F0” and “channel spectrum”.
3.1. The Fundamental Frequency and Channel Spectrum Extraction
3.1.1. Extraction of the Fundamental Frequency. The STRAIGHT algorithm achieves good time-domain resolution of the fundamental frequency trajectory. Based on the wavelet transform, it first locates the frequency band containing the fundamental in the extracted audio and then calculates the instantaneous frequency as the fundamental frequency.

Fundamental extraction can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning applies the wavelet transform to the voice signal to obtain wavelet coefficients, which are then transformed into a set of instantaneous frequencies from which F0 is selected for each frame. Trajectory smoothing selects, based on the calculated high-frequency energy ratio and the minimum equivalent noise energy, the most likely F0 among the instantaneous frequencies, thus constituting a smooth pitch trajectory. Fine positioning fine-tunes the current F0 through an FFT. The process is as follows.
The input signal is s(t) and the output composite signal is $D(t, \tau_c)$, where $g_{AG}(t)$ is the analyzing wavelet, obtained by passing the input through a Gabor filter, and $\tau_c$ is the analysis period of the analyzing wavelet:

$$D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(\mu)\, g_{AG}\!\left(\frac{t - \mu}{\tau_c}\right) d\mu \quad (16)$$
$g_{AG}(t)$ is given by (17) and (18):

$$g_{AG}(t) = g\left(t - \frac{1}{4}\right) - g\left(t + \frac{1}{4}\right) \quad (17)$$

$$g(t) = e^{-\pi (t/\eta)^2}\, e^{-j 2\pi t} \quad (18)$$

Here $\eta$ is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.
Through this calculation, the variable “fundamentalness”, denoted $M(t, \tau_0)$, is introduced:

$$M = -\log\left[\int_{\Omega}\left(\frac{d|D|}{du}\right)^2 du\right] + \log\left[\int_{\Omega}|D|^2\, du\right] - \log\left[\int_{\Omega}\left(\frac{d \arg D}{du}\right)^2 du\right] + 2\log\tau_0 + \log\Omega(\tau_0) \quad (19)$$

The first term is the amplitude-modulation (AM) value; the second is the total energy, used to normalize the AM value; the third is the frequency-modulation (FM) value; the fourth is the square of the fundamental frequency, used to normalize the FM value; and the fifth is the normalization factor of the time-domain integration interval. From the formula it can be seen that when the AM and FM terms take their minimum, M takes its maximum, which identifies the fundamental component.
However, in practice F0 always changes rapidly, so in order to reduce the impact on M the formula is adjusted as in (20), (21), and (22):

$$M = -\log\left[\int_{\Omega}\left(\frac{d|D|}{du} - \mu_{AM}\right)^2 du\right] + \log\left[\int_{\Omega}|D|^2\, du\right] - \log\left[\int_{\Omega}\left(\frac{d \arg D}{du} - \mu_{FM}\right)^2 du\right] + 2\log\tau_0 + \log\Omega(\tau_0) \quad (20)$$

$$\mu_{AM} = \frac{1}{\Omega}\int_{\Omega}\left(\frac{d|D|}{du}\right) du \quad (21)$$

$$\mu_{FM} = \frac{1}{\Omega}\int_{\Omega}\left(\frac{d \arg D}{du}\right) du \quad (22)$$
Finally, $\tau_0$ is used to calculate the instantaneous frequency $\omega(t)$ and obtain the fundamental frequency F0 via (23), (24), and (25):

$$f_0 = \frac{\omega_0(t)}{2\pi} \quad (23)$$

$$\omega(t) = 2 f_s \arcsin\frac{|y_d(t)|}{2} \quad (24)$$

$$y_d(t) = \frac{D(t + \Delta t/2,\, \tau_0)}{|D(t + \Delta t/2,\, \tau_0)|} - \frac{D(t - \Delta t/2,\, \tau_0)}{|D(t - \Delta t/2,\, \tau_0)|} \quad (25)$$
3.1.2. Channel Spectral Parameter Extraction. The previous approach extracts the sound-source information and the channel-spectrum information of the voice and then adjusts them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound-source parameter p(t) is

$$s(t) = p(t) * v(t) \quad (26)$$
Since it is difficult to find v(t) directly, the STRAIGHT algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is computed as in (27) and (28):

$$s_w(t, t') = s(t)\, w(t - t') \quad (27)$$

$$S_W(\omega, t') = \mathrm{FFT}\left[s_w(t, t')\right] = S(\omega, t') * W(\omega, t') \quad (28)$$

The short-time spectrum shows periodicity related to the fundamental frequency in both the time domain and the frequency domain. The short-time spectral window function used is given by (29) and (30):

$$w(t) = \frac{1}{f_0}\, e^{-\pi (t f_0)^2} \quad (29)$$

$$W(\omega) = \frac{f_0}{\sqrt{2\pi}}\, e^{-\pi (\omega/\omega_0)^2} \quad (30)$$
However, since both the channel spectrum and the sound-source spectrum are related to the fundamental frequency, at this point they cannot be considered separated; the periodicity must be further removed in the time domain and the frequency domain to achieve the separation.

Periodicity removal in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, as in (31), (32), and (33):
$$w_p(t) = e^{-\pi (t/\tau_0)^2} * h\left(\frac{t}{\tau_0}\right) \quad (31)$$

$$h(t) = \begin{cases} 1 - |t|, & |t| < 1 \\ 0, & \text{otherwise} \end{cases} \quad (32)$$

$$w_c(t) = w_p(t)\, \sin\left(\pi \times \frac{t}{\tau_0}\right) \quad (33)$$
Then the short-time amplitude spectra $|S_p(\omega, t')|$ and $|S_c(\omega, t')|$ are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is

$$|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^2 + \xi\, |S_c(\omega, t')|^2} \quad (34)$$

where $\xi$ is the mixing factor; $\xi = 0.13655$ gives the optimal solution.
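The blend in (34) is a one-liner; the sketch below uses the reported mixing factor, with the subscript naming (smoothed vs. compensation spectrum) taken from the surrounding text:

```python
import numpy as np

XI = 0.13655  # mixing factor reported as optimal in the text

def mix_amplitude(Sp: np.ndarray, Sc: np.ndarray, xi: float = XI) -> np.ndarray:
    # |S_r| = sqrt(|S_p|^2 + xi * |S_c|^2): blend the pitch-synchronously
    # smoothed spectrum with the compensation-window spectrum (eq (34)).
    return np.sqrt(np.abs(Sp) ** 2 + xi * np.abs(Sc) ** 2)
```

When the compensation spectrum is zero the result reduces to |S_p| exactly, which is a quick correctness check.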
Similarly, the frequency domain also needs a smoothing window $V(\omega)$ and a compensation window $U(\omega)$ to remove the periodicity in the short-time spectrum $S_W(\omega)$, finally giving the spectral envelope $S_{S'}(\omega)$ with the periodicity removed:

$$S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega) \quad (35)$$

Finally, logarithmic amplitude compression and the warped-frequency discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (the MFCC is described in detail in Section 2).
3.2. Parameter Conversion with a GMM
3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions:

$$P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i\, b_i(X) \quad (36)$$
where X is an n-dimensional random vector, $\omega_i$ is a mixture weight with $\sum_{i=1}^{M} \omega_i = 1$, and $b_i(X)$ is a subdistribution of the GMM, each of which is a Gaussian distribution:

$$b_i(X) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}}\, e^{-(1/2)(X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i)} \quad (37)$$

where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix.
Although the set of phoneme types is fixed, each phoneme varies with context. We use the GMM to model the acoustic characteristics of the speaker and find the most likely mapping at each time.

3.2.2. Establishing the Conversion Function. The GMM estimates the probability density distribution of the samples, and the estimated (trained) model is a weighted sum of several Gaussian models. It maps the feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.
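The mixture density of (36)–(37) can be sketched directly; the component parameters below are toy values for illustration, not trained ones:

```python
import numpy as np

def gmm_pdf(X, weights, means, covs):
    # P(X|lambda) = sum_i w_i b_i(X), each b_i a multivariate Gaussian
    # (eqs (36)-(37)).
    n = X.shape[0]
    total = 0.0
    for w, mu, S in zip(weights, means, covs):
        diff = X - mu
        norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(S)))
        total += w * norm * np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff)
    return total

# A single standard 1-D Gaussian evaluated at its mean gives 1/sqrt(2*pi):
p = gmm_pdf(np.zeros(1), [1.0], [np.zeros(1)], [np.eye(1)])
```

In practice the weights, means, and covariances are fitted by EM (e.g., `sklearn.mixture.GaussianMixture`); the explicit loop above just makes the formula concrete.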
(1) Conversion of the Fundamental Frequency. Here the single-Gaussian-model method is used to convert the fundamental frequency; the converted fundamental frequency is obtained from the mean and standard deviation of the target speaker $(\mu_{tgt}, \sigma_{tgt})$ and of the source speaker $(\mu_{src}, \sigma_{src})$:

$$f_0^{conv}(t) = \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \times f_0^{src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \times \mu_{src} \quad (38)$$
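Equation (38) is a per-frame affine map; a sketch with illustrative statistics (in practice this transform is often applied in the log-F0 domain, which the paper does not specify):

```python
import math

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    # eq (38): scale by the std-dev ratio, then shift to the target mean
    r = math.sqrt(sigma_tgt ** 2 / sigma_src ** 2)
    return r * f0_src + mu_tgt - r * mu_src

# A source frame sitting exactly at the source mean maps exactly to the
# target mean (illustrative values):
f = convert_f0(200.0, mu_src=200.0, sigma_src=20.0, mu_tgt=260.0, sigma_tgt=30.0)
```

A frame one source standard deviation above the mean lands one target standard deviation above the target mean, preserving the normalized pitch contour.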
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as

$$F(X) = E[Y \mid X] = \int Y\, P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X)\left[\mu_i^{Y} + \Sigma_i^{YX}\left(\Sigma_i^{XX}\right)^{-1}\left(X - \mu_i^{X}\right)\right] \quad (39)$$

$$P_i(X) = \frac{\omega_i\, b_i(X_t)}{\sum_{k=1}^{M} \omega_k\, b_k(X_t)} \quad (40)$$
$$\mu_i = \begin{bmatrix} \mu_i^{X} \\ \mu_i^{Y} \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \qquad i = 1, \ldots, M \quad (41)$$
$\mu_i^{X}$ and $\mu_i^{Y}$ are the means of the i-th Gaussian component for the source speaker and the target speaker, respectively; $\Sigma_i^{XY}$ is the cross-covariance matrix of the i-th Gaussian component between the source speaker and the target speaker; and $P_i(X)$ is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: “Jiao Zhang Sheng yin cang zai qi pan zhi xia” envelope.
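The regression of (39)–(40) can be sketched with numpy; the single-component parameters in the demo are toy values chosen so the mapping reduces to a simple shift:

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, S_xx, S_yx):
    # F(x) = sum_i P_i(x) [ mu_y_i + S_yx_i inv(S_xx_i) (x - mu_x_i) ]  (eq (39))
    # P_i(x): posterior of component i under the source-side GMM        (eq (40))
    resp = []
    n = len(x)
    for w, mx, Sxx in zip(weights, mu_x, S_xx):
        d = x - mx
        b = np.exp(-0.5 * d @ np.linalg.inv(Sxx) @ d) / (
            (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sxx)))
        resp.append(w * b)
    resp = np.array(resp) / np.sum(resp)        # normalized posteriors P_i(x)
    y = np.zeros_like(x)
    for P, my, mx, Sxx, Syx in zip(resp, mu_y, mu_x, S_xx, S_yx):
        y = y + P * (my + Syx @ np.linalg.inv(Sxx) @ (x - mx))
    return y

# One component with identity covariances: F(x) = mu_y + (x - mu_x) = x + 5
y = gmm_convert(np.array([2.0]), [1.0], [np.zeros(1)],
                [np.full(1, 5.0)], [np.eye(1)], [np.eye(1)])
```

With multiple components, the posteriors blend the per-component linear maps, giving a piecewise-smooth spectral conversion.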
4. Melody Control Model
The composition of Beijing Opera has similarities with the synthesis of general singing voice [7, 8]: through the superposition of voice and melody, a new pitch for each word is reconstructed. The analysis in Section 2 shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody, as it indicates the vibration frequency of the human vocal folds; the duration (the pronounced length of each word) controls the rhythm of Beijing Opera, representing the speed of the voice; and the energy is positively correlated with sound intensity, representing the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech is closer to prose, whereas Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody reside in the fundamental frequency. The fundamental-frequency envelope of a Beijing Opera piece corresponds to its melody, which includes tone, pitch, and tremolo [9], whereas the pitch of a note in the score is a constant; their comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato must also be considered. The control design of the fundamental frequency [10] is therefore as shown in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has several syllables, and the initials and vowels in each syllable
Figure 8: The control design of the fundamental frequency.
Table 2: Duration parameters

Before modification    After modification
dur_a                  k * dur_a
dur_b                  dur_b
dur_c                  dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: duration of the initial-to-vowel transition part; dur_c: final (vowel) part duration; dur_t: target total duration.
also play different roles. The initials, whether in ordinary speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. To ensure the naturalness of Beijing Opera, we use the note duration to control the length of each word, with the rules for the vowel length shown in Table 2.

The duration of the initial part is modified in proportion k (see Table 2) [11]; k is obtained from a large number of comparison experiments between speech and song. The duration of the initial-to-vowel transition region remains unchanged. The length of the vowel section varies so that the total duration of the syllable corresponds to the duration of each note in the score.
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
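The duration rules of Table 2 amount to a small arithmetic routine; the durations and k below are illustrative values, not measurements from the paper:

```python
def apply_duration_rule(dur_a, dur_b, dur_t, k):
    # Table 2: scale the initial part by k, keep the transition part,
    # and let the vowel part absorb the remainder of the note duration.
    new_a = k * dur_a
    new_b = dur_b
    new_c = dur_t - new_a - new_b
    return new_a, new_b, new_c

# e.g. a 0.90 s note with a 0.10 s initial, 0.05 s transition, k = 0.8
parts = apply_duration_rule(0.10, 0.05, 0.90, k=0.8)
```

By construction the three parts always sum to the target note duration dur_t, which is exactly the alignment property the time control model needs.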
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that in well-produced singing the spectrum has a distinctive resonance peak near 2.5–3 kHz (the singer's formant), and changes in the singing spectrum directly affect the perceived quality. To synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model
4.4.1. Introduction of GAN Networks. Generative adversarial networks (GANs) [12–17] are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: input random noise; obtain pseudo data from the generator; take a part of the real data from the true data; mix the two and send them to the discriminator, which gives a true-or-false determination; and return the loss according to this result. The purpose of a GAN is to estimate the potential distribution of the data samples and generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize music to compose Beijing Opera accompaniment.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. This data imbalance problem is therefore solved by merging the tracks of similar instruments. Each multi-track Beijing Opera score is merged into five instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.
Then we filter the merged-track datasets to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the score sections without lyrics are not what we need, so only the soundtracks that carry Beijing Opera lyrics are selected.
Finally, in order to obtain meaningful music segments to train the time model, the Beijing Opera scores must be divided into corresponding music segments. We treat 4 bars as a passage and cut longer passages into this length. Because pitches that are too high or too low are uncommon, pitches lower than C1 or higher than C8 are discarded; the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track). This completes the preparation and preprocessing of the dataset.
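The target tensor can be sketched as a binary piano-roll of the stated shape (the specific note indices below are illustrative, not from the dataset):

```python
import numpy as np

# Target output tensor from preprocessing: 4 bars x 96 time steps
# x 84 retained pitches x 5 instrument tracks, as a binary piano-roll.
phrase = np.zeros((4, 96, 84, 5), dtype=np.uint8)

# e.g. mark one note-on for the first track (indices are illustrative):
bar, step, pitch, track = 0, 0, 40, 0
phrase[bar, step, pitch, track] = 1
```

With 96 time steps per 4-bar phrase, each bar has 24 steps, i.e., a temporal resolution of a 16th-note triplet grid per beat in 4/4 time.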
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is shown in Figure 10.
The basic framework of a GAN comprises a pair of models: a generative model and a discriminative model. The main purpose is to generate, with the generator G aided by the discriminator D, pseudo data consistent with the true data distribution. The input to the model is a random Gaussian white-noise signal z; the noise signal is mapped into a new data space by the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the input of the real data x and the generated data G(z), respectively, indicating D's confidence in judging whether the input is real data or generated false data. In this way, the quality of the data generated by G is judged. When D finally cannot distinguish between the real data x and the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish real data from false data, making D(x) as large as possible and D(G(z)) as small as possible, with the gap between the two as large as possible; the goal of G, conversely, is to make the response D(G(z)) to its generated data consistent with
Figure 9: Illustration of the dataset preparation and data preprocessing procedure.
Figure 10: GAN structure diagram.
the response D(x) to the real data, so that D cannot distinguish between generated data and real data. The optimization of the model is therefore a process of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) finally matches the performance D(x) on the real data, and neither G nor D can be further optimized.
The training process can be modeled as a simple MinMax problem:

$$\min_{G}\, \max_{D}\; D(x) - D(G(z)) \quad (42)$$
The MinMax optimization objective is defined as follows:

$$\min_{G}\, \max_{D}\; V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \quad (43)$$
The GAN does not require a preset data distribution; that is, it does not need to formulate an explicit description of p(x) but approximates the real data distribution directly. In theory it can approximate the real data completely, which is the biggest advantage of the GAN.
The training and testing process for the GAN-generated music dataset is shown in Figure 11.
The chord-section data and specific music-style data produced by the generator, together with the generator's multi-track chord-section data and multi-track groove data, are sent to the GAN for training, so as to generate music of a specific style with the corresponding groove.
5. Experiment
5.1. Tone Control Model
5.1.1. Experimental Process. The voice library used in the experimental simulation was recorded in an anechoic room; with the factors discussed earlier taken into comprehensive consideration, it meets the practical needs of the voice conversion system. The voice library was recorded by a woman in a standard Mandarin accent and contains numbers, technical nouns, everyday words, etc., as the source speech. Another person then recorded a small number of statements as the voice to be converted; Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. The spectrograms of the source speaker's speech, the target speaker's speech, and the speech converted with the STRAIGHT and GMM models are presented. All voices are sampled at 16 kHz and quantized with 16 bits, and each voice is set to 5 s during the experiment. Their MFCCs are shown in Figures 13, 14, and 15.
These figures show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. A direct comparison of the graphs shows that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN.
Figure 12: Tone control model.
5.2. Melody Control Model
5.2.1. Experimental Process. To evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and converted using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Operas produced by the four synthesis methods were then compared with the original. Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental frequency control model and the duration control model; dura_SP uses the duration control model and the spectrum control model; all_models uses all three control models simultaneously; "real" is the source Beijing Opera.
So the melody control model can be summarized inFigure 16
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based
Table 3: MOS grading.

Score   Evaluation
1       Uncomfortable and unbearable
2       There is a sense of discomfort, but it can be endured
3       Distortion can be detected and feels uncomfortable
4       Slightly perceptible distortion, but no discomfort
5       Good sound quality, no distortion
Table 4: Experimental results (MOS scores).

Method        Beijing Opera 1   Beijing Opera 2   Beijing Opera 3
Only dura          1.25              1.29              1.02
dura F0            1.85              1.97              1.74
dura SP            1.78              2.90              2.44
all models         3.27              3.69              3.28
real               5                 5                 5
Figure 13: Source speech spectrogram (spectrogram frequency, 0 to 6000 Hz).
on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.
The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and, according to these 5 levels, assigns the score of the quality level to which the measured speech belongs. A MOS score of about 3.5 is called communication quality; at this level the perceived quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality; at this level the speech has high intelligibility, but the naturalness is poor.
Ten testers were recruited to give MOS scores for the above composite results. The results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is distinguished by its unique characteristics, such as zero-crossing rate and energy. Then the tone control model and the melody control model are designed; the important parameters such as the fundamental frequency, spectrum, and time are extracted and converted using MFCC, DTW, GMM, and other tools; and finally the opera fragments are synthesized.
Compared with other algorithms, the STRAIGHT algorithm performs better in terms of the naturalness of the synthesis and the range of parameter modification, so the STRAIGHT algorithm is also selected for the synthesis of the Beijing Opera.
Again, the above-mentioned 10 testers performed MOS scoring on the above composite result. The outcome is shown in Table 5.
According to the test results, the subjective test reached an average of 3.7 points, indicating that the design basically completed the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real Beijing Opera.
6. Conclusion
In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm in machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are
Table 5: Rating results (MOS scores).

                           student1  student2  student3  student4  student5
Source Opera fragment         5         5         5         5         5
Synthetic Opera fragment      4         4         4         3         3

                           student6  student7  student8  student9  student10
Source Opera fragment         5         5         5         5         5
Synthetic Opera fragment      4         3         4         4         4
Figure 14: Target speech spectrogram.
Figure 15: Converted speech spectrogram.
Figure 16: Melody control model. Feature extraction from the voice yields the syllable fundamental frequency, the spectrum envelope, and the syllable duration; MIDI supplies the note fundamental frequency and the note length; the F0 control model, the spectrum control model, and the time-length control model then drive the Beijing Opera synthesis.
truly able to compose Beijing Opera singing artworks of higher quality.
Data Availability
The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [Zenodo] repository (http://doi.org/10.5281/zenodo.344932). The previously reported STRAIGHT algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR", no. 3132017XNG1750; and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92–104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63–80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] S. Hasim et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67–79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435–438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288–3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, China, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
Table 1: The correlations of the subjective and objective amounts of speech.

Objective amount         pitch   volume   tone   duration
fundamental frequency    +++     +        ++     +
amplitude                +       +++      +      +
spectral envelope        ++      +        +++    +
time                     +       +        +      +++

Relevance is positively related to the number of '+'s.
Tone: the frequency content of different voices always has distinctive characteristics in the waveform; for example, when different Beijing Opera characters sing the same passage, they can be told apart by the difference between the two timbres.

By combining the subjective amount of speech with the objective amount, we analyzed the correlations, which are summarized in Table 1.

The acoustic characteristics of the speech signal are an indispensable research object for speech analysis and speech transformation. They are mainly manifested in prosody and spectrum. Prosody is perceived as pitch, duration, and volume; acoustically, it corresponds to the fundamental frequency, duration, and amplitude. The spectral envelope is perceived as the timbre characteristic.
2.2.2. MFCC Feature Extraction. MFCC is an acronym for Mel Frequency Cepstral Coefficient. The Mel scale is based on human auditory properties and is nonlinearly related to the Hz frequency; the Mel Frequency Cepstral Coefficients (MFCCs) use this relationship to compute the spectral signature from the Hz spectrum. The extraction principle is as follows.
(1) Pre-Emphasis. Pre-emphasis processing passes the speech signal through a high-pass filter:

H(z) = 1 - \mu z^{-1}  (1)
The value of \mu is between 0.9 and 1.0; we usually take 0.97. The purpose of pre-emphasis is to raise the high-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low to high frequency. At the same time, it compensates for the suppression of the high-frequency part of the voice signal by the vocal cords and lips during production and highlights the high-frequency formants.
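As a concrete illustration (not from the paper), the pre-emphasis filter of (1) can be sketched in a few lines of Python; the function name pre_emphasis and the list-based signal representation are our own choices:

```python
def pre_emphasis(signal, mu=0.97):
    """Apply H(z) = 1 - mu * z^(-1), i.e. y[n] = x[n] - mu * x[n-1].
    The first sample is passed through unchanged."""
    return [signal[0]] + [signal[n] - mu * signal[n - 1]
                          for n in range(1, len(signal))]
```

For a constant input, every output sample after the first equals (1 - mu) times the input, which is how the filter attenuates slowly varying (low-frequency) content while leaving fast changes largely intact.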
(2) Framing. The first N sampling points are gathered into a unit of observation known as a frame. Under normal circumstances the value of N is 256 or 512, covering about 20-30 ms. In order to avoid too large a change between two adjacent frames, there is an overlapping area between them; the overlapping area contains M sampling points, and the value of M is usually about 1/2 or 1/3 of N. The sampling frequency of voice signals used in speech recognition [5] is usually 8 kHz or 16 kHz. For 8 kHz, if the frame length is 256 samples, the corresponding time length is (256/8000) × 1000 = 32 ms.
(3) Windowing (Hamming Window). Multiply each frame by a Hamming window to increase continuity at the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the size of the frame, the windowed signal is S'(n) = S(n) \times W(n). The form of W(n) is

W(n, a) = (1 - a) - a \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1  (2)

A different value of 'a' produces a different Hamming window; in general 'a' takes 0.46, giving

s'_n = \left[0.54 - 0.46 \cos\left(\frac{2\pi (n-1)}{N-1}\right)\right] \cdot s_n  (3)
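Steps (2) and (3) can be sketched together in Python; an illustrative fragment under the assumptions stated in the text (N = 256 samples per frame, 50% overlap, a = 0.46), with helper names of our own choosing:

```python
import math

def frame_signal(x, frame_len=256, hop=128):
    """Split the signal into overlapping frames; hop = frame_len - M,
    where M is the number of overlapping samples (here M = N/2)."""
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, hop)]

def hamming(N, a=0.46):
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), Eq. (2)."""
    return [(1 - a) - a * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def window_frame(frame, win):
    """S'(n) = S(n) * W(n): pointwise product of frame and window."""
    return [s * w for s, w in zip(frame, win)]
```

The window tapers from about 0.08 at both ends to nearly 1.0 in the middle, which is what suppresses the discontinuities at the frame boundaries.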
(4) Fast Fourier Transform. Since the characteristics of a signal are usually difficult to see in the time domain, the signal is usually converted to the frequency domain to observe its energy distribution; different energy distributions represent different voice characteristics. Therefore, after multiplication by the Hamming window, each frame is subjected to a fast Fourier transform to obtain its spectrum, and the squared magnitude of the spectrum gives the power spectrum of the speech signal. The DFT of the voice signal is
X_a(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi nk/N}, \quad 0 \le k \le N  (4)

where x(n) is the input speech signal and N is the number of points in the Fourier transform.
(5) Triangular Bandpass Filter. The energy spectrum is passed through a set of Mel-scale triangular filter banks, defining a filter bank with M filters (the number of filters is similar to the number of critical bands). The filter used is a triangular filter, and M usually takes 22-26. The spacing between the f(m) narrows as m decreases and broadens as m increases, as shown in Figure 3.
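The band edges f(0), ..., f(M+1) are equally spaced on the Mel scale. The text does not give the Mel mapping explicitly; the sketch below assumes the common convention mel(f) = 2595·log10(1 + f/700) and maps the edges to FFT-bin indices (the helper names are ours):

```python
import math

def hz_to_mel(f):
    # common Mel mapping (an assumption; the paper does not state it)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_points(f_low, f_high, M, nfft, fs):
    """FFT-bin indices of the M + 2 band edges f(0)..f(M+1),
    equally spaced between f_low and f_high on the Mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    mels = [lo + i * (hi - lo) / (M + 1) for i in range(M + 2)]
    return [int((nfft + 1) * mel_to_hz(m) / fs) for m in mels]
```

Because the Mel scale is logarithmic, consecutive edges sit close together at low frequencies and spread out at high frequencies, matching the spacing behaviour described above.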
The frequency response of the triangular filter is defined as
H_m(k) =
  0,  k < f(m-1)
  \frac{2 (k - f(m-1))}{[f(m+1) - f(m-1)] [f(m) - f(m-1)]},  f(m-1) \le k \le f(m)
  \frac{2 (f(m+1) - k)}{[f(m+1) - f(m-1)] [f(m+1) - f(m)]},  f(m) \le k \le f(m+1)
  0,  k \ge f(m+1)
(5)
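A direct transcription of (5) into Python (an illustrative sketch; tri_filter is a name of our choosing, and f holds the three band-edge bins of filter m):

```python
def tri_filter(k, f):
    """H_m(k) of Eq. (5): a triangle rising on [f(m-1), f(m)],
    falling on [f(m), f(m+1)], and zero elsewhere."""
    lo, mid, hi = f  # f(m-1), f(m), f(m+1)
    if k < lo or k > hi:
        return 0.0
    if k <= mid:
        return 2.0 * (k - lo) / ((hi - lo) * (mid - lo))
    return 2.0 * (hi - k) / ((hi - lo) * (hi - mid))
```

The response peaks at the center bin f(m) and falls to zero exactly at the two neighbouring edges, so adjacent filters overlap by half a band.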
(6) Calculate the logarithmic energy output of each filter bank:

s(m) = \ln\left[\sum_{k=0}^{N-1} |X_a(k)|^2 H_m(k)\right], \quad 0 \le m \le M  (6)
(7) The MFCC coefficients are obtained by the discrete cosine transform (DCT):

C(n) = \sum_{m=1}^{M} s(m) \cos\left[\frac{\pi n (m - 0.5)}{M}\right], \quad n = 1, 2, \ldots, L  (7)
Taking the above logarithmic energies into the DCT yields the L-order Mel-scale cepstrum parameters. The order L is the MFCC coefficient order, usually 12-16; here M is the number of triangular filters.
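Step (7) is a type-II DCT over the M log filter-bank energies; a minimal Python sketch of Eq. (7), with a function name of our own choosing, taking the s(m) values as a list:

```python
import math

def dct_mfcc(log_energies, L=12):
    """C(n) = sum_{m=1}^{M} s(m) * cos(pi * n * (m - 0.5) / M), n = 1..L."""
    M = len(log_energies)
    return [sum(log_energies[m - 1] * math.cos(math.pi * n * (m - 0.5) / M)
                for m in range(1, M + 1))
            for n in range(1, L + 1)]
```

A useful sanity check: for a flat log spectrum every coefficient is numerically zero, since each cosine basis vector (n >= 1) sums to zero over the half-shifted sample points.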
(8) Logarithmic Energy. In addition, the volume (i.e., energy) of a frame is also an important feature of speech and is very easy to calculate. Therefore, the logarithmic energy of each frame is usually added, so that the basic speech features of each frame have one more dimension: one logarithmic energy plus the remaining cepstrum parameters.
(9) Dynamic Difference Parameter Extraction (Including First-Order and Second-Order Differences). The standard cepstrum parameters MFCC only reflect the static characteristics of the speech; the dynamic characteristics of the speech can be described by the difference spectrum of these static characteristics. Experiments show that combining dynamic and static features can effectively improve the system's recognition performance. The difference parameters can be calculated with the following formula:
d_t =
  C_{t+1} - C_t,  t < K
  \frac{\sum_{k=1}^{K} k (C_{t+k} - C_{t-k})}{2 \sum_{k=1}^{K} k^2},  otherwise
  C_t - C_{t-1},  t \ge Q - K
(8)
where d_t is the t-th first-order difference, C_t is the t-th cepstrum coefficient, Q is the order of the cepstral coefficients, and K is the time difference of the first derivative, which can be 1 or 2. Substituting the result of the above equation back into it yields the second-order difference parameters.
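The difference computation of Eq. (8) can be sketched as follows for a sequence of scalar cepstral values (a simplification of ours; real MFCC frames are vectors, to which the same formula is applied per dimension):

```python
def delta(c, K=2):
    """First-order differences of Eq. (8): a regression over +/-K frames
    in the interior, simple forward/backward differences at the edges."""
    T = len(c)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(T):
        if t < K:                       # start of the sequence
            out.append(c[t + 1] - c[t])
        elif t >= T - K:                # end of the sequence
            out.append(c[t] - c[t - 1])
        else:                           # interior: weighted regression
            out.append(sum(k * (c[t + k] - c[t - k])
                           for k in range(1, K + 1)) / denom)
    return out
```

Applying delta to its own output yields the second-order difference parameters, as the text describes.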
2.3. Signal Characteristics Analysis. According to previous research on speech signal processing technology, analysis mainly focuses on the signal in two domains: the time domain and the frequency domain.
2.3.1. Time-Domain Analysis. In the time domain, the horizontal axis is time and the vertical axis is amplitude. By observing the waveform in the time domain, we can obtain some important features of the speech signal, such as the duration, the starting and ending positions of the syllables, the sound intensity (energy), and the vowels (see Figure 4).
2.3.2. Frequency-Domain Analysis. This includes the voice signal spectrum, power spectrum, cepstrum, spectral envelope, and so on. It is generally considered that the frequency spectrum of the speech signal is the product of the frequency response of the channel system and the spectrum of the excitation source, while the frequency response of the channel system and the excitation source are time-varying. Therefore, frequency-domain analysis of speech signals is often performed using the short-time Fourier transform (STFT), defined as

X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} x(m) w(n - m) e^{-j\omega m}  (9)
The study of the Chinese song synthesis algorithm is based on parameter modification. The short-time Fourier transform has two independent variables (n and \omega), so it is both a discrete function of time n and a continuous function of angular frequency. In the formula, w(n) is a window function; different values of n select different short voice segments, and the subscript n distinguishes it from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics.
(1) High frequency resolution: the main lobe is narrow and sharp.
(2) Large side-lobe attenuation: the spectrum leakage caused by other frequency components is small.
These two conditions are in fact contradictory and cannot be satisfied at the same time; therefore, we often adopt a compromise and choose a Hamming window.
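A literal (and deliberately slow) transcription of Eq. (9), evaluating X_n at the discrete frequencies \omega_k = 2\pi k / K; an illustrative sketch only, with names of our own choosing:

```python
import cmath
import math

def stft_at(x, win, n, num_bins):
    """X_n(e^{j w_k}) = sum_m x(m) * w(n - m) * exp(-j * w_k * m),
    where w(n - m) is nonzero only for 0 <= n - m < len(win)."""
    N = len(win)
    out = []
    for k in range(num_bins):
        wk = 2.0 * math.pi * k / num_bins
        acc = 0j
        # restrict the sum to the support of the shifted window
        for m in range(max(0, n - N + 1), min(len(x), n + 1)):
            acc += x[m] * win[n - m] * cmath.exp(-1j * wk * m)
        out.append(acc)
    return out
```

In practice each windowed frame is computed with an FFT rather than this direct sum; the direct form merely makes the two variables n and \omega of Eq. (9) explicit.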
However, both time-domain analysis and frequency-domain analysis have their own limitations: time-domain analysis does not give an intuitive visualization of the
Figure 3: Mel frequency filter bank, with band edges f(0), f(1), ..., f(7) and responses H_1(k), ..., H_6(k).
Figure 4: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" time-domain diagram (amplitude versus time, 0 to 6 s).
frequency characteristics of speech signals, and frequency-domain analysis lacks the variation of the speech signal over time. As a result, the Beijing Opera synthesis experiment analyzed the speech signal using the improved method of spectrum analysis.
2.3.3. Spectrum Analysis. The Fourier analysis display of the speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional spectrum representing the frequency content of a voice over time, with the vertical axis as frequency and the horizontal axis as time. The intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the speech sentence: it combines the characteristics of the frequency spectrum and the time-domain waveform, clearly showing how the speech spectrum changes over time; that is, it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.
2.4. STRAIGHT Algorithm Introduction. STRAIGHT is an acronym for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram". It is a relatively accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The STRAIGHT algorithm builds on the source-filter model, in which the source comes from the vocal-cord vibration and the filter refers to the channel transfer function. It can adaptively interpolate and smooth the short-time speech spectrum in the time domain and the frequency domain so as to extract the spectral envelope more accurately, and it can adjust the speech duration, fundamental frequency, and spectral parameters to a great extent without affecting the quality of the synthesized speech. The STRAIGHT analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral
Figure 5: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" spectrogram (frequency 0 to 6000 Hz).
parameter estimation, and speech synthesis. The first two of them are described in detail below; only the synthesis process is shown in Figure 6.
First, the speech signal is input; the fundamental frequency F0 and the spectral envelope are extracted by the STRAIGHT algorithm, and the parameters are modulated to generate a new sound source and a time-varying filter. According to the source-filter model, we use (10) to synthesize the voice:

y(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}} v_{t_i}(t - T(t_i))  (10)
v_{t_i}(t) and T(t_i) are given by
v_{t_i}(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} V(\omega, t_i) \varphi(\omega) e^{j\omega t} d\omega  (11)

T(t_i) = \sum_{t_k \in Q,\, k < i} \frac{1}{\sqrt{G(f_0(t_k))}}  (12)
In the formula, Q represents the positions of a group of samples in the synthesis excitation and G represents the pitch modulation. The F0 after modulation can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. V(\omega, t_i) is the Fourier transform of the corresponding minimum-phase pulse, as in (13); A\{S[u(\omega), r(t)], u(\omega), r(t)\} is calculated from the modulation amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15).
V(\omega, t) = e^{(1/\sqrt{2\pi}) \int_0^{\infty} h_t(q) e^{j\omega q} dq}  (13)

h_t(q) =
  0,  q < 0
  c_t(0),  q = 0
  2 c_t(q),  q > 0
(14)

c_t(q) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-j\omega q} \lg A\{S[u(\omega), r(t)], u(\omega), r(t)\} d\omega  (15)
q is the frequency. STRAIGHT audiometry experiments show that, even in the case of high-sensitivity headphones, the
Figure 6: STRAIGHT synthesis system. The input voice undergoes F0 fundamental extraction and spectral envelope extraction; after parameter adjustment, the voice parameters drive the sound source and a time-varying filter, which output the synthetic speech.
synthesized speech signal is almost indistinguishable fromthe original signal
3. Tone Control Model
Voice tone conversion refers to voice signal processing that keeps the semantic content the same but changes only the timbre, so that one person's voice signal (the source voice), after the conversion processing, sounds like another person's voice (the target voice). This chapter introduces the extraction of the parameters that are closely related to the timbre by using the STRAIGHT algorithm; the extracted parameters are then trained with a GMM to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are synthesized with STRAIGHT in order to achieve the voice conversion. It can be seen from Section 2 that the tone characteristics in speech mainly correspond to the parameters "fundamental frequency F0" and "channel spectrum".
3.1. Fundamental Frequency and Channel Spectrum Extraction
3.1.1. Extraction of the Fundamental Frequency. The STRAIGHT algorithm has good time-domain resolution for the fundamental frequency trajectory. The analysis is based on the wavelet transform: the fundamental component is first found in the extracted audio, and the instantaneous frequency is then calculated as the fundamental frequency.
The extraction of the fundamental frequency can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. The coarse positioning of F0 refers to the wavelet transform of the voice signal to obtain the wavelet coefficients; the wavelet coefficients are then transformed into a set of instantaneous frequencies from which F0 is selected for each frame. F0 trajectory smoothing selects, based on the calculated high-frequency energy ratio and the minimum-noise-energy equivalent, the most likely F0 among the instantaneous frequencies, thus constituting a smooth pitch trajectory. F0 fine positioning fine-tunes the current F0 through an FFT. The process is as follows.
The input signal is s(t) and the output composite signal is D(t, \tau_c), where g_{AG}(t) is the analyzing wavelet, obtained by passing the input signal through a Gabor filter, and \tau_c is the analysis period of the analyzing wavelet:

D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(\mu) g_{AG}\left(\frac{t - \mu}{\tau_c}\right) d\mu  (16)
g_{AG}(t) is given by (17), with g(t) as in (18):

g_{AG}(t) = g\left(t - \frac{1}{4}\right) - g\left(t + \frac{1}{4}\right)  (17)

g(t) = e^{-\pi (t/\eta)^2} e^{-j 2\pi t}  (18)
Among them, \eta is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.
Through calculation, the variable "fundamentalness" is introduced and denoted by M(t, \tau_0):
M = -\log\left[\int_{\Omega} \left(\frac{d|D|}{du}\right)^2 du\right] + \log\left[\int_{\Omega} |D|^2 du\right] - \log\left[\int_{\Omega} \left(\frac{d \arg D}{du}\right)^2 du\right] + 2 \log \tau_0 + \log \Omega(\tau_0)  (19)
The first term is the amplitude modulation (AM) value; the second term is the total energy used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency used to normalize the FM value; the fifth is the normalization factor of the time-domain integration interval. From the formula the following can be drawn: when AM and FM take their minimum, M takes its maximum, namely, the fundamental component is obtained.
However, in practice F0 always changes rapidly, so in order to reduce the impact on M the formula is adjusted as in (20), (21), and (22):
M = -\log\left[\int_{\Omega} \left(\frac{d|D|}{du} - \mu_{AM}\right)^2 du\right] + \log\left[\int_{\Omega} |D|^2 du\right] - \log\left[\int_{\Omega} \left(\frac{d \arg D}{du} - \mu_{FM}\right)^2 du\right] + 2 \log \tau_0 + \log \Omega(\tau_0)  (20)

\mu_{AM} = \frac{1}{\Omega} \int_{\Omega} \frac{d|D|}{du}\, du  (21)

\mu_{FM} = \frac{1}{\Omega} \int_{\Omega} \frac{d^2 \arg D}{du^2}\, du  (22)
Finally, \tau_0 is used to calculate the instantaneous frequency \omega(t) and obtain the fundamental frequency F0 by (23), (24), and (25):
f_0 = \frac{\omega_0(t)}{2\pi}  (23)

\omega(t) = 2 f_s \arcsin\frac{|y_d(t)|}{2}  (24)

y_d(t) = \frac{D(t + \Delta t/2, \tau_0)}{|D(t + \Delta t/2, \tau_0)|} - \frac{D(t - \Delta t/2, \tau_0)}{|D(t - \Delta t/2, \tau_0)|}  (25)
3.1.2. Channel Spectral Parameter Extraction. The previous method extracted the sound source information and the channel spectrum information of the voice and then adjusted them to achieve the conversion. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is

s(t) = p(t) \ast v(t)  (26)
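Eq. (26) is an ordinary convolution of the excitation with the channel response. In discrete time it can be sketched as follows (illustrative Python; the helper name is our own):

```python
def convolve(p, v):
    """Discrete counterpart of s(t) = p(t) * v(t) in Eq. (26)."""
    s = [0.0] * (len(p) + len(v) - 1)
    for i, pi in enumerate(p):
        for j, vj in enumerate(v):
            s[i + j] += pi * vj  # each excitation sample excites the full response
    return s
```

Convolving a unit impulse with the channel response returns the response itself; for a general excitation the two factors are entangled in s(t), which is why STRAIGHT recovers v(t) in the frequency domain instead.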
Since it is difficult to find v(t) directly, the STRAIGHT algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated by (27) and (28):
s_w(t, t') = s(t)\, w(t, t')  (27)

S_W(\omega, t') = \mathrm{FFT}[s_w(t, t')] = S(\omega, t')\, W(\omega, t')  (28)
The short-time spectrum shows the periodicity related to the fundamental frequency in the time domain and the frequency domain, respectively. The short-time spectrum window function used is

w(t) = \frac{1}{f_0} e^{-\pi (t f_0)^2}  (29)

W(\omega) = f_0 \sqrt{2\pi}\, e^{-\pi (\omega/\omega_0)^2}  (30)
However, since both the channel spectrum and the sound source spectrum are related to the fundamental frequency, at this point they cannot yet be considered separated. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.
Periodic removal of the time domain requires the designof pitch-sync smoothing windows and compensation win-dows respectively as (31) (32) and (33)
119908119901 (119905) = 119890minus120587(1199051205910) lowast ℎ( 1199051205910) (31)
ℎ (119905) = 1 minus |119905| (119905 lt 1)0 (119900119905ℎ119890119903119908119894119904119890) (32)
119908c (119905) = 119908119901 (119905) sin(120587 times 1199051205910) (33)
Then the short-time amplitude spectra |S_p(\omega, t')| and |S_c(\omega, t')| are obtained with the two windows, and finally the short-time amplitude spectrum with the periodicity removed is

|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^2 + \xi |S_c(\omega, t')|^2}  (34)
Among them, \xi is the mixing factor; \xi = 0.13655 gives the optimal solution.
Similarly, the frequency domain also needs a smoothing window V(\omega) and a compensation window U(\omega) to remove the periodicity in the short-time spectrum S_W(\omega), finally yielding the spectral envelope S_{S'}(\omega) with the periodicity removed:

S_{S'}(\omega) = S_W(\omega) \ast V(\omega) \ast U(\omega)  (35)
Finally, logarithmic amplitude compression and a warped-frequency discrete cosine transform turn the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).
3.2. GMM-Based Parameter Conversion
3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions:

P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i b_i(X)  (36)
where X is a random vector of n dimensions, \omega_i is a mixture weight with \sum_{i=1}^{M} \omega_i = 1, and b_i(X) is a subdistribution of the GMM; each subdistribution is a Gaussian distribution:

b_i(X) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} e^{-(1/2)(X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i)}  (37)

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.
Although the types of phonemes are definite, each phoneme varies in different situations due to its context. We use the GMM to model the acoustic characteristics of the speaker in order to find the most likely mapping at each time.
3.2.2. Establishing the Conversion Function. GMM training refers to the estimation of the probability density distribution of the samples, and the estimated model is the weighted sum of several Gaussian models. It maps the feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.
(1) The Conversion of the Fundamental Frequency. Here the single-Gaussian-model method is used to convert the fundamental frequency; the converted fundamental frequency, obtained through the means and variances of the target speaker $(\mu_{tgt}, \sigma_{tgt})$ and the source speaker $(\mu_{src}, \sigma_{src})$, is given by (38).
$f_{0,conv}(t) = \sqrt{\dfrac{\sigma_{tgt}^2}{\sigma_{src}^2}}\, f_{0,src}(t) + \mu_{tgt} - \sqrt{\dfrac{\sigma_{tgt}^2}{\sigma_{src}^2}}\, \mu_{src}$  (38)
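The mean-variance transform of (38) is straightforward to implement. The following sketch is illustrative (the function name and statistics are ours, not the paper's code) and applies it to a hypothetical F0 contour:

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Single-Gaussian F0 conversion of Eq. (38): scale by the ratio of
    standard deviations, then shift toward the target mean."""
    ratio = np.sqrt(sigma_tgt ** 2 / sigma_src ** 2)
    return ratio * f0_src + mu_tgt - ratio * mu_src

# hypothetical F0 contour (Hz) with assumed source/target statistics
f0 = np.array([200.0, 210.0, 190.0])
out = convert_f0(f0, mu_src=200.0, sigma_src=20.0, mu_tgt=120.0, sigma_tgt=10.0)
# -> [120.0, 125.0, 115.0]
```

In many conversion systems this transform is applied to log-F0 rather than raw F0; the paper applies it to the fundamental frequency directly, which the sketch follows.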
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as
$F(X) = E[Y \mid X] = \displaystyle\int Y\, P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X) \left[ \mu_i^Y + \Sigma_i^{YX} \left(\Sigma_i^{XX}\right)^{-1} \left(X - \mu_i^X\right) \right]$  (39)

$P_i(X) = \dfrac{\omega_i\, b_i(X_t)}{\sum_{k=1}^{M} \omega_k\, b_k(X_t)}$  (40)

$\mu_i = \begin{bmatrix} \mu_i^X \\ \mu_i^Y \end{bmatrix}, \quad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \quad i = 1, \ldots, M$  (41)
$\mu_i^X$ and $\mu_i^Y$ are the means of the $i$-th Gaussian component of the source speaker and the target speaker, $\Sigma_i^{XX}$ is the covariance matrix of the $i$-th Gaussian component of the source speaker, and $\Sigma_i^{XY}$ is the cross-covariance matrix of the $i$-th Gaussian component between the source speaker and the target speaker. $P_i(X)$ is the probability that the feature vector $X$ belongs to the $i$-th Gaussian component of the GMM.

Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" envelope.
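The regression of (39)-(40) can be sketched as follows. The statistics below are toy values for two one-dimensional components, not the paper's trained model; in the actual system the components are fitted to joint source-target MFCC features.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """GMM regression (Eqs. (39)-(40)) for a single source frame x.
    Per-component statistics are stacked along axis 0."""
    M, n = mu_x.shape
    # posterior P_i(x) of each Gaussian component (Eq. (40))
    post = np.empty(M)
    for i in range(M):
        diff = x - mu_x[i]
        inv = np.linalg.inv(cov_xx[i])
        norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(cov_xx[i]))
        post[i] = weights[i] * norm * np.exp(-0.5 * diff @ inv @ diff)
    post /= post.sum()
    # conditional mean E[Y | X] (Eq. (39))
    y = np.zeros(mu_y.shape[1])
    for i in range(M):
        y += post[i] * (mu_y[i] + cov_yx[i] @ np.linalg.inv(cov_xx[i]) @ (x - mu_x[i]))
    return y

# two 1-D components with hypothetical statistics
w = np.array([0.5, 0.5])
mu_x = np.array([[0.0], [4.0]]); mu_y = np.array([[10.0], [20.0]])
cov_xx = np.array([[[1.0]], [[1.0]]]); cov_yx = np.array([[[0.5]], [[0.5]]])
y = gmm_convert(np.array([0.0]), w, mu_x, mu_y, cov_xx, cov_yx)
```

A source frame near the first component's mean is mapped close to that component's target mean, as (39) predicts.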
4. Melody Control Model
The composition of Beijing Opera has similarities with the synthesis of general singing voice [7, 8]: through the superimposition of voice and melody, the new pitch of each word is reconstructed. The analysis in Section 2 shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody: it indicates the vibration frequency of the human vocal folds. The duration, i.e., the pronunciation length of each word, controls the rhythm of Beijing Opera, which represents the speed of the human voice. Energy is positively correlated with sound intensity and represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the prose, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental frequency envelope of a Beijing Opera piece corresponds to the melody, which includes tone, pitch, and tremolo [9], whereas the pitch of a note in a score is a constant; their comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. Therefore, the control design of the fundamental frequency [10] is shown in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has different syllables, and the initials and vowels in each syllable
Figure 8: The control design of the fundamental frequency (the F0 extracted from MIDI undergoes vibrato processing; Gaussian white noise passes through a high-pass filter; their combination yields the output fundamental frequency curve).
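The F0 control design of Figure 8 (vibrato plus filtered noise) can be sketched in a few lines. All parameter values here (vibrato rate and depth, noise amplitude, the first-difference high-pass) are illustrative assumptions on our part, not the paper's settings.

```python
import numpy as np

def f0_curve(f0_note, dur_s, sr=200, vib_rate=5.5, vib_depth=0.02,
             noise_amp=0.3, seed=0):
    """Sketch of the Figure 8 design: a MIDI-derived base F0 is given a
    sinusoidal vibrato, and high-pass-filtered Gaussian white noise adds
    natural micro-fluctuation."""
    n = int(dur_s * sr)
    t = np.arange(n) / sr
    vibrato = 1.0 + vib_depth * np.sin(2 * np.pi * vib_rate * t)
    noise = np.random.default_rng(seed).standard_normal(n)
    hp = np.diff(noise, prepend=noise[0])  # first difference acts as a crude high-pass filter
    return f0_note * vibrato + noise_amp * hp

curve = f0_curve(f0_note=440.0, dur_s=1.0)
```

The resulting contour oscillates around the note's pitch; a real system would also smooth note-to-note transitions.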
Table 2: Duration parameters

Before modification | After modification
dur_a               | k * dur_a
dur_b               | dur_b
dur_c               | dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: initial-to-vowel transition part duration; dur_c: final (vowel) part duration; dur_t: target total duration.
also play different roles. The initials, whether in normal speech or Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of Beijing Opera, we use the note duration to control the length of each word and make the rules for the vowel length shown in Table 2.

The duration of the initial part is modified according to the proportion k in the table [11]; k is obtained from a large number of comparison experiments between speech and song. The duration of the transition area from initial to vowel remains unchanged. The length of the vowel section varies so that the total duration of the syllable corresponds to the duration of each note in the score.
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
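The duration rules of Table 2 reduce to a small calculation. The sketch below adjusts one syllable to a note duration; k = 0.8 is an assumed value (the paper derives k from speech/song comparison experiments), and all durations are hypothetical.

```python
def adjust_durations(dur_a, dur_b, dur_t, k=0.8):
    """Apply the Table 2 rules: scale the initial part by k, keep the
    transition part, and let the vowel part absorb the remainder so the
    syllable matches the note duration dur_t."""
    new_a = k * dur_a          # modified initial part
    new_b = dur_b              # transition part is unchanged
    new_c = dur_t - new_a - new_b  # vowel part fills the rest
    return new_a, new_b, new_c

# a 0.15 s spoken syllable stretched onto a 0.60 s note
a, b, c = adjust_durations(dur_a=0.10, dur_b=0.05, dur_t=0.60)
```

By construction the three parts always sum exactly to the note duration, which is the point of letting the vowel absorb the difference.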
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that a well-trained singing voice has a special resonance peak (the singing formant) in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listeners' perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model
4.4.1. Introduction of the GAN Network. Generative adversarial networks, abbreviated as GAN [12-17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: random noise is input; pseudo data are obtained by the generator; a part of the real data is taken from the true data; the two are mixed and sent to the discriminator, which gives a true-or-false determination, and the loss is returned according to this result. The purpose of a GAN is to estimate the potential distribution of data samples and generate new data samples. It is being extensively studied in the fields of image and visual computing and speech and language processing and has a huge application prospect. This study uses a GAN to synthesize music to compose Beijing Opera music.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, it is necessary to solve this data imbalance problem by merging the sound tracks of similar instruments. Each of the multi-track Beijing Opera scores is merged into five musical instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.
Then we filter the dataset after the tracks are merged, selecting the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the scores of the parts of Beijing Opera without lyrics are not what we need, so we also select only the soundtracks that accompany Beijing Opera lyrics.
Finally, in order to obtain meaningful music segments to train the time model, it is necessary to divide the Beijing Opera score into corresponding music segments. Four bars are treated as a passage, and longer passages are cut into this length. Because pitches that are too high or too low are not common, pitches lower than C1 or higher than C8 are discarded, so the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track). This completes the preparation and preprocessing of the dataset.
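The 4 × 96 × 84 × 5 target tensor described above can be sketched directly. The MIDI offset for C1 and the track ordering are our assumptions for illustration, not stated in the paper.

```python
import numpy as np

# Target tensor shape from the text: 4 bars x 96 time steps x 84 pitches x 5 tracks
BARS, STEPS, PITCHES, TRACKS = 4, 96, 84, 5
PITCH_OFFSET = 24  # assumed MIDI note number of C1

def empty_phrase():
    """One binary piano-roll phrase tensor."""
    return np.zeros((BARS, STEPS, PITCHES, TRACKS), dtype=np.uint8)

def set_note(phrase, bar, step, midi_pitch, track):
    """Mark a note-on; pitches outside the 84-semitone range are
    silently discarded, mirroring the dataset's pitch-range filtering."""
    idx = midi_pitch - PITCH_OFFSET
    if 0 <= idx < PITCHES:
        phrase[bar, step, idx, track] = 1
    return phrase

p = set_note(empty_phrase(), bar=0, step=0, midi_pitch=60, track=0)  # middle C
```

Each 4-bar phrase is one training example; notes outside the retained pitch range simply never appear in the tensor.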
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is shown in Figure 10.
The basic framework of the GAN includes a pair of models: a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution by the generator G with the aid of the discriminator D. The input of the model is a random Gaussian white noise signal z; the noise signal is mapped to a new data space via the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the inputs of the true data x and the generated data G(z), respectively, indicating D's confidence that the input is real data rather than generated false data. In this way, it is judged whether the data generated by G is good or bad. When finally D cannot distinguish between the real data x and the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish between real data and false data, so that D(x) is as large as possible while D(G(z)) is as small as possible, with the difference between the two as large as possible. The goal of G, in contrast, is to make the performance 'D(G(z))' of its own data on D consistent with
Figure 9: Illustration of the dataset preparation and data preprocessing procedure (Beijing Opera soundtracks → merge five tracks: huqins, flutes, suonas, drums, and cymbals → combined datasets → screening music: select only sections with the best matching confidence and the soundtracks of Beijing Opera lyrics → data cleaning → training datasets).
Figure 10: GAN structure diagram (random noise z ~ p(z) → generator G → fake data G(z); the discriminator/critic (WGAN-GP) receives the fake data G(z) and the real data x, 4-bar phrases of 5 tracks, and outputs real/fake).
the performance 'D(x)' of the real data, so that D cannot distinguish between generated data and real data. Therefore, the optimization process of the model is a process of mutual competition and confrontation. The performance of G and D is continuously improved during repeated iterations until finally D(G(z)) is consistent with the performance D(x) of the real data, and neither G nor D can be further optimized.
The training process can be modeled as a simple MinMax problem:
$\min_G \max_D \; D(x) - D(G(z))$  (42)
The MinMax optimization formula is defined as follows
$\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]$  (43)
The GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but samples from it directly. Theoretically it can completely approximate the real data. This is the biggest advantage of the GAN.
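The value function of (43) can be evaluated directly from discriminator outputs. The sketch below uses toy outputs rather than a trained network; it only shows that a discriminating D attains a higher V than a fooled one, which is the quantity the two players fight over.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Evaluate the minimax objective V(D, G) of Eq. (43) from the
    discriminator's outputs on real samples and on generated samples.
    D maximizes this value; G minimizes it."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# a confident discriminator (real ~ 1, fake ~ 0) vs. a fooled one (both 0.5)
v_confident = gan_value(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
v_fooled = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

At the equilibrium where D outputs 0.5 everywhere, V = -2 log 2, the classical GAN optimum; note that Figure 10 indicates the paper actually trains with a WGAN-GP critic, whose objective replaces the log terms with raw critic scores as in (42).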
The training and testing process of the GAN generatedmusic dataset is as in Figure 11
The chord-section data and specific music-style data produced by the generator, together with the multi-track chord-section data and multi-track music groove data, are sent to the GAN for training, so that it learns to generate music of specific styles with the corresponding grooves.
5. Experiment
5.1. Tone Control Model
5.1.1. Experimental Process. The voice library used in the experimental simulation of this article is recorded in an anechoic room, and, with comprehensive consideration of the aforementioned factors, it meets the actual needs of the speech conversion system. The voice library is recorded by a woman with a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., as the source speech. Another person then records a small number of statements as the voice to be converted; Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. Figures 13, 14, and 15 show the spectrograms of the source speaker's speech, the target speaker's speech, and the converted speech obtained with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and each voice is set to 5 s during the experiment.
They show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs it can be seen directly that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN (the bar generator G maps noise vectors z to generated bars G(z); chord and style inputs, and chord and groove inputs, condition the generation).
Figure 12: Tone control model. Training phase: STRAIGHT analysis extracts the fundamental frequency F0 and spectral envelope of the source and target voices; DTW performs time alignment; GMM training establishes the mapping rules, and the single-Gaussian-model method calculates the F0 mean and variance. Conversion phase: STRAIGHT analysis of the voice to be converted, MFCC parameter conversion, tone (F0) conversion, and STRAIGHT synthesis of the converted voice.
5.2. Melody Control Model
5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing, followed by conversions using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Operas produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis, dura_F0 uses only the fundamental frequency control model and the duration control model, dura_SP uses only the duration control model and the spectrum control model, and all_models uses the three control models simultaneously. 'Real' is the source Beijing Opera.
So the melody control model can be summarized inFigure 16
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based
Table 3: MOS grading

Score | Evaluation
1     | Uncomfortable and unbearable
2     | There is a sense of discomfort, but it can be endured
3     | Distortion can be detected and feels uncomfortable
4     | Slightly perceptible distortion, but no discomfort
5     | Good sound quality, no distortion
Table 4: Experimental results (MOS scores)

Ways       | Beijing Opera 1 | Beijing Opera 2 | Beijing Opera 3
Only_dura  | 1.25 | 1.29 | 1.02
dura_F0    | 1.85 | 1.97 | 1.74
dura_SP    | 1.78 | 2.90 | 2.44
all_models | 3.27 | 3.69 | 3.28
real       | 5    | 5    | 5
Figure 13: Source speech spectrogram.
on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating the voice quality, and the similarity test is a method for judging the conversion effect of the system.
The MOS scoring criterion divides the speech quality into 5 levels; see Table 3. The tester listens to the converted speech and, according to these 5 levels, gives the score of the quality level to which the measured speech belongs. A MOS score of about 3.5 is called communication quality; at this level the quality of the auditorily reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality; at this level the speech has high intelligibility but poor naturalness.
Ten testers were recruited to give MOS scores for the above synthesis results. The results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies; the melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is distinguished by its unique characteristics, such as the zero-crossing rate and energy. Then the tone control model and the melody control model are designed, the important parameters such as the fundamental frequency, spectrum, and duration are extracted, and MFCC, DTW, GMM, and other tools are used to convert the extracted features, finally yielding the synthesized opera fragments.
Compared with other algorithms, the straight algorithm has better performance in terms of the naturalness of the synthesis and the range of parameter modification, so the straight algorithm is also selected for the synthesis of the Beijing Opera.
Again let the above-mentioned 10 testers perform MOSscoring on the above composite effect The result is shown inTable 5
According to the test results, the subjective test reached an average of 3.7 points, indicating that the design basically completes the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real Beijing Opera.
6. Conclusion
In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the straight algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm in machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are
Table 5: Rating results (MOS score)

                         | student1 | student2 | student3 | student4 | student5
Source Opera fragment    | 5 | 5 | 5 | 5 | 5
Synthetic Opera fragment | 4 | 4 | 4 | 3 | 3

                         | student6 | student7 | student8 | student9 | student10
Source Opera fragment    | 5 | 5 | 5 | 5 | 5
Synthetic Opera fragment | 4 | 3 | 4 | 4 | 4
Figure 14: Target speech spectrogram.
Figure 15: Converted speech spectrogram.
Figure 16: Melody control model (feature extraction from the voice yields the syllable fundamental frequency, spectrum envelope, and syllable duration; MIDI provides the note fundamental frequency and note length; the F0 control model, the time length control model, and the spectrum control model together drive the Beijing Opera synthesis).
truly able to compose the Beijing Opera singing art workswith higher quality
Data Availability
The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [Zenodo] repository (http://doi.org/10.5281/zenodo.344932). The previously reported straight algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR," no. 3132017XNG1750; and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{2\,(k - f(m-1))}{[f(m+1) - f(m-1)]\,[f(m) - f(m-1)]}, & f(m-1) \le k \le f(m) \\ \dfrac{2\,(f(m+1) - k)}{[f(m+1) - f(m-1)]\,[f(m+1) - f(m)]}, & f(m) \le k \le f(m+1) \\ 0, & k \ge f(m+1) \end{cases}$  (5)
(6) Calculate the logarithmic energy output of each filter bank as

$s(m) = \ln\left[\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right], \quad 0 \le m \le M$  (6)
(7) The MFCC coefficients are obtained by the discrete cosine transform (DCT) as

$C(n) = \sum_{m=1}^{M} s(m) \cos\left[\dfrac{\pi n (m - 0.5)}{M}\right], \quad n = 1, 2, \ldots, L$  (7)
The above logarithmic energies are taken into the DCT to obtain the L-order Mel-scale cepstrum parameters. The order L is the MFCC coefficient order, usually 12-16. Here M is the number of triangular filters.
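Equations (5)-(7) can be sketched end to end. This version uses the common peak-normalized triangular filters rather than the exact scaling of (5), and all sizes (26 filters, 512-point FFT, 16 kHz, 13 coefficients) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mel_filterbank(M, N, sr):
    """Triangular Mel filters H_m(k) in the spirit of Eq. (5);
    N is the FFT size, M the number of filters."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2), M + 2)
    f = np.floor((N + 1) * mel_to_hz(mels) / sr).astype(int)  # boundary bins f(m)
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        # rising slope f(m-1)..f(m), falling slope f(m)..f(m+1)
        H[m - 1, f[m - 1]:f[m]] = (np.arange(f[m - 1], f[m]) - f[m - 1]) / max(f[m] - f[m - 1], 1)
        H[m - 1, f[m]:f[m + 1]] = (f[m + 1] - np.arange(f[m], f[m + 1])) / max(f[m + 1] - f[m], 1)
    return H

def mfcc_from_power(power_spec, H, L=13):
    """Eqs. (6)-(7): log filter-bank energies followed by a DCT."""
    M = H.shape[0]
    s = np.log(H @ power_spec + 1e-10)               # Eq. (6)
    n = np.arange(1, L + 1)[:, None]
    m = np.arange(1, M + 1)[None, :]
    return np.cos(np.pi * n * (m - 0.5) / M) @ s     # Eq. (7)

H = mel_filterbank(M=26, N=512, sr=16000)
c = mfcc_from_power(np.ones(257), H)
```

Feeding the power spectrum of each analysis frame through this pipeline yields one L-dimensional MFCC vector per frame.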
(8) Logarithmic Energy In addition the volume (ie energy)of a frame is also an important feature of speech and is veryeasy to calculate Therefore the logarithmic energy of oneframe is usually added so that the basic speech features of eachframe have one more dimension including one logarithmicenergy and the remaining cepstrum parameters
(9) Dynamic Segmentation Parameters Extraction (including First-Order Difference and Second-Order Difference). The standard cepstrum parameter MFCC only reflects the static characteristics of the speech parameters; the dynamic characteristics of the speech can be described by the difference spectrum of these static characteristics. Experiments show that combining dynamic and static features can effectively improve the system's recognition performance. The calculation of the difference parameter can use the following formula:
$d_t = \begin{cases} C_{t+1} - C_t, & t < K \\ \dfrac{\sum_{k=1}^{K} k\, (C_{t+k} - C_{t-k})}{2 \sum_{k=1}^{K} k^2}, & \text{otherwise} \\ C_t - C_{t-1}, & t \ge Q - K \end{cases}$  (8)
where $d_t$ is the $t$-th first-order difference, $C_t$ is the $t$-th cepstrum coefficient, $Q$ is the order of the cepstral coefficients, and $K$ is the time span of the first derivative, which can be 1 or 2. Substituting the result into the above equation again yields the second-order difference parameter.
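The regression of (8), with the edge cases simplified to single forward/backward differences (an assumption on our part), can be sketched as:

```python
import numpy as np

def delta(C, K=2):
    """Regression-based difference of Eq. (8) for a cepstral sequence C
    (one value or vector per frame); simple forward/backward differences
    handle the sequence edges."""
    T = len(C)
    d = np.zeros_like(C, dtype=float)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    for t in range(T):
        if t < K:                       # leading edge: forward difference
            d[t] = C[min(t + 1, T - 1)] - C[t]
        elif t >= T - K:                # trailing edge: backward difference
            d[t] = C[t] - C[t - 1]
        else:                           # interior: regression formula
            d[t] = sum(k * (C[t + k] - C[t - k]) for k in range(1, K + 1)) / denom
    return d

C = np.arange(10.0)   # a linearly increasing "cepstral" track
d1 = delta(C)         # first-order difference
d2 = delta(d1)        # second-order difference (delta of delta)
```

On a linear track the first-order delta is constant and the second-order delta vanishes, which is a convenient sanity check.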
2.3. Signal Characteristics Analysis. According to previous research on speech signal processing technology, signal analysis mainly focuses on two domains: the time domain and the frequency domain.
2.3.1. Time-Domain Analysis. In the time domain, the horizontal axis is the time and the vertical axis is the amplitude. By observing the waveform in the time domain, we can obtain some important features of the speech signal, such as the duration, the starting and ending positions of the syllables, the sound intensity (energy), and the vowels (see Figure 4).
2.3.2. Frequency Domain Analysis. This includes the voice signal spectrum, power spectrum, cepstrum, spectral envelope, and so on. It is generally considered that the frequency spectrum of the speech signal is the product of the frequency response of the channel system and the spectrum of the excitation source, while the frequency response of the channel system and the excitation source are time-varying. Therefore, frequency domain analysis of speech signals is often performed using the short-time Fourier transform (STFT). It is defined as
$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} x(m)\, w(n - m)\, e^{-j\omega m}$  (9)
The study of the Chinese song synthesis algorithm is based on parameter modification. The short-time Fourier transform has two independent variables (n and ω), so it is both a discrete function of time n and a continuous function of angular frequency ω. In the formula, w(n) is a window function; different values of n select different short voice segments, and the subscript n distinguishes it from the standard Fourier transform. Since the shape of the window influences the short-time spectrum, the window function should have the following characteristics:
(1) High frequency resolution: the main lobe is narrow and sharp.

(2) Large side-lobe attenuation, so that the spectrum leakage caused by other frequency components is small. These two conditions are in fact contradictory and cannot be satisfied at the same time; therefore, we often adopt a compromise and choose a Hamming window.
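Equation (9) with the Hamming-window compromise can be sketched as follows; the frame length, hop size, and test tone are assumed values for illustration.

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Short-time Fourier transform of Eq. (9) with a Hamming window:
    slide the window along x and take the FFT of each windowed frame."""
    w = np.hamming(win_len)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        frames.append(np.fft.rfft(x[start:start + win_len] * w))
    return np.array(frames)            # shape: (n_frames, win_len // 2 + 1)

# a 1 kHz tone at 16 kHz should peak near bin 1000 / (16000 / 256) = 16
sr = 16000
t = np.arange(sr) / sr
X = stft(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(np.abs(X[0]).argmax())
```

The magnitude of this frame matrix, plotted over time and frequency, is exactly the spectrogram discussed in the next subsection.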
However, both time-domain analysis and frequency domain analysis have their own limitations: time-domain analysis does not give an intuitive visualization of the
Figure 3: Mel frequency filter bank.
Figure 4: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" time-domain diagram.
frequency characteristics of speech signals, and frequency domain analysis lacks the variation of speech signals over time. As a result, the Beijing Opera synthesis experiment analyzed the speech signal using the later, improved method of analyzing the spectrum.
2.3.3. Spectrum Analysis. The Fourier analysis display of the speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional spectrum that represents the frequency spectrum of a voice over time, with the vertical axis as the frequency and the horizontal axis as the time. The intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the speech sentence. It combines the characteristics of spectra and time-domain waveforms to clearly show how the speech spectrum changes over time; that is, it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters (Figure 5).
2.4. Straight Algorithm Introduction. Straight is an acronym for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram". It is a relatively accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The straight algorithm builds on the source-filter model: the source comes from the vocal cord vibration, and the filter refers to the channel (vocal tract) transfer function. It can adaptively interpolate and smooth the short-time speech spectrum in the time domain and the frequency domain, so as to extract the spectral envelope more accurately and adjust the speech duration, fundamental frequency, and spectral parameters to a great extent without affecting the quality of the synthesized speech. The straight analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral
Figure 5: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" spectrogram.
parameter estimation, and speech synthesis. The first two are described in detail below; only the synthesis process is shown in Figure 6.
First, the speech signal is input; the fundamental frequency F0 and the spectral envelope are extracted by the straight algorithm, and the parameters are modulated to generate a new sound source and a time-varying filter. According to the source-filter model, we use (10) to synthesize the voice:
$y(t) = \sum_{t_i \in Q} \dfrac{1}{\sqrt{G(f_0(t_i))}}\, v_{t_i}(t - T(t_i))$  (10)
$v_{t_i}(t)$ and $T(t_i)$ are given as
$v_{t_i}(t) = \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{+\infty} V(\omega, t_i)\, \varphi(\omega)\, e^{j\omega t}\, d\omega$  (11)

$T(t_i) = \sum_{t_k \in Q,\, k < i} \dfrac{1}{\sqrt{G(f_0(t_k))}}$  (12)
In the formula, Q represents the positions of a group of samples in the synthesis excitation, and G represents the pitch modulation. The modulated F0 can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. $V(\omega, t_i)$ is the Fourier transform of the corresponding minimum-phase pulse, as in (13); $A\{S[u(\omega), r(t)], u(\omega), r(t)\}$ is calculated from the modulation amplitude spectrum, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15).
V(\omega, t) = \exp\!\left(\frac{1}{\sqrt{2\pi}} \int_0^{\infty} h_t(q)\, e^{j\omega q}\, dq\right)   (13)

h_t(q) = \begin{cases} 0 & (q < 0) \\ c_t(0) & (q = 0) \\ 2 c_t(q) & (q > 0) \end{cases}   (14)

c_t(q) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-j\omega q}\, \lg A\big[S(u(\omega), r(t));\, u(\omega), r(t)\big]\, d\omega   (15)
q is the frequency. STRAIGHT audiometry experiments show that, even when heard through high-sensitivity headphones, the synthesized speech signal is almost indistinguishable from the original signal.

Figure 6: STRAIGHT synthesis system. The input voice undergoes fundamental frequency F0 extraction and spectral envelope extraction; after parameter adjustment, the resulting sound source and time-varying filter produce the output synthetic speech.
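The pitch-synchronous synthesis of (10) can be sketched as follows: a minimal numpy illustration that places one excitation pulse per fundamental period with a 1/\sqrt{G(f_0)}-style gain. The Hanning-shaped pulse and the constant F0 track are illustrative stand-ins for the minimum-phase pulses v_{t_i} that STRAIGHT actually derives from the spectral envelope:

```python
import numpy as np

def synthesize(f0_track, pulse, fs):
    """Pitch-synchronous overlap-add: place one excitation pulse per
    fundamental period, scaled by 1/sqrt(f0) as a stand-in for the
    1/sqrt(G(f0)) gain of (10); the pitch marks accumulate period by
    period, as T(t_i) does in (12)."""
    n = len(f0_track)
    y = np.zeros(n + len(pulse))
    t = 0.0
    while int(t) < n:
        i = int(t)
        y[i:i + len(pulse)] += pulse / np.sqrt(f0_track[i])
        t += fs / f0_track[i]          # next pitch mark
    return y[:n]

fs = 16000
f0_track = np.full(fs // 10, 200.0)    # 100 ms at a constant 200 Hz
pulse = np.hanning(64)                 # illustrative stand-in for v_{t_i}
y = synthesize(f0_track, pulse, fs)
```

With the constant 200 Hz track, a pulse lands every fs/200 = 80 samples, which is exactly the period-by-period accumulation of pitch marks described by (12).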
3. Tone Control Model
Voice tone conversion refers to the voice-signal processing technology that keeps the semantic content of the voice unchanged and changes only the tone, so that one person's voice signal (the source voice) sounds, after conversion, like another person's voice (the target voice). This chapter introduces the extraction, using the STRAIGHT algorithm, of the parameters that are closely related to the timbre; the extracted parameters are then trained with a GMM to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are synthesized with STRAIGHT in order to achieve voice conversion. It can be seen from Section 2 that the tone characteristics of speech mainly correspond to the parameters "fundamental frequency F0" and "channel spectrum".
3.1. Fundamental Frequency and Channel Spectrum Extraction

3.1.1. Extraction of the Fundamental Frequency. The STRAIGHT algorithm has good time-domain resolution for the fundamental frequency trajectory. Its extraction is based on the wavelet transform: the fundamental component is first found from the extracted audio frequencies, and then its instantaneous frequency is calculated as the fundamental frequency.
The extraction of the fundamental frequency can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning of F0 applies the wavelet transform to the voice signal to obtain the wavelet coefficients; the wavelet coefficients are then transformed into a set of instantaneous frequencies from which F0 is selected for each frame. F0 trajectory smoothing selects, based on the calculated high-frequency energy ratio (equivalent to minimum noise energy), the most likely F0 among the instantaneous frequencies, thus constituting a smooth pitch trajectory. F0 fine positioning fine-tunes the current F0 through an FFT. The process is as follows.
The input signal is s(t) and the output composite signal is D(t, \tau_c), where g_{AG}(t) is the analyzing wavelet, obtained by passing the input signal through a Gabor filter, and \tau_c is the analysis period of the analyzing wavelet, as in (16).
D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(\mu)\, g_{AG}\!\left(\frac{t - \mu}{\tau_c}\right) d\mu   (16)
g_{AG}(t) is given by (17) and (18):

g_{AG}(t) = g\!\left(t - \frac{1}{4}\right) - g\!\left(t + \frac{1}{4}\right)   (17)

g(t) = e^{-\pi (t/\eta)^2}\, e^{-j 2\pi t}   (18)
Among them, \eta is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.
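Equations (17) and (18) translate directly into code; the sketch below (with an assumed resolution \eta = 1.5) shows that the analyzing wavelet is the difference of two quarter-period-shifted Gabor functions:

```python
import numpy as np

ETA = 1.5   # assumed frequency resolution of the Gabor filter (> 1)

def g(t, eta=ETA):
    """Gabor function of (18): Gaussian envelope times a complex exponential."""
    return np.exp(-np.pi * (t / eta) ** 2) * np.exp(-2j * np.pi * t)

def g_ag(t, eta=ETA):
    """Analyzing wavelet of (17): difference of two shifted Gabor functions."""
    return g(t - 0.25, eta) - g(t + 0.25, eta)

t = np.linspace(-4.0, 4.0, 1001)
w = g_ag(t)    # a band-pass analyzing kernel in normalized time
```

At t = 0 the two shifted terms are complex conjugates, so g_ag(0) is purely imaginary, reflecting the odd (band-pass, zero-mean) symmetry of the kernel.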
Through calculation, the variable "fundamentalness" is introduced and denoted by M(t, \tau_0), as in (19).
M = -\log\!\left[\int_{\Omega} \left(\frac{d|D|}{du}\right)^2 du\right] + \log\!\left[\int_{\Omega} |D|^2\, du\right] - \log\!\left[\int_{\Omega} \left(\frac{d^2 \arg D}{du^2}\right)^2 du\right] + 2\log \tau_0 + \log \Omega(\tau_0)   (19)
The first term is the amplitude modulation (AM) value; the second term is the total energy, used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the logarithm of the squared fundamental period \tau_0, used to normalize the FM value; the fifth is the normalization factor of the time-domain integration interval. From the formula it can be seen that when AM and FM take their minimum, M takes its maximum, which identifies the fundamental component.
However, in practice F0 always changes rapidly, so in order to reduce its impact on M, the formula is adjusted as in (20), (21), and (22).
M = -\log\!\left[\int_{\Omega} \left(\frac{d|D|}{du} - \mu_{AM}\right)^2 du\right] + \log\!\left[\int_{\Omega} |D|^2\, du\right] - \log\!\left[\int_{\Omega} \left(\frac{d^2 \arg D}{du^2} - \mu_{FM}\right)^2 du\right] + 2\log \tau_0 + \log \Omega(\tau_0)   (20)

\mu_{AM} = \frac{1}{\Omega} \int_{\Omega} \frac{d|D|}{du}\, du   (21)

\mu_{FM} = \frac{1}{\Omega} \int_{\Omega} \frac{d^2 \arg D}{du^2}\, du   (22)
Finally, \tau_0 is used to calculate the instantaneous frequency \omega(t), and the fundamental frequency F0 is obtained by (23), (24), and (25).
f_0 = \frac{\omega_0(t)}{2\pi}   (23)

\omega(t) = 2 f_s \arcsin \frac{|y_d(t)|}{2}   (24)

y_d(t) = \frac{D(t + \Delta t/2,\ \tau_0)}{|D(t + \Delta t/2,\ \tau_0)|} - \frac{D(t - \Delta t/2,\ \tau_0)}{|D(t - \Delta t/2,\ \tau_0)|}   (25)
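The fine-positioning step of (23)-(25) can be checked on a toy signal: for a clean complex exponential, the chord length between consecutive unit phasors recovers the rotation rate exactly. A minimal numpy sketch, assuming the wavelet output D is well approximated by such an exponential:

```python
import numpy as np

fs = 16000.0
f0_true = 220.0

# Assume the wavelet output D(t, tau0) for a clean periodic input is
# approximately a complex exponential rotating at the fundamental frequency.
t = np.arange(1024) / fs
D = np.exp(2j * np.pi * f0_true * t)

# (25): difference of unit phasors one sample apart (delta t = 1/fs)
phasor = D / np.abs(D)
y_d = phasor[1:] - phasor[:-1]

# (24) then (23): instantaneous frequency from the phasor chord length
omega = 2.0 * fs * np.arcsin(np.abs(y_d) / 2.0)
f0_est = omega / (2.0 * np.pi)    # recovers 220 Hz at every sample
```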
3.1.2. Channel Spectral Parameter Extraction. The previous approach extracted the sound-source information and the channel-spectrum information from the voice and then adjusted them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound-source parameter p(t) is given by (26):

s(t) = p(t) * v(t)   (26)
Since it is difficult to find v(t) directly, the STRAIGHT algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated as in (27) and (28).
s_w(t, t') = s(t)\, w(t, t')   (27)

S_W(\omega, t') = \mathrm{FFT}\big[s_w(t, t')\big] = S(\omega, t') * W(\omega, t')   (28)
The short-time spectrum shows the periodicity related to the fundamental frequency in the time domain and the frequency domain, respectively. The window function used for the short-time spectrum is given in (29) and (30).
w(t) = \frac{1}{f_0}\, e^{-\pi (t f_0)^2}   (29)

W(\omega) = f_0 \sqrt{2\pi}\, e^{-\pi (\omega/\omega_0)^2}   (30)
However, since both the channel spectrum and the sound-source spectrum are related to the fundamental frequency, they cannot yet be considered separated at this point. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.
Periodicity removal in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, as in (31), (32), and (33).
w_p(t) = e^{-\pi (t/\tau_0)^2} * h\!\left(\frac{t}{\tau_0}\right)   (31)

h(t) = \begin{cases} 1 - |t| & (|t| < 1) \\ 0 & (\text{otherwise}) \end{cases}   (32)

w_c(t) = w_p(t) \sin\!\left(\pi \frac{t}{\tau_0}\right)   (33)
Then the short-time amplitude spectra |S_p(\omega, t')| and |S_c(\omega, t')| are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is obtained as in (34):

|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^2 + \xi\, |S_c(\omega, t')|^2}   (34)
Among them, \xi is the mixing factor; when \xi = 0.13655 there is an optimal solution.
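A sketch of the time-domain periodicity removal of (31)-(34), with the convolution of (31) evaluated discretely; the windows below are toy evaluations on an illustrative grid, and remove_periodicity applies the \xi = 0.13655 blend:

```python
import numpy as np

def w_p(t, tau0):
    """Pitch-synchronous smoothing window of (31): a Gaussian convolved
    with the triangular kernel h of (32), evaluated by discrete convolution."""
    dt = t[1] - t[0]
    gauss = np.exp(-np.pi * (t / tau0) ** 2)
    support = np.arange(-tau0, tau0 + dt, dt)
    tri = np.maximum(1.0 - np.abs(support / tau0), 0.0)
    return np.convolve(gauss, tri, mode="same") * dt / tau0

def w_c(t, tau0):
    """Compensation window of (33)."""
    return w_p(t, tau0) * np.sin(np.pi * t / tau0)

def remove_periodicity(S_p, S_c, xi=0.13655):
    """Blend of (34): mix the amplitude spectra from the two windows."""
    return np.sqrt(np.abs(S_p) ** 2 + xi * np.abs(S_c) ** 2)

tau0 = 1.0 / 200.0                     # 5 ms pitch period (illustrative)
t = np.linspace(-0.02, 0.02, 801)
wp = w_p(t, tau0)
wc = w_c(t, tau0)
```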
Similarly, the frequency domain also needs a smoothing window V(\omega) and a compensation window U(\omega) to remove the periodicity from the short-time spectrum S_W(\omega); the spectral envelope with the periodicity removed, S_{S'}(\omega), is finally obtained as in (35):

S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega)   (35)
Finally, logarithmic amplitude compression and a discrete cosine transform on the warped (mel) frequency scale convert the channel spectral parameters into MFCC parameters for subsequent use (the MFCC is described in detail in Section 2).
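The final MFCC step (mel warping, log compression, DCT) can be sketched as below. This is a simplified stand-in that samples the envelope on a mel grid rather than applying the full triangular filter bank of Figure 3; the sizes n_mel = 24 and n_ceps = 13 are hypothetical choices:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_from_envelope(env, fs, n_mel=24, n_ceps=13):
    """Log-compress a mel-warped spectral envelope and apply a DCT-II.
    (Simplified stand-in: the envelope is sampled on a mel grid instead
    of being passed through a triangular mel filter bank.)"""
    freqs = np.linspace(0.0, fs / 2.0, len(env))
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mel)
    mel_freqs = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    log_env = np.log(np.interp(mel_freqs, freqs, env) + 1e-10)
    k = np.arange(n_ceps)[:, None]          # unnormalized DCT-II basis
    m = np.arange(n_mel)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_mel))
    return basis @ log_env

c = mfcc_from_envelope(np.ones(257), fs=16000)   # flat toy envelope
```

A flat envelope has (near-)zero log amplitude, so all cepstral coefficients vanish; scaling the envelope only moves the 0th coefficient, which is the DCT image of a constant log offset.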
3.2. Parameter Conversion with GMM

3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions, as in (36).
P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i b_i(X)   (36)

where X is an n-dimensional random vector, the \omega_i are mixture weights with \sum_{i=1}^{M} \omega_i = 1, and b_i(X) is a subdistribution of the GMM; each subdistribution is a Gaussian distribution, as in (37):

b_i(X) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}}\, e^{-(1/2)(X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i)}   (37)

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.
Although the types of phonemes are definite, each phoneme varies in different situations due to context. We use a GMM to model the acoustic characteristics of the speaker and find the most likely mapping at each time.
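Equations (36) and (37) translate directly into code; the sketch below evaluates a toy two-component GMM density (the component parameters are illustrative only):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Component density b_i(X) of (37)."""
    n = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm

def gmm_pdf(x, weights, means, covs):
    """Mixture density P(X | lambda) of (36): weighted sum of Gaussians."""
    return sum(w * gaussian_pdf(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Toy two-component model in 2-D (illustrative parameters only)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
p = gmm_pdf(np.zeros(2), weights, means, covs)
```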
3.2.2. Establishing the Conversion Function. GMM training estimates the probability density distribution of the samples, and the estimated model is the weighted sum of several Gaussian models. It maps the matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.

(1) Conversion of the Fundamental Frequency. Here the single-Gaussian model method is used to convert the fundamental frequency; the converted fundamental frequency is obtained from the means and variances of the target speaker (\mu_{tgt}, \sigma_{tgt}) and the source speaker (\mu_{src}, \sigma_{src}), as in (38).
f_{0,\mathrm{conv}}(t) = \sqrt{\frac{\sigma^2_{tgt}}{\sigma^2_{src}}}\, f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma^2_{tgt}}{\sigma^2_{src}}}\, \mu_{src}   (38)
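The single-Gaussian conversion of (38) is a one-line affine map; the sketch below uses hypothetical speaker statistics. By construction, the source mean is mapped exactly onto the target mean:

```python
import math

def convert_f0(f0_src, mu_src, var_src, mu_tgt, var_tgt):
    """Single-Gaussian F0 conversion of (38): scale by the ratio of the
    standard deviations and shift to the target mean."""
    ratio = math.sqrt(var_tgt / var_src)
    return ratio * f0_src + mu_tgt - ratio * mu_src

# Hypothetical speaker statistics: a lower-pitched source mapped onto a
# higher-pitched target.
mu_src, var_src = 180.0, 400.0
mu_tgt, var_tgt = 250.0, 900.0
converted = convert_f0(200.0, mu_src, var_src, mu_tgt, var_tgt)
```

A frame one source standard deviation above the source mean (200 = 180 + 20) lands one target standard deviation above the target mean (280 = 250 + 30).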
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as in (39), (40), and (41).
F(X) = E[Y \mid X] = \int Y\, P(Y \mid X)\, dY = \sum_{i=1}^{M} P_i(X) \left[ \mu^Y_i + \Sigma^{YX}_i \left(\Sigma^{XX}_i\right)^{-1} \left(X - \mu^X_i\right) \right]   (39)

P_i(X) = \frac{\omega_i b_i(X_t)}{\sum_{k=1}^{M} \omega_k b_k(X_t)}   (40)

\mu_i = \begin{bmatrix} \mu^X_i \\ \mu^Y_i \end{bmatrix}, \quad \Sigma_i = \begin{bmatrix} \Sigma^{XX}_i & \Sigma^{XY}_i \\ \Sigma^{YX}_i & \Sigma^{YY}_i \end{bmatrix}, \quad i = 1, \dots, M   (41)
\mu^X_i and \mu^Y_i are the means of the i-th Gaussian component for the source speaker and the target speaker, \Sigma^{XX}_i is the covariance matrix of the i-th Gaussian component of the source speaker, \Sigma^{XY}_i is the cross-covariance matrix of the i-th Gaussian component between the source speaker and the target speaker, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" envelope.
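The regression of (39)-(41) can be sketched with numpy as below; for a single-component model (a toy configuration chosen for illustration) the mapping reduces to one linear regression from X to Y:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    n = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / (
        (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov)))

def convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """GMM regression of (39): posterior-weighted sum of per-component
    linear predictions mu_y_i + cov_yx_i cov_xx_i^{-1} (x - mu_x_i)."""
    b = np.array([gaussian_pdf(x, m, c) for m, c in zip(mu_x, cov_xx)])
    post = weights * b / np.sum(weights * b)     # P_i(X) of (40)
    y = np.zeros_like(mu_y[0])
    for p, my, mx, cxx, cyx in zip(post, mu_y, mu_x, cov_xx, cov_yx):
        y += p * (my + cyx @ np.linalg.inv(cxx) @ (x - mx))
    return y

# Single-component toy model: the mapping reduces to one linear regression.
weights = np.array([1.0])
mu_x, mu_y = [np.zeros(2)], [np.array([1.0, -1.0])]
cov_xx, cov_yx = [np.eye(2)], [0.5 * np.eye(2)]
y = convert(np.array([2.0, 0.0]), weights, mu_x, mu_y, cov_xx, cov_yx)
```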
4. Melody Control Model
The composition of Beijing Opera has similarities with the synthesis of a general singing voice [7, 8]; that is, through the superimposition of voice and melody, the new pitch of each word is reconstructed. The analysis in the second chapter shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody: it indicates the frequency of vocal-fold vibration. The duration, the pronunciation length of each word, controls the rhythm of Beijing Opera and represents the speed of the voice. The energy is positively correlated with the sound intensity and represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech is closer to prose, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental envelope of a Beijing Opera corresponds to the melody, which includes tone, pitch, and tremolo [9]; the pitch of a note in the score, however, is a constant. Their comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera, but acoustic effects such as vibrato need to be considered. Therefore, the control design of the fundamental frequency [10] is as in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has different syllables, and the initials and vowels in each syllable
Figure 8: The control design of the fundamental frequency. The fundamental frequency extracted from MIDI undergoes vibrato processing, Gaussian white noise is passed through a high-pass filter, and the two are combined to output the basic frequency curve.
Table 2: Duration parameters

Parameter | Before modification | After modification
dur_a     | dur_a               | k * dur_a
dur_b     | dur_b               | dur_b
dur_c     | dur_c               | dur_t - (k * dur_a) - dur_b

dur_a: initial-part duration; dur_b: duration of the transition from initial to vowel; dur_c: final (vowel) part duration; dur_t: target total duration.
also play different roles. The initials, whether in normal speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of Beijing Opera, we use the note duration to control the length of each word and make rules for the vowel length, as shown in Table 2.
The duration of the initial part is modified in accordance with a proportion (k in the table); k is obtained from many comparison experiments between speech and song. The duration of the transition region from initial to vowel remains unchanged. The length of the vowel section varies so that the total duration of the syllable corresponds to the duration of each note in the score.
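The Table 2 rules can be sketched directly; k = 0.8 is an illustrative value only, since the paper determines k empirically from speech/song comparisons:

```python
def modify_durations(dur_a, dur_b, dur_t, k=0.8):
    """Table 2 rules: scale the initial part by k, keep the
    initial-to-vowel transition unchanged, and let the vowel part absorb
    the remainder so the syllable matches the note duration dur_t.
    (k = 0.8 is illustrative; the paper obtains k empirically.)"""
    new_a = k * dur_a
    new_b = dur_b
    new_c = dur_t - new_a - new_b
    return new_a, new_b, new_c

a, b, c = modify_durations(dur_a=0.10, dur_b=0.05, dur_t=0.80)
```

By construction the three parts always sum to the target note duration dur_t.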
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that a well-produced singing voice has a special resonance peak in the spectrum in the vicinity of 2.5-3 kHz (the singer's formant), and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model

4.4.1. Introduction of the GAN. Generative adversarial networks, abbreviated as GANs [12-17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: input random noise, obtain pseudo data from the generator, take a part of the real data from the true data, mix the two, and send them to the discriminator, which gives a true-or-false determination; the loss is returned according to this result. The purpose of a GAN is to estimate the potential distribution of data samples and generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize music to compose Beijing Opera music.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, this data-imbalance problem is solved by merging the tracks of similar instruments. Each of the multitrack Beijing Opera scores is merged into five instrument tracks: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.
Then the merged datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the score sections without lyrics are not what we need; only the soundtracks with Beijing Opera lyrics are selected.
Finally, in order to obtain meaningful music segments to train the time model, the Beijing Opera score must be divided into corresponding music segments. Four bars are treated as one phrase, and longer passages are cut into this length. Because pitches that are too high or too low are uncommon, notes lower than C1 or higher than C8 are discarded, and the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track). This completes the preparation and preprocessing of the dataset.
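The segmentation into the 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track) tensor can be sketched as below; the MIDI pitch window (note 24 upward, i.e., 84 semitones starting at C1 in one common convention) and the boolean piano-roll input are assumptions for illustration:

```python
import numpy as np

BARS, STEPS, PITCHES, TRACKS = 4, 96, 84, 5   # target tensor dimensions

def to_phrases(pianoroll):
    """Cut a boolean piano-roll of shape (time, 128, TRACKS) into 4-bar
    phrases of shape (BARS, STEPS, PITCHES, TRACKS), keeping the 84
    semitones from MIDI note 24 upward (C1 in one common convention) and
    dropping any trailing partial phrase."""
    roll = pianoroll[:, 24:24 + PITCHES, :]
    steps = BARS * STEPS
    n = roll.shape[0] // steps
    return roll[:n * steps].reshape(n, BARS, STEPS, PITCHES, TRACKS)

song = np.zeros((1000, 128, TRACKS), dtype=bool)   # toy score
phrases = to_phrases(song)
```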
4.4.3. Training and Testing of the GAN. The GAN structure used in this study is shown in Figure 10.
The basic framework of the GAN includes a pair of models: a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution, with the discriminator D assisting the generator G. The input of the model is a random Gaussian white-noise signal z; the noise signal is mapped to a new data space via the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on its input of the true data x or the generated data G(z), indicating D's confidence that the input is real data rather than generated false data. In this way it is judged whether the data generated by G are good or bad. When D finally cannot distinguish between the real data x and the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish between real data and false data, making D(x) as large as possible while D(G(z)) is as small as possible, so that the difference between the two is as large as possible. Conversely, the goal of G is to make the performance D(G(z)) of its own data on D consistent with
Figure 9: Illustration of the dataset preparation and data preprocessing procedure. The audio tracks of the Beijing Opera soundtrack are merged into five instrument tracks (huqins, flutes, suonas, drums, and cymbals); the combined datasets are then screened to keep only the sections with the best matching confidence and the soundtracks with Beijing Opera lyrics; finally, data cleaning yields the training datasets.
Figure 10: GAN structure diagram. Random noise z ~ p(z) enters the generator G, producing fake data G(z) as 4-bar phrases of 5 tracks; the discriminator/critic (WGAN-GP) receives G(z) and the real data X and outputs a real/fake judgment.
the performance D(x) of the real data, so that D cannot distinguish generated data from real data. Therefore, the optimization of the module is a process of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) is finally consistent with D(x) on the real data and neither G nor D can be further optimized.
The training process can be modeled as a simple MinMax problem, as in (42):

\min_G \max_D\; D(x) - D(G(z))   (42)
The MinMax optimization formula is defined as follows:

\min_G \max_D V(D, G) = E_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + E_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]   (43)
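Both objectives can be written as simple batch statistics; the sketch below computes the WGAN-style critic objective of (42) and the log-loss value function of (43) from discriminator outputs, and illustrates the equilibrium where D outputs 0.5 everywhere:

```python
import numpy as np

def critic_objective(d_real, d_fake):
    """Batch estimate of the WGAN-style objective of (42): D maximizes
    D(x) - D(G(z))."""
    return np.mean(d_real) - np.mean(d_fake)

def d_loss(d_real, d_fake):
    """Discriminator side of the value function (43), as a loss to minimize."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    """Generator side: G minimizes log(1 - D(G(z)))."""
    return np.mean(np.log(1.0 - d_fake))

# At the equilibrium described above, D outputs 0.5 for every input and
# can no longer separate real from generated data.
half = np.full(8, 0.5)
```

At that equilibrium the critic objective of (42) is zero and the discriminator loss settles at 2 log 2 per sample, the well-known optimum of the standard GAN value function.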
The GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but learns to sample from it directly. Theoretically it can completely approximate the real data distribution, which is the biggest advantage of the GAN.
The training and testing process of the GAN-generated music dataset is shown in Figure 11.
The generator-produced chord-section data and specific music-style data, together with the generator-produced multitrack chord-section data and multitrack groove data, are sent to the GAN for training, so as to generate music with specific styles and the corresponding grooves.
5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experimental simulation was recorded in a fully anechoic room and, with comprehensive consideration of the factors discussed above, can well meet the actual needs of the speech conversion system. The voice library is recorded by a woman with a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., which serve as the source speech. Another person then recorded a small number of statements as the voice to be converted; Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. Figures 13, 14, and 15 show, respectively, the spectrograms of the source speaker's speech, the target speaker's speech, and the speech converted by the STRAIGHT algorithm and the GMM model. All voices are sampled at 16 kHz and quantized with 16 bits; the duration of each voice was set to 5 s during the experiment.
The three-dimensional MFCC maps of the source speech, the target speech, and the converted speech use the same layout: the horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs it can be directly seen that the shape of the converted MFCC parameters is closer to the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN. Random noise z is fed to the bar generators G to produce phrases G(z) for each track, organized by chords, style, and groove.
Figure 12: Tone control model. Training phase: STRAIGHT analysis extracts the fundamental frequency F0 and spectral envelope of the source and target voices; the MFCC parameters are time-aligned with DTW, and GMM training establishes the mapping rules, while the single-Gaussian model method computes the F0 means and variances. Conversion phase: the voice to be converted is analyzed by STRAIGHT, its MFCC parameters and fundamental frequency F0 are converted, and STRAIGHT synthesis outputs the converted voice.
5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and converted using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Operas produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental frequency control model and the duration control model; dura_SP uses the duration control model and the spectrum control model; all_models uses the three control models simultaneously; and "real" is the source Beijing Opera.
So the melody control model can be summarized inFigure 16
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based
Table 3: MOS grading

Score | Evaluation
1     | Uncomfortable and unbearable
2     | There is a sense of discomfort, but it can be endured
3     | Distortion can be detected and feels uncomfortable
4     | Slightly perceptible distortion, but no discomfort
5     | Good sound quality, no distortion
Table 4: Experimental results (MOS scores)

Method     | Beijing Opera 1 | Beijing Opera 2 | Beijing Opera 3
Only_dura  | 1.25            | 1.29            | 1.02
dura_F0    | 1.85            | 1.97            | 1.74
dura_SP    | 1.78            | 2.90            | 2.44
all_models | 3.27            | 3.69            | 3.28
real       | 5               | 5               | 5
Figure 13: Source speech spectrogram.
on human-oriented auditory evaluation. In the existing subjective evaluation systems, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.
The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and gives a score for the quality level to which the measured speech belongs according to these 5 levels. A MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. A MOS score lower than 3.0 is called synthetic speech quality: the speech then has high intelligibility but poor naturalness.
Ten testers were asked to give MOS scores for the above synthesis results; the results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment each word is first distinguished by its unique characteristics, such as zero-crossing rate and energy. Then the tone control model and the melody control model are designed; the important parameters such as fundamental frequency, spectrum, and duration are extracted and converted using MFCC, DTW, GMM, and other tools; and finally the opera fragments are synthesized.
Compared with other algorithms, the STRAIGHT algorithm performs better in terms of the naturalness of the synthesis and the range of parameter modification, so the STRAIGHT algorithm is also selected for the synthesis of the Beijing Opera.
Again, the above-mentioned 10 testers performed MOS scoring on the synthesis result; the result is shown in Table 5.
According to the test results, the subjective test reached an average of 3.7 points, indicating that the design basically completes Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from the real Beijing Opera.
6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are
Table 5: Rating results (MOS scores)

                         | Student 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Source opera fragment    | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
Synthetic opera fragment | 4 | 4 | 4 | 3 | 3 | 4 | 3 | 4 | 4 | 4
Figure 14: Target speech spectrogram.

Figure 15: Converted speech spectrogram.
Figure 16: Melody control model. Feature extraction from the voice yields the syllable fundamental frequency, the spectrum envelope, and the syllable duration; the MIDI score provides the note fundamental frequency and the note length. These feed the F0 control model, the spectrum control model, and the time-length control model, whose outputs are combined to synthesize the Beijing Opera.
truly able to compose Beijing Opera singing artworks of higher quality.
Data Availability
The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported STRAIGHT algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR", no. 3132017XNG1750; and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "A singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets."
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
Advances in Multimedia 5
Figure 3: Mel frequency filter bank.
Figure 4: "Jiao Zhang Sheng yin cang zai qi pan zhi xia," time-domain waveform (amplitude versus time, 0–6 s).
frequency characteristics of speech signals, while frequency-domain analysis lacks the variation of speech signals over time. As a result, the Beijing Opera synthesis experiment analyzed the speech signal using the improved method of spectrum analysis described below.
2.3.3. Spectrum Analysis. The Fourier analysis display of the speech signal is called a sonogram or spectrogram. A spectrogram is a three-dimensional representation of the frequency spectrum of a voice over time, with the vertical axis as frequency and the horizontal axis as time. The intensity of any given frequency component at a given moment is expressed by the grayness or hue of the corresponding point. The spectrogram shows a great deal of information related to the characteristics of the speech sentence: it combines the characteristics of spectra and time-domain waveforms to clearly show how the speech spectrum changes over time; that is, it is a dynamic spectrum. From the spectrogram we can obtain the formants, the fundamental frequency, and other parameters, as in Figure 5.
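To illustrate how such a spectrogram is computed, the following is a minimal numpy sketch (the function name and parameters are our own, not from the paper): frames are windowed, an FFT gives the per-frame magnitude spectrum, and the log-magnitude is what would be plotted as gray level or hue.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=128, fs=16000):
    """Log-magnitude spectrogram: rows are time frames, columns are frequencies."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.abs(np.fft.rfft(frames, axis=1))        # magnitude spectrum
    times = np.arange(n_frames) * hop / fs               # horizontal axis (s)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)       # vertical axis (Hz)
    return times, freqs, 20 * np.log10(spectra + 1e-10)  # dB, shown as gray/hue
```

For a pure tone, the frame-wise maximum of the returned matrix sits at the tone's frequency bin, which is exactly the formant/F0 reading described above.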
2.4. Straight Algorithm Introduction. STRAIGHT is an acronym for "Speech Transformation and Representation based on Adaptive Interpolation of weighted spectrogram." It is a relatively accurate method of speech analysis and synthesis proposed by the Japanese scholar Hideki Kawahara in 1997. The STRAIGHT algorithm builds on the source-filter model, in which the source comes from the vocal-cord vibration and the filter refers to the vocal-tract transfer function. It can adaptively interpolate and smooth the short-time speech spectrum in the time domain and the frequency domain so as to extract the spectral envelope more accurately, and it can adjust the speech duration, fundamental frequency, and spectral parameters to a great extent without affecting the quality of the synthesized speech. The STRAIGHT analysis-synthesis algorithm consists of three steps: fundamental frequency extraction, spectral
Figure 5: "Jiao Zhang Sheng yin cang zai qi pan zhi xia," spectrogram (frequency 0–6000 Hz versus time).
parameter estimation, and speech synthesis. The first two are described in detail later; only the synthesis process, shown in Figure 6, is described here.
First of all, the speech signal is input; the fundamental frequency F0 and the spectral envelope are extracted by the STRAIGHT algorithm, and the parameters are modulated to generate a new sound source and a time-varying filter. According to the source-filter model, the voice is synthesized using (10):
$$y(t)=\sum_{t_i\in Q}\frac{1}{\sqrt{G(f_0(t_i))}}\,v_{t_i}\left(t-T(t_i)\right) \tag{10}$$
where $v_{t_i}(t)$ and $T(t_i)$ are given by

$$v_{t_i}(t)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}V(\omega,t_i)\,\varphi(\omega)\,e^{j\omega t}\,d\omega \tag{11}$$

$$T(t_i)=\sum_{t_k\in Q,\;k<i}\frac{1}{\sqrt{G(f_0(t_k))}} \tag{12}$$
In the formula, Q represents the positions of a group of samples in the synthesis excitation, and G represents the pitch modulation. The modulated F0 can be matched arbitrarily with any F0 of the original speech. An all-pass filter is used to control the fine time structure of the pitch and the original signal, such as a frequency-proportional linear phase shift used to control the fine structure of F0. $V(\omega,t_i)$ is the Fourier transform of the corresponding minimum-phase pulse, as in (13); it is calculated from the modulation amplitude spectrum $A\{S[u(\omega),r(t)],u(\omega),r(t)\}$, where A, u, and r represent the modulation of amplitude, frequency, and time, respectively, as in (13), (14), and (15):
$$V(\omega,t)=\exp\left(\frac{1}{\sqrt{2\pi}}\int_{0}^{\infty}h_t(q)\,e^{j\omega q}\,dq\right) \tag{13}$$

$$h_t(q)=\begin{cases}0 & (q<0)\\ c_t(0) & (q=0)\\ 2c_t(q) & (q>0)\end{cases} \tag{14}$$

$$c_t(q)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty}e^{-j\omega q}\,\lg A\left\{S[u(\omega),r(t)],u(\omega),r(t)\right\}d\omega \tag{15}$$
Here q is the quefrency. STRAIGHT audiometry experiments show that, even with high-sensitivity headphones, the
Figure 6: Straight synthesis system. The input voice is analyzed into voice parameters: fundamental extraction yields F0, and spectral envelope extraction yields the envelope; after parameter adjustment, F0 drives the sound source, the envelope drives the time-varying filter, and the filter outputs the synthetic speech.
synthesized speech signal is almost indistinguishable from the original signal.
3. Tone Control Model
Voice tone conversion refers to the speech signal processing technology that keeps the semantic content the same but changes only the timbre, so that one person's voice (the source voice) sounds, after conversion, like another person's voice (the target voice). This chapter introduces the extraction of the parameters that are closely related to timbre by using the STRAIGHT algorithm; the extracted parameters are then trained with a GMM to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are synthesized with STRAIGHT in order to achieve voice conversion. It can be seen from Section 2 that the tone characteristics in speech mainly correspond to the parameters "fundamental frequency F0" and "channel spectrum."
3.1. The Fundamental Frequency and Channel Spectrum Extraction
3.1.1. Extraction of the Fundamental Frequency. The STRAIGHT algorithm has good time-domain resolution for the fundamental frequency trajectory. Based on the wavelet transform, it first finds the fundamental component in the extracted audio and then calculates the instantaneous frequency as the fundamental frequency.
Fundamental frequency extraction can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning of F0 refers to applying the wavelet transform to the voice signal to obtain the wavelet coefficients; the wavelet coefficients are then transformed into a set of instantaneous frequencies from which F0 is selected for each frame. F0 trajectory smoothing selects, based on the calculated high-frequency energy ratio (the equivalent of minimum noise energy), the most likely F0 among the candidate instantaneous frequencies and thus constitutes a smooth pitch trajectory. F0 fine positioning fine-tunes the current F0 through an FFT. The process is as follows.
The input signal is s(t) and the output composite signal is $D(t,\tau_c)$, where $g_{AG}(t)$ is the analyzing wavelet obtained by passing the input signal through a Gabor filter, and $\tau_c$ is the analysis period of the analyzing wavelet:

$$D(t,\tau_c)=\left|\tau_c\right|^{-1/2}\int_{-\infty}^{+\infty}s(\mu)\,g_{AG}\!\left(\frac{t-\mu}{\tau_c}\right)d\mu \tag{16}$$
$g_{AG}(t)$ is given by (17) and (18):

$$g_{AG}(t)=g\left(t-\frac{1}{4}\right)-g\left(t+\frac{1}{4}\right) \tag{17}$$

$$g(t)=e^{-\pi(t/\eta)^{2}}e^{-j2\pi t} \tag{18}$$
Here, $\eta$ is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.
Through this calculation, the variable "fundamentalness," denoted by $M(t,\tau_0)$, is introduced:

$$M=-\log\left[\int_{\Omega}\left(\frac{d|D|}{du}\right)^{2}du\right]+\log\left[\int_{\Omega}|D|^{2}\,du\right]-\log\left[\int_{\Omega}\left(\frac{d\arg D}{du}\right)^{2}du\right]+2\log\tau_{0}+\log\Omega(\tau_{0}) \tag{19}$$
The first term is the amplitude modulation (AM) value; the second term is the total energy used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency used to normalize the FM value; the fifth is the normalization factor of the time-domain integration interval. From the formula the following can be drawn: when AM and FM take their minimum, M takes its maximum, namely, at the fundamental component.
However, in practice F0 always changes rapidly, so in order to reduce the impact on M, the formula is adjusted as in (20), (21), and (22):
$$M=-\log\left[\int_{\Omega}\left(\frac{d|D|}{du}-\mu_{AM}\right)^{2}du\right]+\log\left[\int_{\Omega}|D|^{2}\,du\right]-\log\left[\int_{\Omega}\left(\frac{d\arg D}{du}-\mu_{FM}\right)^{2}du\right]+2\log\tau_{0}+\log\Omega(\tau_{0}) \tag{20}$$

$$\mu_{AM}=\frac{1}{\Omega}\int_{\Omega}\left(\frac{d|D|}{du}\right)du \tag{21}$$

$$\mu_{FM}=\frac{1}{\Omega}\int_{\Omega}\left(\frac{d^{2}\arg D}{du^{2}}\right)du \tag{22}$$
Finally, $\tau_0$ is used to calculate the instantaneous frequency $\omega(t)$, and the fundamental frequency F0 is obtained by (23), (24), and (25):

$$f_{0}=\frac{\omega_{0}(t)}{2\pi} \tag{23}$$

$$\omega(t)=2f_{s}\arcsin\frac{\left|y_{d}(t)\right|}{2} \tag{24}$$

$$y_{d}(t)=\frac{D\left(t+\Delta t/2,\tau_{0}\right)}{\left|D\left(t+\Delta t/2,\tau_{0}\right)\right|}-\frac{D\left(t-\Delta t/2,\tau_{0}\right)}{\left|D\left(t-\Delta t/2,\tau_{0}\right)\right|} \tag{25}$$
3.1.2. Channel Spectral Parameter Extraction. The previous method extracts the sound-source information and channel-spectrum information of the voice and then adjusts them to achieve voice adjustment. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound-source parameter p(t) is

$$s(t)=p(t)*v(t) \tag{26}$$
Since it is difficult to find v(t) directly, the STRAIGHT algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated as in (27) and (28):

$$s_{w}(t,t')=s(t)\,w(t-t') \tag{27}$$

$$S_{W}(\omega,t')=\mathrm{FFT}\left[s_{w}(t,t')\right]=S(\omega,t')*W(\omega,t') \tag{28}$$
The short-time spectrum shows periodicity related to the fundamental frequency in the time domain and the frequency domain, respectively. The short-time spectrum window function used is given by (29) and (30):

$$w(t)=\frac{1}{f_{0}}e^{-\pi(t f_{0})^{2}} \tag{29}$$

$$W(\omega)=f_{0}\sqrt{2\pi}\,e^{-\pi(\omega/\omega_{0})^{2}} \tag{30}$$
However, since both the channel spectrum and the sound-source spectrum are related to the fundamental frequency at this point, they cannot yet be considered separated. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.
Periodicity removal in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, respectively, as in (31), (32), and (33):

$$w_{p}(t)=e^{-\pi(t/\tau_{0})^{2}}*h\!\left(\frac{t}{\tau_{0}}\right) \tag{31}$$

$$h(t)=\begin{cases}1-|t| & (|t|<1)\\ 0 & (\text{otherwise})\end{cases} \tag{32}$$

$$w_{c}(t)=w_{p}(t)\sin\!\left(\frac{\pi t}{\tau_{0}}\right) \tag{33}$$
Then the short-time amplitude spectra $|S_{p}(\omega,t')|$ and $|S_{c}(\omega,t')|$ are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is

$$\left|S_{r}(\omega,t')\right|=\sqrt{\left|S_{p}(\omega,t')\right|^{2}+\xi\left|S_{c}(\omega,t')\right|^{2}} \tag{34}$$
Here, $\xi$ is the mixing factor; when $\xi = 0.13655$, the optimal solution is obtained.
Similarly, the frequency domain also needs a smoothing window $V(\omega)$ and a compensation window $U(\omega)$ to remove the periodicity in the short-time spectrum $S_{W}(\omega)$, finally yielding the periodicity-removed spectral envelope $S_{S'}(\omega)$:

$$S_{S'}(\omega)=S_{W}(\omega)*V(\omega)*U(\omega) \tag{35}$$
Finally, logarithmic amplitude compression and a warped-frequency discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).
3.2. GMM-Based Parameter Conversion

3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions:

$$P\left(X\mid\lambda\right)=\sum_{i=1}^{M}\omega_{i}b_{i}(X) \tag{36}$$
where X is an n-dimensional random vector, $\omega_{i}$ is a mixture weight with $\sum_{i=1}^{M}\omega_{i}=1$, and $b_{i}(X)$ is a subdistribution of the GMM; each subdistribution is a Gaussian distribution:

$$b_{i}(X)=\frac{1}{(2\pi)^{n/2}\left|\Sigma_{i}\right|^{1/2}}\,e^{-(1/2)(X-\mu_{i})^{T}\Sigma_{i}^{-1}(X-\mu_{i})} \tag{37}$$

where $\mu_{i}$ is the mean vector and $\Sigma_{i}$ is the covariance matrix.
Although the types of phonemes are definite, each phoneme varies in different situations due to context. We use a GMM to model the acoustic characteristics of the speaker and find the most likely mapping at each time.
3.2.2. Establishing the Conversion Function. GMM training estimates the probability density distribution of the samples, and the estimated model is the weighted sum of several Gaussian models. It maps the feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.
(1) Conversion of the Fundamental Frequency. Here the single Gaussian model method is used to convert the fundamental frequency; the converted fundamental frequency is obtained from the mean and variance of the target speaker $(\mu_{tgt},\sigma_{tgt})$ and the source speaker $(\mu_{src},\sigma_{src})$ as in (38):

$$f_{0conv}(t)=\sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\times f_{0src}(t)+\mu_{tgt}-\sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}}\times\mu_{src} \tag{38}$$
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as
$$F(X)=E\left[Y\mid X\right]=\int Y\,P\left(Y\mid X\right)dY=\sum_{i=1}^{M}P_{i}(X)\left[\mu_{i}^{Y}+\Sigma_{i}^{YX}\left(\Sigma_{i}^{XX}\right)^{-1}\left(X-\mu_{i}^{X}\right)\right] \tag{39}$$

$$P_{i}(X)=\frac{\omega_{i}b_{i}(X_{t})}{\sum_{k=1}^{M}\omega_{k}b_{k}(X_{t})} \tag{40}$$

$$\mu_{i}=\begin{bmatrix}\mu_{i}^{X}\\ \mu_{i}^{Y}\end{bmatrix},\qquad \Sigma_{i}=\begin{bmatrix}\Sigma_{i}^{XX} & \Sigma_{i}^{XY}\\ \Sigma_{i}^{YX} & \Sigma_{i}^{YY}\end{bmatrix},\qquad i=1,\dots,M \tag{41}$$
$\mu_{i}^{X}$ and $\mu_{i}^{Y}$ are the means of the i-th Gaussian component for the source speaker and the target speaker, and $\Sigma_{i}^{XY}$ is the
Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia," fundamental frequency envelope.
cross-covariance matrix of the i-th Gaussian component between the source speaker and the target speaker. $P_{i}(X)$ is the posterior probability that the feature vector X belongs to the i-th Gaussian component of the GMM.
4. Melody Control Model
The composition of Beijing Opera has similarities with the synthesis of general singing voice [7, 8]; that is, through the superimposition of voice and melody, the new pitch of each word is reconstructed. Through the analysis in the second chapter, it is found that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody: it indicates the frequency of human vocal-fold vibration. The duration, the pronounced length of each word, controls the rhythm of Beijing Opera, which represents the speed of the human voice. Energy is positively correlated with sound intensity and represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the prosody, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody are in the fundamental frequency. The fundamental frequency envelope of a Beijing Opera piece corresponds to the melody, which includes tone, pitch, and tremolo [9]. However, the pitch of a note in a score is a constant; their comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. Therefore, the control design of the fundamental frequency [10] is as in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has different syllables, and the initials and vowels in each syllable
Figure 8: The control design of the fundamental frequency. The fundamental frequency extracted from MIDI undergoes vibrato processing; Gaussian white noise passes through a high-pass filter; together they produce the output fundamental frequency curve.
Table 2: Duration parameters.

    Before modification    After modification
    dur_a                  k * dur_a
    dur_b                  dur_b
    dur_c                  dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: initial-to-vowel transition part duration; dur_c: final (vowel) part duration; dur_t: target total duration.
also play different roles. The initials, whether in normal speech or Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word and make the rules for the vowel length shown in Table 2.
The duration of the initial part is modified according to the proportion k in the table, which was obtained from many comparison experiments between speech and song. The duration of the transition region from initial to vowel remains unchanged. The length of the vowel section varies so that the total duration of the syllable corresponds to the duration of each note in the score.
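The rules of Table 2 amount to simple arithmetic; this sketch (function name and the example value of k are our own assumptions, since the paper does not state k) scales the initial, keeps the transition, and lets the vowel absorb the remainder:

```python
def adjust_durations(dur_a, dur_b, dur_t, k=0.8):
    """Table 2 duration rules: the initial part is scaled by k, the
    initial-to-vowel transition is kept, and the vowel part absorbs the
    remainder so the syllable matches the note duration dur_t."""
    new_a = k * dur_a
    new_b = dur_b
    new_c = dur_t - new_a - new_b   # vowel stretches or shrinks to fit
    return new_a, new_b, new_c
```

By construction the three new durations always sum to the target note duration dur_t, which is exactly the constraint the time control model enforces.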
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that in well-trained singing the spectrum has a special resonance peak in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model
4.4.1. Introduction of the GAN Network. Generative adversarial networks, abbreviated as GAN [12-17], are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: input random noise, obtain pseudo data from the generator, take a part of the real data from the true data, mix the two, and send the data to the discriminator, which gives a true-or-false determination; the loss is returned according to this result. The purpose of a GAN is to estimate the potential distribution of data samples and generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing and have a huge application prospect. This study uses a GAN to synthesize music to compose Beijing Opera accompaniment.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, it is necessary to solve this data-imbalance problem by merging the sound tracks of similar instruments. Each of the multitrack Beijing Opera scores is merged into five musical instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.
Then we filter the datasets after merging the tracks to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the scores of the parts of Beijing Opera without lyrics are not what we need; we therefore select only the soundtracks that accompany Beijing Opera lyrics.
Finally, in order to obtain meaningful music segments to train the time model, it is necessary to divide the Beijing Opera scores into corresponding music segments. We treat 4 bars as a passage and cut longer passages into this length. Because pitches that are too high or too low are not common, pitches lower than C1 or higher than C8 are discarded; the target output tensor is thus 4 (bar) x 96 (time step) x 84 (pitch) x 5 (track). This completes the preparation and preprocessing of the dataset.
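The 4 x 96 x 84 x 5 target tensor can be sketched as a binary piano roll; in the following numpy sketch the function name, event format, and the MIDI numbering (C1 = 24 under the C4 = 60 convention, giving 84 pitches up to B7) are our own assumptions, not details stated in the paper:

```python
import numpy as np

def phrase_tensor(notes, steps_per_bar=96, n_bars=4,
                  low_pitch=24, n_pitches=84, n_tracks=5):
    """Build the 4 x 96 x 84 x 5 binary piano-roll tensor described above.
    `notes` is a list of (track, bar, step, midi_pitch) onset events;
    pitches outside the 84-semitone range are dropped."""
    roll = np.zeros((n_bars, steps_per_bar, n_pitches, n_tracks), dtype=np.uint8)
    for track, bar, step, pitch in notes:
        idx = pitch - low_pitch
        if 0 <= idx < n_pitches and bar < n_bars and step < steps_per_bar:
            roll[bar, step, idx, track] = 1
    return roll
```

A single middle-C onset on the first track lands at pitch index 60 - 24 = 36, and out-of-range pitches are silently dropped, mirroring the C1-C8 truncation described above.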
4.4.3. GAN Structure and Dataset Training and Testing. The GAN structure diagram used in this study is shown in Figure 10.
The basic framework of a GAN includes a pair of models: a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution using the generator G, aided by the discriminator D. The input of the model is a random Gaussian white-noise signal z; the noise signal is mapped to a new data space by the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the inputs of the true data x and the generated data G(z), respectively, indicating D's confidence in judging whether the input is real data or generated fake data. In this way, it is judged whether the data generated by G are good or bad. When D can no longer distinguish between the real data x and the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish between real data and fake data, so that D(x) is as large as possible while D(G(z)) is as small as possible, with the difference between the two as large as possible. The goal of G, conversely, is to make the score D(G(z)) of its own generated data consistent with
Figure 9: Illustration of the dataset preparation and data preprocessing procedure: the Beijing Opera soundtrack datasets are merged into five tracks (huqins, flutes, suonas, drums, cymbals); the combined datasets are screened to keep only the sections with the best matching confidence and the soundtracks of Beijing Opera lyrics; data cleaning then yields the training datasets.
Figure 10: GAN structure diagram. Random noise z ~ p(z) feeds the generator G, producing fake data G(z) (4-bar phrases of 5 tracks); the discriminator/critic (WGAN-GP) receives the fake data G(z) and the real data x and outputs a real/fake decision.
the score D(x) of the real data, so that D cannot distinguish between generated data and real data. Therefore, the optimization process of the model is a process of mutual competition and confrontation. The performance of G and D is continuously improved during repeated iteration until D(G(z)) is finally consistent with D(x) on the real data and neither G nor D can be further optimized.
The training process can be modeled as a simple minimax problem:

$$\min_{G}\max_{D}\;D(x)-D(G(z)) \tag{42}$$
The minimax optimization formula is defined as follows:

$$\min_{G}\max_{D}V(D,G)=E_{x\sim p_{data}(x)}\left[\log D(x)\right]+E_{z\sim p_{z}(z)}\left[\log\left(1-D(G(z))\right)\right] \tag{43}$$
The GAN does not require a preset data distribution; that is, it does not need to formulate a description of p(x) but samples from it directly. Theoretically, it can completely approximate the real data. This is the biggest advantage of the GAN.
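The value function of (43) is easy to evaluate for fixed samples; the following numpy sketch (function name is our own; the discriminator is assumed to return probabilities in (0, 1)) makes the optimum explicit:

```python
import numpy as np

def gan_value(discriminator, real, fake):
    """Value function V(D, G) of eq. (43):
    E_x[log D(x)] + E_z[log(1 - D(G(z)))], estimated over sample batches."""
    d_real = discriminator(real)
    d_fake = discriminator(fake)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```

When the discriminator cannot tell the two apart and outputs 0.5 everywhere, the value is 2 log(1/2) = -2 log 2, which is the equilibrium described above where G is considered optimal.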
The training and testing process of the GAN-generated music dataset is shown in Figure 11.
The generator-generated chord-section data, the specific music-style data, the multitrack chord-section data, and the multitrack music groove data are sent to the GAN for training, so as to generate music with a specific style and the corresponding groove.
5. Experiment

5.1. Tone Control Model
5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in an anechoic room; with comprehensive consideration of the aforementioned factors, it can well meet the actual needs of the speech conversion system. The voice library was recorded by a woman in a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., as the source speech. Another person then recorded a small number of statements as the voice to be converted; Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. Figures 13, 14, and 15 show the spectrograms of the speech of the source speaker, the speech of the target speaker, and the converted speech obtained with the STRAIGHT and GMM models, respectively. All voices are sampled at 16 kHz and quantized with 16 bits; the voice duration was set to 5 s during the experiment.
These figures show three-dimensional maps of the source speech, the target speech, and the converted speech: the horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the figures, it can be directly seen that the spectrogram shape of the converted MFCC parameters is closer to the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN. Random noise vectors z are fed to the bar generator G to produce bars G(z), conditioned on chords, style, and groove inputs.
Figure 12: Tone control model. Training phase: the source voice and the target voice are analyzed by STRAIGHT into fundamental frequency F0 and spectral envelope; after DTW time alignment, GMM training establishes the mapping rules, and the single Gaussian model method calculates the mean and variance for tone conversion. Conversion phase: the voice to be converted is analyzed by STRAIGHT, its MFCC parameters and fundamental frequency F0 are converted, and STRAIGHT synthesis produces the converted voice.
5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing, followed by conversions using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Opera pieces produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis, dura_F0 uses only the fundamental frequency control model and the duration control model for synthesis, and dura_SP uses only the
duration control model and the spectrum control model for synthesis; all_models uses the three control models simultaneously. "Real" is the source Beijing Opera.
The melody control model can thus be summarized as in Figure 16.
5.2.2. Experimental Results. The purpose of voice conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the voice conversion system is also based
Table 3: MOS grading.

    Score    Evaluation
    1        Uncomfortable and unbearable
    2        There is a sense of discomfort, but it can be endured
    3        Distortion can be detected and feels uncomfortable
    4        Slightly perceived distortion, but no discomfort
    5        Good sound quality, no distortion
Table 4: Experimental results (MOS scores).

    Method        Beijing Opera 1    Beijing Opera 2    Beijing Opera 3
    Only_dura     1.25               1.29               1.02
    dura_F0       1.85               1.97               1.74
    dura_SP       1.78               2.90               2.44
    all_models    3.27               3.69               3.28
    real          5                  5                  5
Figure 13: Source speech spectrogram (frequency 0-6000 Hz versus time).
on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and, according to these 5 levels, gives the score of the quality level to which the measured speech belongs. A MOS score of about 3.5 is called communication quality: the quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. A MOS score lower than 3.0 is called synthetic speech quality: the speech has high intelligibility, but the naturalness is poor.
Ten testers were asked to give MOS scores for the above synthesis results. The results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is distinguished by its unique characteristics, such as zero-crossing rate and energy. Then the tone control model and the melody control model are designed, and the important parameters such as the fundamental frequency, spectrum, and duration are extracted; MFCC, DTW, GMM, and other tools are used to analyze and convert the extracted characteristics, finally yielding the synthetic opera fragments.
Compared with other algorithms, the STRAIGHT algorithm has better performance in terms of the naturalness of the synthesis and the range of parameter modification, so the STRAIGHT algorithm is also selected for the synthesis of the Beijing Opera.
Again, the above-mentioned 10 testers performed MOS scoring on the above synthesis results. The results are shown in Table 5.
According to the test results, the subjective test reached an average of 3.7 points, indicating that the design basically completed the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system tends toward the original Beijing Opera, it is still acoustically different from real Beijing Opera.
6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm in machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric for the quality of a piece; only then will we be able to train models that are
truly able to compose Beijing Opera singing artworks with higher quality.
Data Availability
The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [Zenodo] repository (http://doi.org/10.5281/zenodo.344932). The previously reported STRAIGHT algorithm is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR", no. 3132017XNG1750; and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
6 Advances in Multimedia
Figure 6: STRAIGHT synthesis system (input voice → F0/fundamental extraction → sound source; spectral envelope extraction → parameter adjustment → time-varying filter driven by the voice parameters → output synthetic speech).
synthesized speech signal is almost indistinguishable fromthe original signal
3. Tone Control Model
Voice tone conversion refers to speech signal processing that keeps the semantic content of an utterance unchanged while changing only the timbre, so that one person's voice signal (the source voice) sounds like another person's voice (the target voice) after conversion. This chapter introduces the extraction of the parameters closely related to timbre using the STRAIGHT algorithm, and then the training of a GMM on the extracted parameters to obtain the correspondence between the source voice and the target voice. Finally, the new parameters are resynthesized with STRAIGHT to achieve voice conversion. As seen in Section 2, the tone characteristics of speech mainly correspond to the parameters "fundamental frequency F0" and "channel (vocal tract) spectrum".
3.1. The Fundamental Frequency and Channel Spectrum Extraction
3.1.1. Extraction of the Fundamental Frequency. The STRAIGHT algorithm offers good time-domain resolution for the fundamental frequency trajectory. Based on the wavelet transform, it first locates the frequency band containing the fundamental in the analyzed audio and then calculates the instantaneous frequency as the fundamental frequency.
Extraction of the fundamental can be divided into three parts: F0 coarse positioning, F0 trajectory smoothing, and F0 fine positioning. Coarse positioning applies the wavelet transform to the voice signal to obtain wavelet coefficients, which are then converted into a set of instantaneous frequencies from which F0 is selected for each frame. Trajectory smoothing selects the most likely F0 among these instantaneous frequencies based on the calculated high-frequency energy ratio (the minimum-noise-energy equivalent), thus forming a smooth pitch trajectory. Fine positioning fine-tunes the current F0 through an FFT. The process is as follows.
The input signal is s(t) and the output analytic signal is D(t, \tau_c), where g_{AG}(t) is the analyzing wavelet obtained by passing the input through a Gabor filter and \tau_c is the analysis period of the wavelet:

D(t, \tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{+\infty} s(\mu) \, g_{AG}\left(\frac{t - \mu}{\tau_c}\right) d\mu  (16)

g_{AG}(t) is given by (17) and (18):

g_{AG}(t) = g\left(t - \frac{1}{4}\right) - g\left(t + \frac{1}{4}\right)  (17)

g(t) = e^{-\pi (t/\eta)^2} e^{-j 2\pi t}  (18)

Among them, \eta is the frequency resolution of the Gabor filter, which is usually larger than 1 according to the characteristics of the filter.
Through this analysis the variable "fundamentalness", denoted M(t, \tau_0), is introduced:

M(t, \tau_0) = -\log\left[\int_\Omega \left(\frac{d|D|}{du}\right)^2 du\right] + \log\left[\int_\Omega |D|^2 \, du\right] - \log\left[\int_\Omega \left(\frac{d \arg D}{du}\right)^2 du\right] + 2 \log \tau_0 + \log \Omega(\tau_0)  (19)
The first term is the amplitude modulation (AM) value; the second term is the total energy, used to normalize the AM value; the third term is the frequency modulation (FM) value; the fourth term is the square of the fundamental frequency, used to normalize the FM value; and the fifth is the normalization factor of the time-domain integration interval. From the formula it follows that when the AM and FM terms take their minimum, M takes its maximum, which identifies the fundamental component.
However, in practice F0 always changes rapidly, so in order to reduce the impact on M the formula is adjusted as in (20), (21), and (22).
M = -\log\left[\int_\Omega \left(\frac{d|D|}{du} - \mu_{AM}\right)^2 du\right] + \log\left[\int_\Omega |D|^2 \, du\right] - \log\left[\int_\Omega \left(\frac{d \arg D}{du} - \mu_{FM}\right)^2 du\right] + 2 \log \tau_0 + \log \Omega(\tau_0)  (20)

\mu_{AM} = \frac{1}{\Omega} \int_\Omega \frac{d|D|}{du} \, du  (21)

\mu_{FM} = \frac{1}{\Omega} \int_\Omega \frac{d^2 \arg D}{du^2} \, du  (22)
Finally, \tau_0 is used to calculate the instantaneous frequency \omega(t), and the fundamental frequency F0 is obtained by (23), (24), and (25):

f_0 = \frac{\omega_0(t)}{2\pi}  (23)

\omega(t) = 2 f_s \arcsin \frac{|y_d(t)|}{2}  (24)

y_d(t) = \frac{D(t + \Delta t/2, \tau_0)}{|D(t + \Delta t/2, \tau_0)|} - \frac{D(t - \Delta t/2, \tau_0)}{|D(t - \Delta t/2, \tau_0)|}  (25)
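As a concrete illustration, the instantaneous-frequency step of (23)-(25) can be sketched in a few lines of NumPy. This is our own minimal sketch, not the STRAIGHT implementation; it assumes the complex wavelet output D is already available as uniformly sampled values at rate f_s, so the finite difference of the unit-normalized samples plays the role of y_d(t).

```python
import numpy as np

def instantaneous_f0(D, fs):
    """Estimate F0 from a complex analytic signal D sampled at rate fs,
    following Eqs. (23)-(25): phase difference of the unit-normalized
    signal, converted to angular frequency via arcsin."""
    u = D / np.abs(D)                              # D / |D|, unit phasors
    yd = u[1:] - u[:-1]                            # Eq. (25), difference over one sample
    omega = 2.0 * fs * np.arcsin(np.abs(yd) / 2.0) # Eq. (24)
    return omega / (2.0 * np.pi)                   # Eq. (23): f0 = omega / (2*pi)
```

For a pure complex tone the chord length |y_d|/2 equals sin(pi*f/fs), so the arcsin recovers the tone frequency exactly up to the Nyquist limit.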
3.1.2. Channel Spectral Parameter Extraction. The previous approach extracts the sound source information and the channel (vocal tract) spectrum information from the voice and then adjusts them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which degrades the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is

s(t) = p(t) * v(t)  (26)
Since it is difficult to find v(t) directly, the STRAIGHT algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is computed by (27) and (28):

s_w(t, t') = s(t) \, w(t, t')  (27)

S_W(\omega, t') = \mathrm{FFT}[s_w(t, t')] = S(\omega, t') * W(\omega, t')  (28)
The short-time spectrum shows the periodicity related to the fundamental frequency in the time domain and the frequency domain, respectively. The window function used for the short-time spectrum is given by (29) and (30):

w(t) = \frac{1}{f_0} e^{-\pi (t f_0)^2}  (29)

W(\omega) = f_0 \sqrt{2\pi} \, e^{-\pi (\omega/\omega_0)^2}  (30)
However, since both the channel spectrum and the sound source spectrum are related to the fundamental frequency, at this point they cannot be considered separated. Instead, the periodicity needs to be further removed in the time domain and the frequency domain to achieve the separation.
Periodicity removal in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, as in (31), (32), and (33):

w_p(t) = e^{-\pi (t/\tau_0)^2} * h\left(\frac{t}{\tau_0}\right)  (31)

h(t) = \begin{cases} 1 - |t| & (|t| < 1) \\ 0 & (\text{otherwise}) \end{cases}  (32)

w_c(t) = w_p(t) \sin\left(\pi \times \frac{t}{\tau_0}\right)  (33)
Then the short-time amplitude spectra |S_p(\omega, t')| and |S_c(\omega, t')| are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is

|S_r(\omega, t')| = \sqrt{|S_p(\omega, t')|^2 + \xi \, |S_c(\omega, t')|^2}  (34)

Among them, \xi is the mixing factor; \xi = 0.13655 gives the optimal solution.
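The windowing and blending steps of (31)-(34) can be sketched as follows. This is an illustrative approximation: the discrete convolution and the peak normalization of w_p are our assumptions, not part of the paper, and S_p, S_c stand for the amplitude spectra obtained with the smoothing and compensation windows.

```python
import numpy as np

def pitch_sync_windows(t, tau0):
    """Smoothing window w_p (Eq. (31)) and compensation window w_c (Eq. (33))
    on a uniform time grid t, for pitch period tau0."""
    h = np.where(np.abs(t / tau0) < 1.0, 1.0 - np.abs(t / tau0), 0.0)  # Eq. (32)
    g = np.exp(-np.pi * (t / tau0) ** 2)
    wp = np.convolve(g, h, mode="same")   # Eq. (31): Gaussian convolved with h
    wp /= wp.max()                        # peak-normalize (our choice, for display)
    wc = wp * np.sin(np.pi * t / tau0)    # Eq. (33)
    return wp, wc

def remove_periodicity(Sp, Sc, xi=0.13655):
    """Eq. (34): blend the two windowed amplitude spectra with mixing factor xi."""
    return np.sqrt(np.abs(Sp) ** 2 + xi * np.abs(Sc) ** 2)
```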
Similarly, the frequency domain also needs a smoothing window V(\omega) and a compensation window U(\omega) to remove the periodicity from the short-time spectrum S_W(\omega), finally yielding the spectral envelope S_{S'}(\omega) with the periodicity removed:

S_{S'}(\omega) = S_W(\omega) * V(\omega) * U(\omega)  (35)
Finally, logarithmic amplitude compression and a discrete cosine transform on the warped (mel) frequency scale convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).
3.2. Parameter Conversion with GMM

3.2.1. GMM Profile. The Gaussian mixture model (GMM) [6] can be expressed as a linear combination of different Gaussian probability density functions:

P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i \, b_i(X)  (36)

where X is an n-dimensional random vector, \omega_i is a mixture weight with \sum_{i=1}^{M} \omega_i = 1, and b_i(X) is a sub-distribution of the GMM; each sub-distribution is a Gaussian:

b_i(X) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_i|^{1/2}} \, e^{-(1/2)(X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i)}  (37)

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.
Although the set of phoneme types is fixed, each phoneme varies with context. We use a GMM to model the speaker's acoustic characteristics and find the most likely mapping at each time instant.
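A direct NumPy evaluation of the mixture density (36)-(37) might look like the sketch below. It is illustrative only; a practical system would fit the weights, means, and covariances with EM (for example via an off-the-shelf GMM library) rather than evaluate a hand-built model.

```python
import numpy as np

def gmm_pdf(X, weights, means, covs):
    """Evaluate P(X | lambda) of Eq. (36) with Gaussian components b_i of Eq. (37).
    X: (n,) vector; weights: (M,); means: (M, n); covs: (M, n, n)."""
    n = X.shape[0]
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = X - mu
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
        expo = -0.5 * diff @ np.linalg.inv(cov) @ diff
        total += w * np.exp(expo) / norm    # w_i * b_i(X)
    return total
```

With a single zero-mean, unit-variance component in one dimension this reduces to the standard normal density.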
3.2.2. Establishing the Conversion Function. GMM training estimates the probability density of the samples, and the estimated model is a weighted sum of several Gaussian components. It maps the feature matrices of the source speech to those of the target speech, thereby increasing the accuracy and robustness of the algorithm and establishing the connection between the two voices.
(1) Conversion of the Fundamental Frequency. Here the single-Gaussian method is used to convert the fundamental frequency: the converted fundamental frequency is obtained from the mean and variance of the target speaker (\mu_{tgt}, \sigma_{tgt}) and the source speaker (\mu_{src}, \sigma_{src}) as in (38):

f_{0,conv}(t) = \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \times f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \times \mu_{src}  (38)
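Equation (38) amounts to matching the mean and variance of the source F0 statistics to those of the target. A minimal sketch follows, in the linear F0 domain as written in the paper (practical systems often apply the same rule to log F0 instead):

```python
import numpy as np

def convert_f0(f0_src, mu_src, var_src, mu_tgt, var_tgt):
    """Single-Gaussian F0 conversion of Eq. (38): scale by the ratio of the
    standard deviations and shift from the source mean to the target mean."""
    ratio = np.sqrt(var_tgt / var_src)
    return ratio * f0_src + mu_tgt - ratio * mu_src
```

When the two variances are equal the rule reduces to a plain mean shift, f0 - mu_src + mu_tgt.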
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as (39), with the posterior probabilities (40) and the joint statistics (41):

F(X) = E[Y \mid X] = \int Y \, P(Y \mid X) \, dY = \sum_{i=1}^{M} P_i(X) \left[ \mu_i^Y + \Sigma_i^{YX} \left( \Sigma_i^{XX} \right)^{-1} (X - \mu_i^X) \right]  (39)

P_i(X) = \frac{\omega_i \, b_i(X_t)}{\sum_{k=1}^{M} \omega_k \, b_k(X_t)}  (40)

\mu_i = \begin{bmatrix} \mu_i^X \\ \mu_i^Y \end{bmatrix}, \quad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \quad i = 1, \ldots, M  (41)
\mu_i^X and \mu_i^Y are the means of the i-th Gaussian component for the source speaker and the target speaker, \Sigma_i^{XX} is the covariance matrix of the i-th Gaussian component of the source speaker, \Sigma_i^{YX} is the cross-covariance matrix of the i-th Gaussian component between the target and source speakers, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: Fundamental frequency envelope of "jiao Zhang Sheng yin cang zai qi pan zhi xia".
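The regression (39)-(41) can be sketched for an already-trained GMM as follows. This is an illustrative implementation with our own argument layout; it assumes the joint means and the covariance blocks have been estimated elsewhere.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Spectral mapping F(X) of Eq. (39) with posteriors P_i(X) of Eq. (40).
    weights: (M,); mu_x, mu_y: (M, n); cov_xx, cov_yx: (M, n, n)."""
    M, n = mu_x.shape
    # Component likelihoods w_i * b_i(x) for the posterior of Eq. (40)
    like = np.empty(M)
    for i in range(M):
        diff = x - mu_x[i]
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov_xx[i]))
        like[i] = weights[i] * np.exp(-0.5 * diff @ np.linalg.inv(cov_xx[i]) @ diff) / norm
    post = like / like.sum()              # P_i(x), Eq. (40)
    # Weighted per-component linear regression, Eq. (39)
    y = np.zeros(n)
    for i in range(M):
        y += post[i] * (mu_y[i] + cov_yx[i] @ np.linalg.inv(cov_xx[i]) @ (x - mu_x[i]))
    return y
```

With a single component and equal covariance blocks the mapping degenerates to a simple mean shift, which is a convenient sanity check.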
4. Melody Control Model
The composition of Beijing Opera is similar to the synthesis of a general singing voice [7, 8]: the new pitch of each word is reconstructed by superimposing the voice and the melody. The analysis in Section 2 shows that the major factors affecting the melody are the fundamental frequency, the duration, and the energy. Among them, the fundamental frequency has the greatest impact on the melody; it reflects the vibration frequency of the human vocal folds. The duration, the pronounced length of each word, controls the rhythm of Beijing Opera and represents the speed of the voice. Energy is positively correlated with sound intensity and represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the semantic content, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental envelope of a Beijing Opera piece corresponds to the melody, which includes tone, pitch, and tremolo [9]. The pitch within a single note, however, is constant; the comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. Therefore, the control design of the fundamental frequency [10] is as shown in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has several syllables, and the initials and vowels in each syllable also play different roles. The initials, whether in normal speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word and set the rules for the vowel length shown in Table 2.

Figure 8: The control design of the fundamental frequency (the fundamental frequency extracted from MIDI passes through vibrato processing, Gaussian white noise passes through a high-pass filter, and the two are combined into the output basic frequency curve).

Table 2: Duration parameters

Before modification    After modification
dur_a                  k * dur_a
dur_b                  dur_b
dur_c                  dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: duration of the initial-to-vowel transition part; dur_c: final (vowel) part duration; dur_t: target total duration.

The duration of the initial part is modified in the proportion k of the table [11]; k is obtained from many comparison experiments between speech and song. The duration of the initial-to-vowel transition region remains unchanged. The length of the vowel section varies so that the total duration of each syllable corresponds to the duration of its note in the score.
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
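The rules of Table 2 reduce to one line of arithmetic: scale the initial part, keep the transition, and let the vowel absorb the remainder so the syllable matches the note. A sketch (the function name and argument order are ours):

```python
def apply_duration_rule(dur_a, dur_b, dur_t, k):
    """Duration rules of Table 2: the initial part is scaled by k, the
    initial-to-vowel transition is kept unchanged, and the vowel part
    absorbs the rest so the syllable matches the note duration dur_t."""
    new_a = k * dur_a                 # modified initial duration
    new_b = dur_b                     # transition unchanged
    new_c = dur_t - new_a - new_b     # dur_c = dur_t - k*dur_a - dur_b
    return new_a, new_b, new_c
```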
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that a good singing voice has a special resonance peak in the vicinity of 2.5-3 kHz, and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model
4.4.1. Introduction of the GAN Network. Generative adversarial networks (GANs) [12-17] are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process is as follows: random noise is input; the generator produces pseudo data from it; a portion of real data is taken from the true data; the two are mixed and sent to the discriminator, which gives a true-or-false determination; and the loss is propagated back according to this result. The purpose of a GAN is to estimate the potential distribution of the data samples and generate new samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing and have huge application prospects. This study uses a GAN to synthesize the music for Beijing Opera.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study consists of 5000 recorded and collected Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, this data imbalance problem is solved by merging the sound tracks of similar instruments. Each of the multi-track Beijing Opera scores is merged into five instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used musical instruments in Beijing Opera music.
Then the merged-track datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the parts of the scores without lyrics are not what we need; only the soundtracks that carry Beijing Opera lyrics are selected.
Finally, in order to obtain meaningful music segments for training the time model, the Beijing Opera scores need to be divided into music segments. Four bars are treated as one passage, and longer passages are cut into pieces of the appropriate length. Because pitches that are too high or too low are not common, pitches lower than C1 or higher than C8 are discarded; the target output tensor is therefore 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track). This completes the preparation and preprocessing of the dataset.
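The target tensor can be constructed directly. The sketch below assumes a binary piano-roll encoding and that pitch row 0 corresponds to C1 (MIDI note 24); the paper does not state the mapping explicitly, so that offset is our assumption.

```python
import numpy as np

# Target shape from the paper: 4 bars x 96 time steps x 84 pitches (C1..B7) x 5 tracks
BARS, STEPS, PITCHES, TRACKS = 4, 96, 84, 5

def empty_phrase():
    """Binary piano-roll tensor for one 4-bar phrase."""
    return np.zeros((BARS, STEPS, PITCHES, TRACKS), dtype=bool)

def set_note(phrase, bar, step_on, step_off, midi_pitch, track):
    """Turn a note on; pitch rows are offset so C1 (MIDI 24) maps to row 0
    (a hypothetical mapping, for illustration)."""
    phrase[bar, step_on:step_off, midi_pitch - 24, track] = True
    return phrase
```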
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure used in this study is shown in Figure 10.
The basic framework of the GAN comprises a pair of models: a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution via the generator G, aided by the discriminator D. The input of the model is a random Gaussian white noise signal z; the noise is mapped into a new data space by the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value for the real data x and the generated data G(z), respectively, indicating D's confidence that the input is real data rather than generated data. In this way the quality of the data generated by G is judged. When D can no longer distinguish the real data x from the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish real data from generated data, making D(x) as large as possible, D(G(z)) as small as possible, and the gap between the two as large as possible. The goal of G, by contrast, is to make the response D(G(z)) of its generated data consistent with the response D(x) of the real data, so that D cannot distinguish generated data from real data. The optimization of the two modules is therefore a process of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) is finally consistent with D(x), at which point neither G nor D can be further optimized.

Figure 9: Illustration of the dataset preparation and data preprocessing procedure (Beijing Opera soundtracks → merge 5 tracks: huqins, flutes, suonas, drums, cymbals → combined datasets → screening and data cleaning, selecting only the sections with the best matching confidence and the soundtracks of Beijing Opera lyrics → training datasets).

Figure 10: GAN structure diagram (random noise z ~ p(z) → generator G → fake data G(z); the real data x and G(z) are fed to the discriminator/critic (WGAN-GP), which outputs real/fake; the data are 4-bar phrases of 5 tracks).
The training process can be modeled as a simple minimax problem:

\min_G \max_D \; D(x) - D(G(z))  (42)

The minimax optimization is defined as follows:

\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]  (43)
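The two objectives can be written down directly from (42) and (43). The sketch below evaluates them for given discriminator/critic outputs, with batch averages standing in for the expectations; it is an illustration of the loss terms only, not a training loop.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) of Eq. (43): E[log D(x)] + E[log(1 - D(G(z)))],
    for discriminator outputs in (0, 1)."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def critic_objective(c_real, c_fake):
    """Eq. (42) with the WGAN-style critic of Figure 10: D(x) - D(G(z)) on
    raw (unbounded) critic scores; the critic maximizes this, G minimizes it."""
    return np.mean(c_real) - np.mean(c_fake)
```

At the equilibrium described in the text, D outputs 0.5 everywhere, where V(D, G) equals -2 log 2.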
The GAN does not require a preassumed data distribution; that is, it does not need an explicit formulation of p(x) but samples from it directly. In theory it can fully approximate the real data distribution, which is the biggest advantage of the GAN.
The training and testing process of the GAN-generated music dataset is shown in Figure 11.
The generator-produced chord-section data and specific music-style data, together with the generator-produced multi-track chord-section data and multi-track groove data, are sent to the GAN for training, so as to generate music with a specific style and the corresponding groove.
5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the simulation experiments was recorded in a fully anechoic room and, with the factors discussed above taken into account, meets the practical needs of the speech conversion system. The library was recorded by a woman with a standard Mandarin accent and contains numbers, professional nouns, everyday words, etc., as the source speech. Another speaker then recorded a small number of sentences as the voice to be converted. Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. All voices are sampled at 16 kHz and quantized with 16 bits, and each utterance is set to 5 s in the experiment. Figures 13, 14, and 15 show the results for the source speaker's speech, the target speaker's speech, and the speech converted with STRAIGHT and the GMM model, respectively.
They show the MFCC three-dimensional maps of the source speech, the target speech, and the converted speech. The horizontal axis represents the audio duration, the vertical axis represents the frequency, and the color represents the corresponding energy. From the comparison of the graphs it can be seen directly that the spectrogram shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN (bar generators G map noise vectors z to generated bars G(z), which are combined across the chords, style, chords, and groove streams).
Figure 12: Tone control model. Training phase: STRAIGHT analysis of the source and target voices extracts the fundamental frequency F0 and the spectral envelope; the MFCC parameters are time-aligned with DTW, GMM training establishes the mapping rules, and the single-Gaussian method computes the mean and variance for F0 conversion. Conversion phase: the voice to be converted is analyzed with STRAIGHT, its MFCC parameters and fundamental frequency F0 are converted, and STRAIGHT synthesis produces the converted voice.
5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing. Conversions were then performed with Only_dura, dura_F0, dura_SP, and all_models, and the Beijing Opera produced by the four synthesis methods was compared with the original. Among them, Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental frequency control model and the duration control model; dura_SP uses the duration control model and the spectrum control model; all_models uses all three control models simultaneously. "Real" is the source Beijing Opera.
So the melody control model can be summarized inFigure 16
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based
Table 3: MOS grading

Score   Evaluation
1       Uncomfortable and unbearable
2       A sense of discomfort, but bearable
3       Distortion detectable, and uncomfortable
4       Distortion slightly perceptible, but no discomfort
5       Good sound quality, no distortion
Table 4: Experimental results (MOS scores)

Method       Beijing Opera 1   Beijing Opera 2   Beijing Opera 3
Only_dura    1.25              1.29              1.02
dura_F0      1.85              1.97              1.74
dura_SP      1.78              2.90              2.44
all_models   3.27              3.69              3.28
real         5                 5                 5
Figure 13: Source speech spectrogram (spectrogram frequency, 0-6000 Hz).
on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and scores the quality level to which the speech belongs according to these 5 levels. A MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. A MOS score lower than 3.0 is called synthetic speech quality: the speech has high intelligibility, but the naturalness is poor.
Ten testers were recruited to give MOS scores for the above synthesis results. The results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies; the melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment each word is segmented using its unique characteristics, such as the zero-crossing rate and the energy. Then the tone control model and the melody control model are designed, the important parameters such as the fundamental frequency, spectrum, and duration are extracted, and the extracted features are converted and analyzed with MFCC, DTW, GMM, and other tools, finally yielding the synthetic opera fragments.
Compared with other algorithms the straight algorithmhas better performance in terms of the natural degree ofsynthesis and the range of parameter modification so thestraight algorithm is also selected for the synthesis of theBeijing Opera
Again let the above-mentioned 10 testers perform MOSscoring on the above composite effect The result is shown inTable 5
According to the test results it can be seen that thesubjective test results reached an average of 37 pointsindicating that the design basically completed the BeijingOpera synthesis Although the Beijing Opera obtained by thesynthesis system tends to originate in Beijing Opera it is stillacoustically different from the real Beijing Opera
6 Conclusion
In this work we have presented three novel generativemodelsfor Beijing Opera synthesis under the frame work of thestraight algorithm GMM and GAN The objective metricsand the subjective user study show that the proposed modelscan achieve the synthesis of Beijing Opera Given the recententhusiasm in machine learning inspired art we hope tocontinue our work by introducing more complex models anddata representations that effectively capture the underlyingmelodic structure Furthermorewe feel thatmorework couldbe done in developing a better evaluationmetric of the qualityof a piece only then will we be able to train models that are
Advances in Multimedia 13
Table 5 Rating results
MOS Score
score studentsstudent1 student2 student3 student4 student5
Source Opera fragment 5 5 5 5 5Synthetic Opera fragment 4 4 4 3 3
score studentsstudent6 student7 student8 student9 student10
Source Opera fragment 5 5 5 5 5Synthetic Opera fragment 4 3 4 4 4
0
100020003000400050006000
Spec
trogr
amfre
quen
cy (H
Z)
05 302010 15 25 4035 45
Figure 14 Target speech spectrogram
0
100020003000400050006000
Spec
trogr
amfre
quen
cy (H
Z)
3020 4515 25 403505 10
Figure 15 Converted speech spectrogram
Syllable fundamentalfrequency
Spectrum Envelope
Time length control modelvoice Feature extraction
Syllable duration
Note fundamental frequency
Length of note
Spectrum control model
Time length control model
F0 control model
synthesisBeijing Opera
MIDI
Figure 16 Melody control model
truly able to compose the Beijing Opera singing art workswith higher quality
Data Availability
The [wav] data of Beijing Opera used to support thefindings of this study have been deposited in the [zenodo]repository [httpdoiorg105281zenodo344932] The pre-viously reported straight algorithm used is available at
httpwwwwakayama-uacjpsimkawaharaSTRAIGHTadvindex ehtmll The code is available upon request fromkawaharasyswakayama-uacjp
Conflicts of Interest
The authors declare that they have no conflicts of interest
14 Advances in Multimedia
Acknowledgments
This work is sponsored by (1) the NSFC Key Fundingno 61631016 (2) the Cross Project ldquoResearch on 3D AudioSpace and Panoramic Interaction Based on VRrdquo no3132017XNG1750 and (3) the School Project Funding no2018XNG1857
References
[1] D Schwarz ldquoCorpus-based concatenative synthesisrdquo IEEE Sig-nal Processing Magazine vol 24 no 2 pp 92ndash104 2007
[2] J Cheng YHuang andCWu ldquoHMM-basedMandarin singingvoice synthesis using Tailored Synthesis Units and QuestionSetsrdquo Computational Linguistics and Chinese Language Process-ing vol 18 no 4 pp 63ndash80 2013
[3] L Sheng Speaker Conversion Method Research South ChinaUniversity of Technology doctoral dissertation 2014
[4] Y Yang Chinese Phonetic Transformation System [Masterrsquosthesis] Beijing Jiaotong University 2008
[5] S Hasim and etal ldquoFast and accurate recurrent neural networkacoustic models for speech recognitionrdquo httpsarxivorgabs150706947
[6] B Tang Research on Speech Conversion Technology Based onGMMModel vol 9 2017
[7] J Bonada and X Serra ldquoSynthesis of the singing voice by per-formance sampling and spectralmodelsrdquo IEEE Signal ProcessingMagazine vol 24 no 2 pp 67ndash79 2007
[8] M W Macon L Jensen-Link J Oliverio M A Clements andE B George ldquoSinging voice synthesis system based on sinu-soidal modelingrdquo in Proceedings of the 1997 IEEE InternationalConference on Acoustics Speech and Signal Processing ICASSPPart 1 (of 5) pp 435ndash438 April 1997
[9] H Gu and Z Lin ldquoMandarin singing voice synthesis usingANN vibrato parameter modelsrdquo in Proceedings of the 2008International Conference on Machine Learning and Cybernetics(ICMLC) pp 3288ndash3293 Kunming China July 2008
[10] A De Cheveigne and H Kawahara ldquoYIN a fundamentalfrequency estimator for speech and musicrdquo The Journal of theAcoustical Society of America vol 111 no 4 pp 1917ndash1930 2002
[11] C Lianhong J Hou R Liu et al ldquoSynthesis of HMMparametric singing based on pitchrdquo in Proceedings of the 5thJoint Conference on Harmonious Human-machine EnvironmentXirsquoan 2009
[12] W Wanliang and L Zhuorong ldquoAdvances in generative adver-sarial networkrdquo Journal of Communications vol 39 2018
[13] I Goodfellow J Pouget-Abadie M Mirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the Advances in neuralinformation processing systems pp 2672ndash2680 2014
[14] A Radford L Metz and S Chintala ldquoUnsupervised represen-tation learning with deep convolutional generative adversarialnetworksrdquo httpsarxivorgabs151106434
[15] Interpretable representation learning by information maximiz-ing generative adversarial nets
[16] A Nguyen et al ldquoSynthesizing the preferred inputs forneurons in neural networks via deep generator networksrdquohttpsarxivorgabs160509304
[17] I Goodfellow et al Deep Learning MIT Press CambridgeMass USA 2016
Advances in Multimedia 7
However, in practice F0 always changes rapidly, so in order to reduce its impact on M, the formula is adjusted as in (20), (21), and (22).
\[
M = -\log\left[\int_{\Omega}\left(\frac{d|D|}{du} - \mu_{AM}\right)^{2} du\right]
  + \log\left[\int_{\Omega}|D|^{2}\,du\right]
  - \log\left[\int_{\Omega}\left(\frac{d^{2}\arg(D)}{du^{2}} - \mu_{FM}\right)^{2} du\right]
  + 2\log\tau_{0} + \log\Omega(\tau_{0})
\tag{20}\]
\[
\mu_{AM} = \frac{1}{\Omega}\int_{\Omega}\frac{d|D|}{du}\,du
\tag{21}\]
\[
\mu_{FM} = \frac{1}{\Omega}\int_{\Omega}\frac{d^{2}\arg(D)}{du^{2}}\,du
\tag{22}\]
Finally, τ0 is used to calculate the instantaneous frequency ω(t), and the fundamental frequency F0 is obtained from (23), (24), and (25).
\[ f_{0} = \frac{\omega_{0}(t)}{2\pi} \tag{23}\]
\[ \omega(t) = 2 f_{s} \arcsin\frac{\left|y_{d}(t)\right|}{2} \tag{24}\]
\[ y_{d}(t) = \frac{D\!\left(t + \frac{\Delta t}{2}, \tau_{0}\right)}{\left|D\!\left(t + \frac{\Delta t}{2}, \tau_{0}\right)\right|} - \frac{D\!\left(t - \frac{\Delta t}{2}, \tau_{0}\right)}{\left|D\!\left(t - \frac{\Delta t}{2}, \tau_{0}\right)\right|} \tag{25}\]
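As a minimal sketch of how (25), (24), and (23) chain together, the following assumes consecutive samples of the filter output D(t, τ0) stand in for the t ± Δt/2 pair (so Δt = 1/fs); the helper name is hypothetical:

```python
import numpy as np

def estimate_f0(d, fs):
    # Unit phasors D/|D| of the band output D(t, tau0); eq. (25) takes the
    # difference between phasors half a step before and after t.
    u = d / np.abs(d)
    yd = u[1:] - u[:-1]                              # eq. (25), dt = 1/fs
    omega = 2.0 * fs * np.arcsin(np.abs(yd) / 2.0)   # eq. (24)
    return omega / (2.0 * np.pi)                     # eq. (23)

# Idealized check: for a pure tone, D(t, tau0) reduces to a complex
# exponential and the estimator recovers the tone frequency.
fs = 16000.0
true_f0 = 220.0
t = np.arange(256) / fs
d = np.exp(2j * np.pi * true_f0 * t)
f0_track = estimate_f0(d, fs)
```

For an ideal tone the estimate is exact, since |y_d| = 2 sin(ωΔt/2) and the arcsin undoes the sine.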
3.1.2. Channel Spectral Parameter Extraction. The previous approach extracted the sound source information and the channel spectrum information from the voice and then adjusted them to modify the voice. However, since the two are often highly correlated, they cannot be modified independently, which affects the final result.
The relationship among the voice signal s(t), the channel parameter v(t), and the sound source parameter p(t) is given in (26):

\[ s(t) = p(t) \ast v(t) \tag{26}\]
Since it is difficult to find v(t) directly, the STRAIGHT algorithm calculates the frequency-domain expression of v(t) by short-time spectral analysis of s(t). The short-time spectrum is calculated by (27) and (28).
\[ s_{w}(t, t') = s(t)\, w(t, t') \tag{27}\]
\[ S_{W}(\omega, t') = \mathrm{FFT}\left[s_{w}(t, t')\right] = S(\omega, t')\, W(\omega, t') \tag{28}\]
The short-time spectrum shows the periodicity related to the fundamental frequency in the time domain and the frequency domain, respectively. The short-time spectrum window functions used are (29) and (30).
\[ w(t) = \frac{1}{f_{0}}\, e^{-\pi (t f_{0})^{2}} \tag{29}\]
\[ W(\omega) = f_{0}\sqrt{2\pi}\, e^{-\pi (\omega/\omega_{0})^{2}} \tag{30}\]
However, since both the channel spectrum and the sound source spectrum are still related to the fundamental frequency at this point, they cannot yet be considered separated. They need to be further processed in the time domain and the frequency domain to remove the periodicity and achieve the separation.
Removing the periodicity in the time domain requires the design of a pitch-synchronous smoothing window and a compensation window, as in (31), (32), and (33).
\[ w_{p}(t) = e^{-\pi (t/\tau_{0})^{2}} \ast h\!\left(\frac{t}{\tau_{0}}\right) \tag{31}\]
\[ h(t) = \begin{cases} 1 - |t| & (|t| < 1) \\ 0 & (\text{otherwise}) \end{cases} \tag{32}\]
\[ w_{c}(t) = w_{p}(t) \sin\!\left(\pi \times \frac{t}{\tau_{0}}\right) \tag{33}\]
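On a sampled time axis, the windows in (31)–(33) can be sketched as follows (hypothetical helper; the τ0 and fs values are illustrative only, and the discrete convolution approximates the continuous one):

```python
import numpy as np

def pitch_sync_windows(tau0, fs, half_width=3):
    # Time axis chosen so that t = 0 falls exactly on a sample.
    n_half = int(half_width * tau0 * fs)
    t = np.arange(-n_half, n_half + 1) / fs
    gauss = np.exp(-np.pi * (t / tau0) ** 2)
    h = np.maximum(1.0 - np.abs(t / tau0), 0.0)   # eq. (32): triangular pulse
    wp = np.convolve(gauss, h, mode="same") / fs  # eq. (31): Gaussian * h
    wc = wp * np.sin(np.pi * t / tau0)            # eq. (33): compensation
    return t, wp, wc

# Illustrative pitch period of 5 ms (200 Hz) at a 16 kHz sampling rate.
t, wp, wc = pitch_sync_windows(tau0=1.0 / 200.0, fs=16000.0)
```

The smoothing window wp is even and nonnegative, while the compensation window wc is odd, which is what lets the pair recover the spectral detail that smoothing alone would blur.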
Then the short-time amplitude spectra |S_p(ω, t′)| and |S_c(ω, t′)| are obtained with the two windows, respectively, and finally the short-time amplitude spectrum with the periodicity removed is obtained as (34):

\[ \left|S_{r}(\omega, t')\right| = \sqrt{\left|S_{p}(\omega, t')\right|^{2} + \xi \left|S_{c}(\omega, t')\right|^{2}} \tag{34}\]
Here ξ is the mixing factor; the optimal solution is obtained when ξ = 0.13655.
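A one-line sketch of the blend in (34) (the function name is hypothetical; the constant is the value quoted above):

```python
import numpy as np

XI = 0.13655  # mixing factor reported in the text

def remove_periodicity(sp, sc, xi=XI):
    # Eq. (34): power-domain blend of the smoothed amplitude spectrum |Sp|
    # with the compensation-window amplitude spectrum |Sc|.
    return np.sqrt(np.abs(sp) ** 2 + xi * np.abs(sc) ** 2)
```

With ξ = 1 the blend reduces to a plain root-sum-of-squares; the small ξ keeps the compensation term as a correction rather than an equal partner.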
Similarly, the frequency domain also needs a smoothing window V(ω) and a compensation window U(ω) to remove the periodicity of the short-time spectrum S_W(ω); the spectral envelope with the periodicity removed, S_S′(ω), is finally obtained as (35):

\[ S_{S'}(\omega) = S_{W}(\omega) \ast V(\omega) \ast U(\omega) \tag{35}\]
Finally, logarithmic amplitude compression and a mel-warped discrete cosine transform convert the channel spectral parameters into MFCC parameters for subsequent use (MFCC is described in detail in Section 2).
3.2. Parameter Conversion with GMM
3.2.1. GMM Profile. The Gaussian Mixture Model (GMM) [6] can be expressed as a linear combination of different Gaussian probability functions, as in (36):

\[ P(X \mid \lambda) = \sum_{i=1}^{M} \omega_{i}\, b_{i}(X) \tag{36}\]
where X is an n-dimensional random vector, ω_i is a mixture weight (the weights sum to 1), and b_i(X) is a subdistribution of the GMM; each subdistribution is a Gaussian distribution, as in (37):

\[ b_{i}(X) = \frac{1}{(2\pi)^{n/2} \left|\Sigma_{i}\right|^{1/2}} \exp\!\left(-\frac{1}{2}\left(X - \mu_{i}\right)^{T} \Sigma_{i}^{-1} \left(X - \mu_{i}\right)\right) \tag{37}\]

where μ_i is the mean vector and Σ_i is the covariance matrix.
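For illustration, (37) can be evaluated directly (hypothetical helper; a numerically naive version that uses the explicit inverse and determinant, which is fine for the low dimensions shown here):

```python
import numpy as np

def component_density(x, mu, cov):
    # Eq. (37): density of one n-dimensional Gaussian component.
    n = x.size
    diff = x - mu
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)
```

At x = μ in one dimension with unit variance this returns the familiar peak value 1/√(2π).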
Although the types of phonemes are definite, each phoneme varies in different situations due to context. We use the GMM to model the acoustic characteristics of the speaker and find the most likely mapping at each time.
3.2.2. Establishing the Conversion Function. GMM training estimates the probability density distribution of the samples, and the estimated model is a weighted sum of several Gaussian models. It maps the feature matrices of the source speech and the target speech, thereby increasing the accuracy and robustness of the algorithm and completing the connection between the two voices.
(1) The Conversion of the Fundamental Frequency. Here the single-Gaussian model method is used to convert the fundamental frequency; the converted fundamental frequency is obtained from the means and variances of the target speaker (μ_tgt, σ_tgt) and the source speaker (μ_src, σ_src), as in (38):

\[ f_{0,\mathrm{conv}}(t) = \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}} \times f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^{2}}{\sigma_{src}^{2}}} \times \mu_{src} \tag{38}\]
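A sketch of the single-Gaussian conversion in (38) (hypothetical helper; the speaker statistics below are made-up illustrative values):

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    # Eq. (38): scale the source F0 by the ratio of standard deviations,
    # then shift it onto the target speaker's mean.
    ratio = np.sqrt(sigma_tgt ** 2 / sigma_src ** 2)
    return ratio * f0_src + mu_tgt - ratio * mu_src

# A contour with mean 200 Hz mapped onto a target speaker with mean 300 Hz
# and twice the spread:
src = np.array([180.0, 200.0, 220.0])
out = convert_f0(src, mu_src=200.0, sigma_src=20.0, mu_tgt=300.0, sigma_tgt=40.0)
```

The transform is affine, so the converted contour keeps the shape of the source contour while matching the target's first- and second-order statistics.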
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output from the input data. The spectrum conversion function is defined by (39) and (40):

\[ F(X) = E\!\left[Y \mid X\right] = \int Y \, P(Y \mid X)\, dY = \sum_{i=1}^{M} P_{i}(X) \left[\mu_{i}^{Y} + \Sigma_{i}^{YX} \left(\Sigma_{i}^{XX}\right)^{-1} \left(X - \mu_{i}^{X}\right)\right] \tag{39}\]
\[ P_{i}(X) = \frac{\omega_{i}\, b_{i}(X_{t})}{\sum_{k=1}^{M} \omega_{k}\, b_{k}(X_{t})} \tag{40}\]
\[ \mu_{i} = \begin{bmatrix} \mu_{i}^{X} \\ \mu_{i}^{Y} \end{bmatrix}, \qquad \Sigma_{i} = \begin{bmatrix} \Sigma_{i}^{XX} & \Sigma_{i}^{XY} \\ \Sigma_{i}^{YX} & \Sigma_{i}^{YY} \end{bmatrix}, \qquad i = 1, \ldots, M \tag{41}\]
where μ_i^X and μ_i^Y are the means of the i-th Gaussian component for the source speaker and the target speaker, Σ_i^XX is the covariance matrix of the i-th Gaussian component of the source speaker, Σ_i^XY is the cross-covariance matrix of the i-th Gaussian component between the source speaker and the target speaker, and P_i(X) is the probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: Envelope of "Jiao Zhang Sheng yin cang zai qi pan zhi xia".
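Equations (39)–(40) reduce to a posterior-weighted linear regression across components; a scalar-feature sketch (hypothetical helper, illustrative parameters):

```python
import numpy as np

def gmm_convert(x, w, mu_x, mu_y, cov_xx, cov_yx):
    # Eqs. (39)-(40) for scalar features; all parameter arrays have shape (M,).
    lik = w * np.exp(-0.5 * (x - mu_x) ** 2 / cov_xx) / np.sqrt(2 * np.pi * cov_xx)
    post = lik / lik.sum()                          # P_i(x), eq. (40)
    comp = mu_y + (cov_yx / cov_xx) * (x - mu_x)    # per-component regression
    return float(np.sum(post * comp))               # eq. (39)
```

With a single component whose cross-covariance is 2 and source variance 1, the conversion is simply the regression line y = μ^Y + 2(x − μ^X), which is a useful sanity check when implementing the multi-component case.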
4. Melody Control Model
The composition of Beijing Opera is similar to the synthesis of a general singing voice [7, 8]: the new pitch of each word is reconstructed by superimposing voice and melody. The analysis in Section 2 shows that the major factors affecting the melody are the fundamental frequency, duration, and energy. Among them, the fundamental frequency has the greatest impact on the melody, as it indicates the frequency of vocal vibration; the duration and pronunciation length of each word control the rhythm of Beijing Opera, representing the speed of the voice; and energy, which is positively correlated with sound intensity, represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech pays more attention to the prose, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental frequency envelope of a Beijing Opera piece corresponds to the melody, which includes tone, pitch, and tremolo [9], whereas the pitch within a note is a constant; their comparison is shown in Figure 7.
From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. Therefore, the control design of the fundamental frequency [10] is shown in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has several syllables, and the initials and vowels in each syllable also play different roles. The initials, whether in ordinary speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. In order to ensure the naturalness of Beijing Opera, we use the note duration to control the length of each word and make the rules for the vowel length shown in Table 2.

Figure 8: The control design of the fundamental frequency (the F0 extracted from MIDI undergoes vibrato processing; Gaussian white noise is passed through a high-pass filter; the output is the F0 curve).

Table 2: Duration parameters
Part | Before modification | After modification
Initial | dur_a | k × dur_a
Initial-to-vowel transition | dur_b | dur_b
Final (vowel) | dur_c | dur_t − (k × dur_a) − dur_b

dur_a: initial part duration; dur_b: duration of the initial-to-vowel transition part; dur_c: final part duration; dur_t: target total duration.
The duration of the initial part is modified in accordance with the proportion k in Table 2 [11]; k is obtained from a large number of comparison experiments between speech and song. The duration of the initial-to-vowel transition region remains unchanged. The length of the vowel section varies so that the total duration of the syllable corresponds to the duration of each note in the score.
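The rules in Table 2 can be sketched as follows (hypothetical helper; the value of k is illustrative only, since it is tuned experimentally):

```python
def fit_syllable(dur_a, dur_b, dur_t, k=0.8):
    # Table 2: scale the initial part by k, keep the transition part,
    # and let the vowel absorb the remainder so the syllable exactly
    # spans the note duration dur_t. k = 0.8 is a made-up example value.
    new_a = k * dur_a
    new_c = dur_t - new_a - dur_b
    return new_a, dur_b, new_c

# Stretch a 0.15 s spoken syllable onto a 0.60 s note:
a, b, c = fit_syllable(dur_a=0.10, dur_b=0.05, dur_t=0.60)
```

By construction the three parts always sum to the note duration, which is what keeps the rhythm aligned with the score.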
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that a good singing voice has a special resonance peak in the spectrum in the vicinity of 2.5–3 kHz, and changes in the singing spectrum directly affect the listener's impression. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model
4.4.1. Introduction of the GAN. Generative adversarial networks (GANs) [12–17] are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. The training process consists of inputting random noise, obtaining pseudo data from the generator, taking a part of the real data, mixing the two, and sending them to the discriminator, which gives a true-or-false determination; the loss is returned according to this result. The purpose of a GAN is to estimate the potential distribution of data samples and generate new data samples. GANs are being extensively studied in the fields of image and visual computing and speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize the background music of Beijing Opera.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recording and collection of 5000 Beijing Opera background music tracks. The dataset preparation and data preprocessing are shown in Figure 9.
First of all, because some instruments sometimes have only a few notes in a piece of music, the data become too sparse, which affects the training process. Therefore, this data imbalance problem is solved by merging the tracks of similar instruments. Each multi-track Beijing Opera score is merged into five instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used instruments in Beijing Opera music.
Then the merged datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, the scores of parts without lyrics are not needed; only the soundtracks with Beijing Opera lyrics are selected.
Finally, in order to obtain meaningful music segments to train the time model, the Beijing Opera scores are divided into corresponding music segments. Four bars are treated as one passage, and longer passages are cut to the appropriate length. Because pitches that are too high or too low are uncommon, pitches lower than C1 or higher than C8 are discarded, so the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 5 (track). This completes the preparation and preprocessing of the dataset.
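The resulting piano-roll phrase tensor can be sketched as follows (the note events below are made-up illustrative values, with pitch already an index into the 84 semitones kept between C1 and C8 and track one of the five merged instruments):

```python
import numpy as np

# Hypothetical preprocessed note events: (bar, step, pitch, track).
notes = [(0, 0, 39, 0), (1, 24, 51, 1), (3, 95, 83, 4)]

# One 4-bar phrase: bar x time step x pitch x track, binary note-on grid.
phrase = np.zeros((4, 96, 84, 5), dtype=np.uint8)
for bar, step, pitch, track in notes:
    phrase[bar, step, pitch, track] = 1
```

96 time steps per bar gives a 24-steps-per-beat resolution in 4/4, fine enough to keep the ornamentation typical of Beijing Opera accompaniment.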
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure diagram used in this study is shown in Figure 10.
The basic framework of the GAN includes a pair of models, a generative model and a discriminative model. The main purpose is to generate pseudo data consistent with the true data distribution with the generator G, aided by the discriminator D. The input of the model is a random Gaussian white noise signal z; the noise signal is mapped to a new data space via the generator G to produce the generated data G(z). Next, the discriminator D outputs a probability value based on the inputs of the true data x and the generated data G(z), respectively, indicating D's confidence that the input is real data rather than generated false data. In this way, the quality of the data generated by G is judged. When D can no longer distinguish between the real data x and the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish between real data and false data, making D(x) as large as possible, D(G(z)) as small as possible, and the difference between the two as large as possible. Conversely, the goal of G is to make the performance D(G(z)) of its own data on D consistent with the performance D(x) of the real data, so that D cannot distinguish between generated data and real data. The optimization is therefore a process of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) is finally consistent with the performance D(x) on the real data, and neither G nor D can be further optimized.

Figure 9: Illustration of the dataset preparation and data preprocessing procedure (merging the tracks of the Beijing Opera soundtracks into five tracks: huqins, flutes, suonas, drums, and cymbals; screening the music for the best matching confidence and for the soundtracks with Beijing Opera lyrics; data cleaning to obtain the training datasets).

Figure 10: GAN structure diagram (random noise z ∼ p(z) is fed to the generator G, which outputs fake data G(z) as 4-bar phrases of 5 tracks; the discriminator/critic (WGAN-GP) receives G(z) and the real data x and outputs a real/fake decision).
The training process can be modeled as a simple MinMax problem, as in (42).
\[ \min_{G} \max_{D}\; D(x) - D(G(z)) \tag{42}\]

The MinMax optimization formula is defined as follows:

\[ \min_{G} \max_{D}\, V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{43}\]
The GAN does not require a preset data distribution; that is, it does not need to formulate a description of p(x) but samples from it directly. Theoretically, it can completely approximate the real data. This is the biggest advantage of the GAN.
The training and testing process of the GAN-generated music dataset is shown in Figure 11.
The generator-produced chord section data and the specific music style data, together with the generator-produced multi-track chord section data and the multi-track groove data, are sent to the GAN for training, so as to generate music with a specific style and the corresponding groove.
5. Experiment
5.1. Tone Control Model
5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in an anechoic room; with comprehensive consideration of the factors discussed above, it can better meet the actual needs of the speech conversion system. The voice library is recorded by a woman with a standard Mandarin accent and contains numbers, specialized nouns, everyday words, etc., as the source speech. Another person recorded a small number of statements as the voice to be converted. Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. Figures 13, 14, and 15 show the spectrograms of the source speaker's speech, the target speaker's speech, and the converted speech obtained with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and each utterance is set to 5 s during the experiment.
The MFCC three-dimensional maps of the source speech, the target speech, and the converted speech show the audio duration on the horizontal axis, the frequency on the vertical axis, and the corresponding energy as color. From the comparison of the graphs, it can be seen directly that the shape of the converted MFCC parameters is closer to the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN (the bar generator G maps noise vectors z to generated bars G(z); chord, style, and groove data condition the generation of the multi-track phrases).
Figure 12: Tone control model. In the training phase, STRAIGHT analysis extracts the fundamental frequency F0 and the spectral envelope from the source voice and the target voice; after time alignment with DTW, GMM training establishes the mapping rules, and the single-Gaussian model method calculates the mean and variance for tone conversion. In the conversion phase, the voice to be converted is analyzed with STRAIGHT, its MFCC parameters and fundamental frequency F0 are converted, and STRAIGHT synthesis produces the converted voice.
5.2. Melody Control Model
5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and converted using Only_dura, dura_F0, dura_SP, and all models; the Beijing Operas produced by the four synthesis methods were compared with the original Beijing Opera. Among them, Only_dura uses only the duration control model for synthesis; dura_F0 uses only the fundamental frequency control model and the duration control model; dura_SP uses only the duration control model and the spectrum control model; all models uses the three control models simultaneously; and "real" is the source Beijing Opera.
So the melody control model can be summarized inFigure 16
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of the speech conversion system is also based on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

Table 3: MOS grading
Score | MOS evaluation
1 | Uncomfortable and unbearable
2 | There is a sense of discomfort, but it can be endured
3 | Distortion can be detected and feels uncomfortable
4 | Slightly perceptible distortion, but no discomfort
5 | Good sound quality, no distortion

Table 4: Experimental results (MOS scores)
Method | Beijing Opera 1 | Beijing Opera 2 | Beijing Opera 3
Only_dura | 1.25 | 1.29 | 1.02
dura_F0 | 1.85 | 1.97 | 1.74
dura_SP | 1.78 | 2.90 | 2.44
all models | 3.27 | 3.69 | 3.28
real | 5 | 5 | 5

Figure 13: Source speech spectrogram.
The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and gives the score of the quality level to which the measured speech belongs according to these 5 levels. A MOS score of about 3.5 is called communication quality; at this level, the auditory quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. If the MOS score is lower than 3.0, it is called synthetic speech quality; at this level, the speech has high intelligibility, but the naturalness is poor.
Ten testers were asked to give MOS scores for the above synthesis results. The results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is distinguished by its unique characteristics, such as the zero-crossing rate and energy. Then the tone control model and the melody control model are designed; important parameters such as the fundamental frequency, spectrum, and time are extracted, analyzed, and converted using MFCC, DTW, GMM, and other tools; and the converted characteristics are finally synthesized into the opera fragments.
Compared with other algorithms, the STRAIGHT algorithm has better performance in terms of the naturalness of the synthesis and the range of parameter modification, so the STRAIGHT algorithm is also selected for the synthesis of the Beijing Opera.
The above-mentioned 10 testers were again asked to perform MOS scoring on the synthesis results. The results are shown in Table 5.
According to the test results, the subjective scores reached an average of 3.7 points, indicating that the design basically completed the Beijing Opera synthesis. Although the Beijing Opera obtained by the synthesis system approaches real Beijing Opera, it is still acoustically different from it.
Table 5: Rating results (MOS scores)
 | student1 | student2 | student3 | student4 | student5
Source opera fragment | 5 | 5 | 5 | 5 | 5
Synthetic opera fragment | 4 | 4 | 4 | 3 | 3
 | student6 | student7 | student8 | student9 | student10
Source opera fragment | 5 | 5 | 5 | 5 | 5
Synthetic opera fragment | 4 | 3 | 4 | 4 | 4

Figure 14: Target speech spectrogram.

Figure 15: Converted speech spectrogram.

Figure 16: Melody control model (feature extraction from the voice yields the syllable fundamental frequency, the spectrum envelope, and the syllable duration; the MIDI score provides the note fundamental frequency and the note length; the F0 control model, the time length control model, and the spectrum control model together drive the Beijing Opera synthesis).

6. Conclusion

In this work, we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done in developing a better evaluation metric of the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.
Data Availability
The .wav data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported STRAIGHT algorithm is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR", no. 3132017XNG1750; and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92–104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63–80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67–79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "A singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435–438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288–3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen et al., "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
[14] A Radford L Metz and S Chintala ldquoUnsupervised represen-tation learning with deep convolutional generative adversarialnetworksrdquo httpsarxivorgabs151106434
[15] Interpretable representation learning by information maximiz-ing generative adversarial nets
[16] A Nguyen et al ldquoSynthesizing the preferred inputs forneurons in neural networks via deep generator networksrdquohttpsarxivorgabs160509304
[17] I Goodfellow et al Deep Learning MIT Press CambridgeMass USA 2016
International Journal of
AerospaceEngineeringHindawiwwwhindawicom Volume 2018
RoboticsJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Active and Passive Electronic Components
VLSI Design
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Shock and Vibration
Hindawiwwwhindawicom Volume 2018
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawiwwwhindawicom
Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Control Scienceand Engineering
Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Journal ofEngineeringVolume 2018
SensorsJournal of
Hindawiwwwhindawicom Volume 2018
International Journal of
RotatingMachinery
Hindawiwwwhindawicom Volume 2018
Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Navigation and Observation
International Journal of
Hindawi
wwwhindawicom Volume 2018
Advances in
Multimedia
Submit your manuscripts atwwwhindawicom
8 Advances in Multimedia
where X is an n-dimensional random vector, \omega_i is a mixture weight with \sum_{i=1}^{M} \omega_i = 1, and b_i(X) is a component of the GMM, each of which is a Gaussian distribution:

\[ b_i(X) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\Big( -\frac{1}{2} (X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i) \Big) \quad (37) \]

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.
Although the set of phoneme types is fixed, each phoneme varies across contexts. We use a GMM to model the speaker's acoustic characteristics and to find the most likely mapping at each instant.
3.2.2. Establishing the Conversion Function. A GMM estimates the probability density distribution of the samples; the trained model is a weighted sum of several Gaussian components. It maps the feature matrices of the source speech to those of the target speech, which increases the accuracy and robustness of the algorithm and establishes the connection between the two voices.
(1) Conversion of the Fundamental Frequency. A single-Gaussian method is used to convert the fundamental frequency. The converted fundamental frequency is obtained from the mean and standard deviation of the target speaker (\mu_{tgt}, \sigma_{tgt}) and of the source speaker (\mu_{src}, \sigma_{src}) as in (38):

\[ f_{0,conv}(t) = \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \, f_{0,src}(t) + \mu_{tgt} - \sqrt{\frac{\sigma_{tgt}^2}{\sigma_{src}^2}} \, \mu_{src} \quad (38) \]
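Equation (38) is a linear mean-variance transformation of the F0 contour. The following minimal NumPy sketch illustrates it; the function name and the toy statistics are ours, and in practice the means and variances are often computed on log F0:

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Single-Gaussian F0 conversion, Eq. (38): scale the source F0
    contour by the ratio of target/source standard deviations, then
    shift it onto the target speaker's mean."""
    ratio = sigma_tgt / sigma_src
    return ratio * np.asarray(f0_src, dtype=float) + mu_tgt - ratio * mu_src

# A frame at the source mean (200 Hz) maps exactly to the target mean
# (300 Hz); deviations from the mean are scaled by sigma_tgt/sigma_src.
converted = convert_f0([200.0, 210.0, 190.0], 200.0, 20.0, 300.0, 40.0)
# converted -> [300.0, 320.0, 280.0]
```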
(2) Channel Spectrum Conversion. The model's mapping rule is a linear regression function whose purpose is to predict the required output data from the input data. The spectrum conversion function is defined as

\[ F(X) = E[Y \mid X] = \int Y \, P(Y \mid X) \, dY = \sum_{i=1}^{M} P_i(X) \Big[ \mu_i^Y + \Sigma_i^{YX} (\Sigma_i^{XX})^{-1} (X - \mu_i^X) \Big] \quad (39) \]

\[ P_i(X) = \frac{\omega_i b_i(X_t)}{\sum_{k=1}^{M} \omega_k b_k(X_t)} \quad (40) \]

\[ \mu_i = \begin{bmatrix} \mu_i^X \\ \mu_i^Y \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \Sigma_i^{XX} & \Sigma_i^{XY} \\ \Sigma_i^{YX} & \Sigma_i^{YY} \end{bmatrix}, \qquad i = 1, \dots, M \quad (41) \]
\mu_i^X and \mu_i^Y are the means of the i-th Gaussian component for the source speaker and the target speaker, \Sigma_i^{XX} is the covariance matrix of the i-th Gaussian component of the source speaker, \Sigma_i^{YX} is the cross-covariance matrix of the i-th Gaussian component between the target and source speakers, and P_i(X) is the posterior probability that the feature vector X belongs to the i-th Gaussian component of the GMM.

Figure 7: "Jiao Zhang Sheng yin cang zai qi pan zhi xia" envelope (amplitude roughly -0.05 to 0.45 over 0-7 s).
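For intuition, the regression (39)-(41) can be sketched with one-dimensional features, where scalars stand in for the covariance matrices; the function name and all parameter values below are invented for illustration, not taken from the paper's trained model:

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, var_xx, cov_yx):
    """GMM-based conversion for 1-D features.
    weights: (M,) mixture weights; mu_x, mu_y: (M,) component means;
    var_xx: (M,) source variances; cov_yx: (M,) cross-covariances."""
    # Eq. (40): posterior probability P_i(x) of each component given x
    b = np.exp(-0.5 * (x - mu_x) ** 2 / var_xx) / np.sqrt(2 * np.pi * var_xx)
    post = weights * b
    post /= post.sum()
    # Eq. (39): posterior-weighted linear regression toward the target
    return float(np.sum(post * (mu_y + cov_yx / var_xx * (x - mu_x))))

weights = np.array([0.5, 0.5])
mu_x = np.array([0.0, 10.0]); mu_y = np.array([1.0, 20.0])
var_xx = np.array([1.0, 1.0]); cov_yx = np.array([0.5, 0.5])
# x = 0 lies on the first component's mean, so the output is close to
# that component's target mean mu_y[0] = 1.0
y = gmm_convert(0.0, weights, mu_x, mu_y, var_xx, cov_yx)
```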
4. Melody Control Model
The composition of Beijing Opera is similar to the synthesis of general singing voice [7, 8]: by superimposing voice and melody, the new pitch of each word is reconstructed. The analysis in the second chapter shows that the major factors affecting the melody are the fundamental frequency, duration, and energy. Among them, the fundamental frequency has the greatest impact on the melody: it indicates the frequency of vocal-fold vibration. The duration of each word controls the rhythm of Beijing Opera, representing the speed of the voice. Energy is positively correlated with sound intensity and represents the emotion.
4.1. The Fundamental Frequency Conversion Model. Although both speech and Beijing Opera are produced by the same human organs, speech is closer to prose, while Beijing Opera emphasizes the emotional expression of the melody. Most of the features of the melody lie in the fundamental frequency. The fundamental frequency envelope of a Beijing Opera piece corresponds to its melody, which includes tone, pitch, and tremolo [9]. In a score, however, the pitch of a note is a constant; the comparison between the two is shown in Figure 7.

From this we can see that the fundamental frequency can be used to control the melody of a Beijing Opera piece, but acoustic effects such as vibrato need to be considered. The control design of the fundamental frequency [10] is therefore as shown in Figure 8.
4.2. Time Control Model. Each word in Chinese usually has different syllables, and the initials and vowels in each syllable also play different roles. The initials, whether in normal speech or in Beijing Opera, usually play a supporting role, while the vowels carry the pitch and most of the pitch information. To ensure the naturalness of the synthesized Beijing Opera, we use the note duration to control the length of each word and set the rules for the vowel length shown in Table 2.

Figure 8: The control design of the fundamental frequency (the F0 extracted from MIDI undergoes vibrato processing; Gaussian white noise is passed through a high-pass filter; the two are combined to output the fundamental frequency curve).

Table 2: Duration parameters

    Before modification    After modification
    dur_a                  k * dur_a
    dur_b                  dur_b
    dur_c                  dur_t - (k * dur_a) - dur_b

dur_a: initial part duration; dur_b: duration of the initial-to-vowel transition part; dur_c: final (vowel) part duration; dur_t: target total duration.

The duration of the initial part is modified according to a proportion k [11], obtained from a large number of comparison experiments between speech and song. The duration of the initial-to-vowel transition region remains unchanged. The length of the vowel section varies so that the total duration of the syllable matches the duration of the corresponding note in the score.
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
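The Table 2 rule can be written as a small function; this is a minimal sketch whose function name and example value of k are hypothetical (the paper derives k from speech/song comparison experiments):

```python
def modify_durations(dur_a, dur_b, dur_t, k):
    """Table 2 rule: scale the initial part by k, keep the
    initial-to-vowel transition fixed, and let the vowel part absorb
    the difference so the syllable fills the note duration dur_t."""
    new_a = k * dur_a
    new_b = dur_b
    new_c = dur_t - new_a - new_b   # modified vowel (final) duration
    if new_c <= 0:
        raise ValueError("note too short for this syllable")
    return new_a, new_b, new_c

# A syllable stretched to fill a 0.80 s note (k = 0.8 is a made-up value).
a, b, c = modify_durations(dur_a=0.10, dur_b=0.05, dur_t=0.80, k=0.8)
# a + b + c equals the note duration 0.80 exactly
```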
4.3. Spectrum Control Model. The vocal tract is a resonant cavity, and the spectral envelope reflects its resonant properties. Studies have found that in well-trained singing the spectrum shows a distinctive resonance peak near 2.5-3 kHz (often called the singer's formant), and changes in the singing spectrum directly affect the listener's perception. In order to synthesize music of high naturalness, the spectral envelope of the speech signal is usually corrected according to the unique spectral characteristics of the singing voice.
4.4. GAN Model

4.4.1. Introduction of the GAN Network. Generative adversarial networks (GANs) [12-17] are currently a hot research direction in artificial intelligence. A GAN consists of a generator and a discriminator. During training, random noise is fed to the generator to obtain pseudo data; a portion of the real data is mixed with this pseudo data and sent to the discriminator, which gives a true-or-false decision, and the loss is propagated back according to this result. The purpose of a GAN is to estimate the underlying distribution of the data samples and to generate new samples. GANs are being studied extensively in image and visual computing and in speech and language processing, and they have huge application prospects. This study uses a GAN to synthesize the music that accompanies Beijing Opera.
4.4.2. Selection of Test Datasets. The Beijing Opera score dataset used in this study is a recorded collection of 5000 Beijing Opera background music tracks. The dataset is processed as shown in Figure 9: dataset preparation and data preprocessing.
First of all, some instruments sometimes play only a few notes in a piece of music; this makes the data too sparse and hampers training. It is therefore necessary to address this data imbalance by merging the sound tracks of similar instruments. Each multitrack Beijing Opera score is merged down to five instruments: huqins, flutes, suonas, drums, and cymbals. These five types are the most commonly used instruments in Beijing Opera music.

Then the merged-track datasets are filtered to select the music with the best matching confidence. In addition, because Beijing Opera arias need to be synthesized, score sections without lyrics are not what we need, so only the soundtracks that accompany Beijing Opera lyrics are selected.

Finally, in order to obtain meaningful music segments to train the time model, the Beijing Opera scores are divided into corresponding music segments. Four bars are treated as one passage, and longer passages are cut to this length. Because pitches below C1 or above C8 are uncommon, they are discarded, so the target output tensor is 4 (bar) x 96 (time step) x 84 (pitch) x 5 (track). This completes the preparation and preprocessing of the dataset.
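The target shape can be sketched as a binary piano-roll tensor; the helper names and the exact MIDI-pitch-to-row mapping (C1 = MIDI note 24) are our assumptions, since the paper only fixes the 4 x 96 x 84 x 5 shape:

```python
import numpy as np

BARS, STEPS, PITCHES, TRACKS = 4, 96, 84, 5   # 84 pitch rows span C1..B7

def empty_phrase():
    """One 4-bar phrase: 4 (bar) x 96 (time step) x 84 (pitch) x 5 (track)."""
    return np.zeros((BARS, STEPS, PITCHES, TRACKS), dtype=np.uint8)

def set_note(phrase, bar, step, midi_pitch, track):
    """Place a note; pitches outside the C1..C8 range are dropped."""
    row = midi_pitch - 24            # C1 (MIDI 24) maps to row 0
    if 0 <= row < PITCHES:
        phrase[bar, step, row, track] = 1

phrase = empty_phrase()
set_note(phrase, bar=0, step=0, midi_pitch=60, track=0)  # middle C, track 0
set_note(phrase, bar=0, step=0, midi_pitch=12, track=0)  # C0: out of range, dropped
```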
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure used in this study is shown in Figure 10.

Figure 9: Illustration of the dataset preparation and data preprocessing procedure (merge five tracks: huqins, flutes, suonas, drums, and cymbals; screen the combined datasets for the best matching confidence and for soundtracks with Beijing Opera lyrics; clean the data to form the training datasets).

Figure 10: GAN structure diagram (random noise z ~ p(z) is fed to the generator G; the discriminator/critic (WGAN-GP) receives the fake data G(z) and the real data X, 4-bar phrases of 5 tracks, and outputs real/fake).

The basic framework of a GAN comprises a pair of models: a generative model and a discriminative model. The main purpose is for the generator G, aided by the discriminator D, to produce pseudo data consistent with the true data distribution. The input of the model is a random Gaussian white-noise signal z; the generator G maps the noise to a new data space to produce the generated data G(z). The discriminator D then outputs a probability value for each of the true data x and the generated data G(z), indicating D's confidence that its input is real data rather than generated data. In this way, the quality of the data generated by G is judged. When D can no longer distinguish the real data x from the generated data G(z), the generator G is considered optimal.

The goal of D is to distinguish real data from false data, making D(x) as large as possible, D(G(z)) as small as possible, and the difference between the two as large as possible. The goal of G, conversely, is to make the response D(G(z)) of its own data consistent with the response D(x) of the real data, so that D cannot distinguish generated data from real data. The optimization is therefore a process of mutual competition and confrontation: the performance of G and D improves over repeated iterations until D(G(z)) is consistent with the performance D(x) on the real data, and neither G nor D can be further optimized.
The training process can be modeled as a simple MinMax problem:

\[ \min_G \max_D \; D(x) - D(G(z)) \quad (42) \]

The MinMax optimization formula is defined as follows:

\[ \min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (43) \]
The GAN does not require a preset data distribution; that is, it does not need an explicit formulation of p(x) but learns to sample from it directly. In theory it can approximate the real data completely; this is the biggest advantage of the GAN.
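The two sides of the objective in (43) can be evaluated numerically. The sketch below computes the discriminator and generator losses from example D outputs; the function names are ours, and the gradient-based training loop is omitted:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator side of Eq. (43): maximize
    E[log D(x)] + E[log(1 - D(G(z)))], returned negated as a loss."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def g_loss(d_fake):
    """Generator side: minimize E[log(1 - D(G(z)))]."""
    return np.mean(np.log(1.0 - d_fake))

# A fooled discriminator outputs 0.5 everywhere, giving d_loss = 2*log 2,
# and the generator's loss drops as D(G(z)) approaches 1.
```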
The training and testing process on the GAN-generated music dataset is shown in Figure 11.

Generator-produced chord-section data, specific music-style data, multitrack chord-section data, and multitrack groove data are sent to the GAN for training, so that the network learns to generate music of a specific style with the corresponding groove.
5. Experiment

5.1. Tone Control Model

5.1.1. Experimental Process. The voice library used in the experimental simulation was recorded in an anechoic room, with the factors discussed above taken into account, so that it better meets the practical needs of the speech conversion system. The library was recorded by a woman with a standard Mandarin accent and contains numbers, technical nouns, everyday words, etc., as the source speech. Another speaker then recorded a small number of sentences as the voice to be converted. Figure 12 shows the tone conversion process.
5.1.2. Experimental Results. Figures 13, 14, and 15 show the spectrograms of the source speaker's speech, the target speaker's speech, and the speech converted with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and each utterance is set to 5 s during the experiment.

The figures show three-dimensional maps of the MFCCs of the source speech, the target speech, and the converted speech. The horizontal axis represents time, the vertical axis represents frequency, and the color represents the corresponding energy. Comparing the figures, it can be seen directly that the spectrogram shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
Figure 11: Training and testing process of the GAN (the bar generator G maps noise vectors z to generated bars G(z), which are assembled across tracks into chords, style, and groove).
Figure 12: Tone control model (training phase: STRAIGHT analysis extracts the fundamental frequency F0 and spectral envelope from the source and target voices; DTW performs time alignment; GMM training establishes the mapping rules, with the single-Gaussian method computing the F0 mean and variance. Conversion phase: STRAIGHT analysis of the voice to be converted, MFCC parameter conversion and F0 conversion, then STRAIGHT synthesis).
5.2. Melody Control Model

5.2.1. Experimental Process. In order to evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and converted using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Opera produced by the four synthesis methods was compared with the original. Here Only_dura uses only the duration control model; dura_F0 uses the fundamental frequency control model and the duration control model; dura_SP uses the duration control model and the spectrum control model; all_models uses all three control models simultaneously; and 'real' is the source Beijing Opera.
So the melody control model can be summarized in Figure 16.
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Therefore, evaluating the performance of a speech conversion system is also based on human auditory evaluation. In the existing subjective evaluation framework, the MOS test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS criterion divides speech quality into 5 levels; see Table 3. The testers listen to the converted speech and score the quality level of the measured speech according to these 5 levels. A MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but this does not prevent people from talking normally. A MOS score below 3.0 is called synthetic speech quality: the speech is highly intelligible, but its naturalness is poor.

Table 3: MOS grading

    Score    Evaluation
    1        Uncomfortable and unbearable
    2        Some discomfort, but bearable
    3        Distortion is perceptible and uncomfortable
    4        Distortion is slightly perceptible but causes no discomfort
    5        Good sound quality, no distortion

Table 4: Experimental results (MOS scores)

    Method       Beijing Opera 1    Beijing Opera 2    Beijing Opera 3
    Only_dura    1.25               1.29               1.02
    dura_F0      1.85               1.97               1.74
    dura_SP      1.78               2.90               2.44
    all_models   3.27               3.69               3.28
    real         5                  5                  5

Figure 13: Source speech spectrogram (frequency 0-6000 Hz; time axis in seconds).
Ten testers were asked to give MOS scores for the above synthesis results; the results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is segmented using distinctive features such as the zero-crossing rate and energy. The tone control model and the melody control model are then applied: important parameters such as the fundamental frequency, spectrum, and duration are extracted and converted using MFCC, DTW, GMM, and other tools, and finally the opera fragments are synthesized.

Compared with other algorithms, the STRAIGHT algorithm performs better in terms of the naturalness of the synthesis and the range over which parameters can be modified, so STRAIGHT is also chosen for the synthesis of the Beijing Opera.
The above-mentioned 10 testers were again asked to give MOS scores for the synthesized result; the ratings are shown in Table 5.

According to the test results, the subjective scores average 3.7 points, indicating that the design basically accomplishes Beijing Opera synthesis. Although the Beijing Opera produced by the synthesis system tends toward the original, it is still acoustically different from real Beijing Opera.
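The 3.7-point average follows directly from the synthetic-fragment row of Table 5:

```python
# Synthetic Opera fragment MOS scores from the ten students in Table 5
scores = [4, 4, 4, 3, 3, 4, 3, 4, 4, 4]
mean_mos = sum(scores) / len(scores)
print(mean_mos)  # -> 3.7
```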
6. Conclusion
In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric for the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks of higher quality.

Table 5: Rating results (MOS scores)

    Fragment                   s1  s2  s3  s4  s5  s6  s7  s8  s9  s10
    Source Opera fragment      5   5   5   5   5   5   5   5   5   5
    Synthetic Opera fragment   4   4   4   3   3   4   3   4   4   4

Figure 14: Target speech spectrogram (frequency 0-6000 Hz; time axis in seconds).

Figure 15: Converted speech spectrogram (frequency 0-6000 Hz; time axis in seconds).

Figure 16: Melody control model (feature extraction from the voice yields the syllable fundamental frequency, spectrum envelope, and syllable duration; MIDI provides the note fundamental frequency and note length; the F0 control model, time length control model, and spectrum control model combine for Beijing Opera synthesis).
Data Availability
The WAV data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported STRAIGHT algorithm is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding no. 61631016, (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR" no. 3132017XNG1750, and (3) the School Project Funding no. 2018XNG1857.
References
[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, master's thesis, Beijing Jiaotong University, 2008.
[5] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on the GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "A singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] X. Chen, Y. Duan, R. Houthooft et al., "InfoGAN: interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
Advances in Multimedia 9
The fundamental frequency extractedfrom MIDI
Vibratoprocessing
Gaussian white noise High pass filter
Output of basic frequency curve
Figure 8 The control design of the fundamental frequency
Table 2 Duration parameters
Duration parametersBefore modification After modification
dur a klowastdur adur b dur bdur c dur t ndash (klowastdur a) ndash dur b
dur a initial part duration dur b initial part to vowel transition partduration dur c final part duration and dur t target total duration
also play different roles The initials whether normal orBeijing Opera usually play a supporting role while vowelscarry pitch and most of the pitch information In order toensure the naturalness of Beijing Opera we use the noteduration to control the length of each word and make therules for the vowel length shown in Table 2
Initial part of the length of time is in accordance withthe proportion [11] (k in the table) to be modified k is alot of voice and song comparison experiments are obtainedThe duration of the area with the initials to vowels transitionremains unchangedThe length of the vowel section varies sothat the total duration of the syllable can correspond to theduration of each note in the score
The method of dividing the vowel boundaries is intro-duced in Section 2 and will not be repeated here
43 Spectrum Control Model The vocal tract is a resonantcavity and the spectral envelope reflects its resonant proper-ties Studies have found good vibes singing the spectrum inthe vicinity of 25-3 kHz has a special resonance farm andsinging spectrum changes will directly affect the people ofthe partyrsquos results In order to synthesize music of high nat-uralness the spectral envelope of the speech signal is usuallycorrected according to the unique spectral characteristics ofthe singing voice
44 GAN Model
441 Introduction of GAN Network Generative adversarialnetworks abbreviated as GAN [12ndash17] are currently a hotresearch direction in artificial intelligenceThe GAN consistsof generators and discriminators The training process isinputting random noise obtaining pseudo data by the gener-ator taking a part of the real data from the true data mixingthe two and sending the data to the discriminator givinga true or false determination result and according to thisresult the return loss The purpose of GAN is to estimate thepotential distribution of data samples and generate new datasamples It is being extensively studied in the fields of image
and visual computing speech and language processing andhas a huge application prospect This study uses GAN tosynthesize music to compose Beijing Opera music
442 Selection of Test Datasets The Beijing Opera scoredataset that needs to be used in this study is the recording andcollection of 5000 Beijing Opera background music tracksThe dataset is processed as shown in the Figure 9 datasetpreparation and data preprocessing
First of all because sometimes some instruments haveonly a few notes in a piece of music this situation makes thedata too sparse and affects the training process Therefore itis necessary to solve this data imbalance problem by mergingthe sound tracks of similar instruments Each of the multi-track Beijing Opera scores is incorporated into five musicalinstruments huqins flutes suonas drums and cymbalsThese five types of instruments are the most commonly usedmusical instruments in Beijing Opera music
Then we will filter the datasets after the merged tracksto select the music with the best matching confidence Inaddition because the Beijing Opera arias need to be synthe-sized the scores in the part of the BeijingOperawithout lyricsare not what we need Also select the soundtracks of BeijingOpera lyrics
Finally in order to obtain a meaningful music segmentto train the time model it is necessary to divide the PekingOpera score and obtain the corresponding music segmentThink of the 4 bars as a passage and cut the longer passageinto the appropriate length Because pitches that are too highor too low are not common and are therefore less than C1 orhigher than C8 the target output tensor is 4 (bar) times 96 (timestep) times 84 (pitch) times 5 (track) This completes the preparationand preprocessing of the dataset
4.4.3. Training and Testing of the GAN Structure and Datasets. The GAN structure used in this study is shown in Figure 10.
The basic framework of a GAN comprises a pair of models: a generative model and a discriminative model. The main purpose is for the generator G, aided by the discriminator D, to produce pseudo data consistent with the true data distribution. The input to the model is a random Gaussian white noise signal z; the generator G maps this noise into a new data space to produce the generated data G(z). The discriminator D then outputs a probability value for the real data x and for the generated data G(z), respectively, indicating D's confidence that its input is real data rather than generated fake data. In this way, the quality of the data produced by G is judged: when D can no longer distinguish the real data x from the generated data G(z), the generator G is considered optimal.
The goal of D is to distinguish real data from fake data, making D(x) as large as possible and D(G(z)) as small as possible, so that the gap between the two is maximized. The goal of G, conversely, is to make the response D(G(z)) to its generated data consistent with
Figure 9: Illustration of the dataset preparation and data preprocessing procedure (Beijing Opera soundtracks -> merge five instrument tracks: huqins, flutes, suonas, drums, cymbals -> screen for best matching confidence and for soundtracks carrying Beijing Opera lyrics -> data cleaning -> training datasets).
Figure 10: GAN structure diagram (random noise z ~ p(z) -> generator G -> fake data G(z); fake data and real data x -> discriminator/critic D (WGAN-GP) -> real/fake decision; output: 4-bar phrases of 5 tracks).
the response D(x) to the real data, so that D cannot distinguish generated data from real data. The optimization process is therefore one of mutual competition and confrontation: the performance of G and D improves continuously over repeated iterations until D(G(z)) is finally consistent with D(x) on the real data, at which point neither G nor D can be further optimized.
The training process can be modeled as a simple min-max problem:

    min_G max_D [D(x) - D(G(z))]                                    (42)

The min-max optimization objective is defined as follows:

    min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)]
                        + E_{z~p_z(z)}[log(1 - D(G(z)))]            (43)
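As a concrete reading of the objective in (43), the discriminator and generator losses can be computed from D's outputs on real and generated batches. This is an illustrative NumPy sketch, not the training code used here; the non-saturating generator loss -log D(G(z)) shown below is the variant commonly used in practice in place of minimizing log(1 - D(G(z))).

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """d_real = D(x) on real samples, d_fake = D(G(z)) on generated ones.
    D maximizes V(D, G), i.e., minimizes the negated objective below."""
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Eq. (43) has G minimize log(1 - D(G(z))); the equivalent
    # non-saturating form maximizes log D(G(z)) instead.
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

# An undecided discriminator (outputting 0.5 everywhere) gives d_loss = 2 log 2,
# the equilibrium value at which D can no longer tell x from G(z).
d_loss, g_loss = gan_losses(np.full(8, 0.5), np.full(8, 0.5))
```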
A GAN does not require a preset form for the data distribution; that is, it does not need an explicit formulation of p(x) but samples from the distribution directly. In theory it can approximate the real data distribution arbitrarily closely, which is the biggest advantage of the GAN.
The training and testing process for the GAN-generated music dataset is shown in Figure 11.
The generator produces chord-section data paired with data of a specific musical style, as well as multi-track chord-section data paired with multi-track groove data; these are sent to the GAN for training, so that the network learns to generate music of the specified style with the corresponding groove.
5. Experiment

5.1. Tone Control Model
5.1.1. Experimental Process. The voice library used in the experimental simulation of this article was recorded in an anechoic room; with the factors discussed earlier taken into account, it meets the practical needs of the speech conversion system. The library was recorded by a woman speaking with a standard Mandarin accent and contains numbers, technical nouns, everyday words, and so on, used as the source speech. Another speaker then recorded a small number of sentences as the voice to be converted. Figure 12 shows the tone conversion process.
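The time-alignment step that precedes GMM training in this pipeline can be illustrated with the classic dynamic-time-warping recursion. This is a minimal sketch over 1-D feature sequences, written for exposition; the actual system applies DTW to multidimensional MFCC frames.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences,
    filling the cumulative-cost matrix D row by row."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[len(a), len(b)]

src = np.array([0.0, 1.0, 2.0, 1.0])
tgt = np.array([0.0, 1.0, 1.0, 2.0, 1.0])  # same contour, time-stretched
```

Because DTW may repeat a frame to absorb the stretch, the two sequences above align with zero cost, which is exactly why alignment is needed before frame-by-frame GMM training.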
5.1.2. Experimental Results. Figures 13, 14, and 15 show the spectrograms of the source speaker's speech, the target speaker's speech, and the converted speech obtained with STRAIGHT and the GMM model, respectively. All voices are sampled at 16 kHz and quantized with 16 bits, and the duration of each voice is set to 5 s during the experiment.
Three-dimensional MFCC maps of the source speech, the target speech, and the converted speech were also compared: the horizontal axis represents the audio duration, the vertical axis the frequency, and the color the corresponding energy. The comparison shows directly that the shape of the converted MFCC parameters is closer to that of the target speech, indicating that the converted speech features tend toward the target speech features.
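The single-Gaussian branch of the tone control model ("calculate the mean and variance", as in Figure 12) reduces, per feature dimension, to a mean-variance transform. The sketch below is an illustrative simplification, for example of log-F0 conversion, and is not the full GMM-based MFCC mapping.

```python
import numpy as np

def mean_var_convert(x, src_feats, tgt_feats):
    """Map feature frames x from source statistics to target statistics:
    y = mu_t + (x - mu_s) * (sigma_t / sigma_s), per dimension."""
    mu_s, sd_s = src_feats.mean(axis=0), src_feats.std(axis=0) + 1e-8
    mu_t, sd_t = tgt_feats.mean(axis=0), tgt_feats.std(axis=0)
    return mu_t + (x - mu_s) * (sd_t / sd_s)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 3))  # e.g. source-speaker frames
tgt = rng.normal(5.0, 2.0, size=(200, 3))  # target-speaker frames
converted = mean_var_convert(src.mean(axis=0), src, tgt)
```

By construction, a frame at the source mean is mapped exactly to the target mean, which is the behavior the converted spectrograms above exhibit in aggregate.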
Figure 11: Training and testing process of the GAN (a bar generator maps noise z to phrases G(z), conditioned on chords plus style and on chords plus groove).
Figure 12: Tone control model. Training phase: STRAIGHT analysis extracts the fundamental frequency F0 and spectral envelope of the source and target voices; after DTW time alignment, GMM training establishes the mapping rules, with a single-Gaussian model used to calculate the mean and variance. Conversion phase: the voice to be converted is analyzed by STRAIGHT, its MFCC parameters and fundamental frequency F0 are converted, and STRAIGHT synthesis produces the output.
5.2. Melody Control Model
5.2.1. Experimental Process. To evaluate the quality of the melody conversion results, three Beijing Opera pieces were selected for testing and then converted using Only_dura, dura_F0, dura_SP, and all_models; the Beijing Opera produced by the four synthesis methods was compared with the original Beijing Opera. Here Only_dura uses only the duration control model for synthesis; dura_F0 uses the fundamental-frequency control model together with the duration control model; dura_SP uses the duration control model together with the spectrum control model; all_models uses all three control models simultaneously; and 'real' denotes the source Beijing Opera.
The melody control model is summarized in Figure 16.
5.2.2. Experimental Results. The purpose of speech conversion is to make the converted speech sound like the speech of a specific target person. Evaluating the performance of a speech conversion system is therefore also based
Table 3: MOS grading.

Score   Evaluation
1       Uncomfortable and unbearable
2       A sense of discomfort, but it can be endured
3       Distortion detectable, and somewhat uncomfortable
4       Distortion slightly perceptible, but no discomfort
5       Good sound quality, no distortion
Table 4: Experimental results (MOS score).

Method        Beijing Opera 1   Beijing Opera 2   Beijing Opera 3
Only_dura          1.25              1.29              1.02
dura_F0            1.85              1.97              1.74
dura_SP            1.78              2.90              2.44
all_models         3.27              3.69              3.28
real               5                 5                 5
Figure 13: Source speech spectrogram (frequency 0-6000 Hz vs. time).
on human auditory evaluation. In the existing subjective evaluation framework, the MOS test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.
The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and assigns the quality level to which the measured speech belongs according to these 5 levels. An MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. An MOS score below 3.0 is called synthetic speech quality: the speech is highly intelligible, but its naturalness is poor.
Ten testers were recruited to give MOS scores for the synthesis results above; the results are shown in Table 4.
5.3. Synthesis of Beijing Opera. Beijing Opera consists mainly of words and melody, the melody being determined by pitch, tone, sound intensity, sound length, and other factors. In this experiment each word is first located using word-level characteristics such as the zero-crossing rate and energy. The tone control model and the melody control model are then applied: important parameters such as the fundamental frequency, spectrum, and duration are extracted, and the extracted features are converted using MFCC, DTW, GMM, and related tools, finally yielding the synthesized opera fragments.
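The word-segmentation features mentioned above can be computed framewise, as in this generic sketch. The frame length and hop (25 ms and 10 ms at the 16 kHz sampling rate used in the experiment) are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time zero-crossing rate and energy per frame, the two cues
    used to locate word boundaries in the sung phrase."""
    zcr, energy = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # fraction of adjacent-sample pairs whose sign changes
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        energy.append(float(np.sum(frame ** 2)))
    return np.array(zcr), np.array(energy)

zcr, energy = frame_features(np.zeros(1600))  # silence: both stay at zero
```

Sung words then show up as runs of frames with high energy, while the silences between them have near-zero energy and low zero-crossing rate.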
Compared with other algorithms, the STRAIGHT algorithm performs better in terms of the naturalness of the synthesis and the range over which parameters can be modified, so the STRAIGHT algorithm is also chosen for the synthesis of the Beijing Opera.
The ten testers mentioned above were again asked to give MOS scores for this composite result; the outcome is shown in Table 5.
According to the test results, the subjective scores reached an average of 3.7 points, indicating that the design basically accomplishes Beijing Opera synthesis. Although the output of the synthesis system tends toward genuine Beijing Opera, it still differs acoustically from real Beijing Opera.
6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis within the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue this work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric for the quality of a piece; only then will we be able to train models that are
Table 5: Rating results (MOS score).

                           student1  student2  student3  student4  student5
Source Opera fragment          5         5         5         5         5
Synthetic Opera fragment       4         4         4         3         3

                           student6  student7  student8  student9  student10
Source Opera fragment          5         5         5         5         5
Synthetic Opera fragment       4         3         4         4         4
Figure 14: Target speech spectrogram (frequency 0-6000 Hz vs. time).
Figure 15: Converted speech spectrogram (frequency 0-6000 Hz vs. time).
Figure 16: Melody control model (feature extraction from the voice yields the syllable fundamental frequency, spectrum envelope, and syllable duration; the MIDI score supplies the note fundamental frequency and note length; these feed the F0 control model, time-length control model, and spectrum control model, whose outputs are combined for Beijing Opera synthesis).
truly able to compose Beijing Opera singing artworks of higher quality.
Data Availability

The [wav] data of Beijing Opera used to support the findings of this study have been deposited in the [zenodo] repository [http://doi.org/10.5281/zenodo.344932]. The previously reported STRAIGHT algorithm used here is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments

This work is sponsored by (1) the NSFC Key Funding, no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR," no. 3132017XNG1750; and (3) the School Project Funding, no. 2018XNG1857.
References

[1] D. Schwarz, "Corpus-based concatenative synthesis," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 92-104, 2007.
[2] J. Cheng, Y. Huang, and C. Wu, "HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets," Computational Linguistics and Chinese Language Processing, vol. 18, no. 4, pp. 63-80, 2013.
[3] L. Sheng, Speaker Conversion Method Research, doctoral dissertation, South China University of Technology, 2014.
[4] Y. Yang, Chinese Phonetic Transformation System, Master's thesis, Beijing Jiaotong University, 2008.
[5] S. Hasim et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," https://arxiv.org/abs/1507.06947.
[6] B. Tang, Research on Speech Conversion Technology Based on GMM Model, vol. 9, 2017.
[7] J. Bonada and X. Serra, "Synthesis of the singing voice by performance sampling and spectral models," IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67-79, 2007.
[8] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Singing voice synthesis system based on sinusoidal modeling," in Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Part 1 (of 5), pp. 435-438, April 1997.
[9] H. Gu and Z. Lin, "Mandarin singing voice synthesis using ANN vibrato parameter models," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 3288-3293, Kunming, China, July 2008.
[10] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[11] C. Lianhong, J. Hou, R. Liu, et al., "Synthesis of HMM parametric singing based on pitch," in Proceedings of the 5th Joint Conference on Harmonious Human-Machine Environment, Xi'an, 2009.
[12] W. Wanliang and L. Zhuorong, "Advances in generative adversarial network," Journal of Communications, vol. 39, 2018.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[14] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," https://arxiv.org/abs/1511.06434.
[15] "Interpretable representation learning by information maximizing generative adversarial nets."
[16] A. Nguyen et al., "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," https://arxiv.org/abs/1605.09304.
[17] I. Goodfellow et al., Deep Learning, MIT Press, Cambridge, Mass, USA, 2016.
International Journal of
AerospaceEngineeringHindawiwwwhindawicom Volume 2018
RoboticsJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Active and Passive Electronic Components
VLSI Design
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Shock and Vibration
Hindawiwwwhindawicom Volume 2018
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawiwwwhindawicom
Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Control Scienceand Engineering
Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Journal ofEngineeringVolume 2018
SensorsJournal of
Hindawiwwwhindawicom Volume 2018
International Journal of
RotatingMachinery
Hindawiwwwhindawicom Volume 2018
Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Navigation and Observation
International Journal of
Hindawi
wwwhindawicom Volume 2018
Advances in
Multimedia
Submit your manuscripts atwwwhindawicom
10 Advances in Multimedia
Mergingaudio tracks
datasets
Beijing Operasoundtrack
Merge 5 tracksbull huqinsbull flutesbull suonasbull drumsbull cymbals
Beijing Operasoundtrack
Combineddatasets
Screeningmusic
Only select thefollowing sectionbull Best matching
confidencebull The soundtracks
of Beijing operalyrics
Datacleaning
Only select thefollowing sectionbull Best matching
confidencebull The soundtracks
of Beijing operalyrics
Training datasets
Figure 9 Illustration of the dataset preparation and data preprocessing procedure
z~p(z) G )
D realfake
G(z)
X
fake data
real data
random noise Generator
Discriminatorcritic
(wgan-gp)
4-bar phrases of 5 tracks
Figure 10 GAN structure diagram
the performance lsquoD(x)rsquo of the real data so that D cannotdistinguish between generated data and real data Thereforethe optimization process of the module is a process of mutualcompetition and confrontationThe performance of G and Dis continuously improved during repeated iteration until thefinal D(G(z)) is consistent with the performance D(x) of thereal data And both G and D cannot be further optimized
The training process can be modeled as a simpleMinMaxproblem in
minG
max119863
119863 (119909) minus 119863 (119866 (119911)) (42)
The MinMax optimization formula is defined as follows
minqG
max119902119863
119881 (119863119866) = min119866
max119863
119864119909119901119866 [log119863(119909)]+ 119864119909119901119866 [log (1 minus 119863 (119866 (119911)))] (43)
The GAN does not require a pre-set data distributionthat is it does not need to formulate a description ofp(x) but directly adopts it Theoretically it can completelyapproximate real data This is the biggest advantage of theGAN
The training and testing process of the GAN generatedmusic dataset is as in Figure 11
The generator-generated chord section data and specificmusic style data generator-generated multiple track chordsection data andmultiple tracks ofmusic groove data are sentto the GAN for training Reach music that generates specificstyles and corresponding grooves
5 Experiment
51 Tone Control Model
511 Experimental Process The voice library used in theexperiment simulation of this article is recorded by theformer in the environment of the entire anechoic roomand comprehensive consideration of the previous factorscan better meet the actual needs of the speech conversionsystemThe voice library is recorded by a woman in standardMandarin accent and contains numbers professional nounseveryday words etc as source speech Then find anotherperson to record a small number of statements as the voiceto be converted and Figure 12 is the tone conversion process
512 Experimental Results Figures 11 12 and 13 speech ofthe source speaker respectively and the target speaker isbased on the speech spectrogram STRAIGHT and convertspeechGMMmodel obtained All voices are sampled at 16khzand quantized with 16 bits Set the voice to 5s during theexperiment Their MFCCs are in Figures 13 14 and 15
They show the MFCC three-dimensional map of thesource speech the target speech and the converted speechThe horizontal axis represents the audio duration the verticalaxis represents the frequency and the color represents thecorresponding energy From the comparison of the graphsit can be directly seen that the vocalogram shape of theconverted MFCC parameters is closer to the target speechindicating that the converted speech features tend to be thetarget speech features
Advances in Multimedia 11
z
z
G
Gz GzGz Gz
zzzz
z
z
Gz
z
Gz
z
Gz
z
G
Bar Generator
Chords
Style
Chords
Groove
Figure 11 Raining and testing process of the GAN
fundamentalfrequency F0
Spectralenvelope
Source voice
STRAIGHT analysis
fundamentalfrequency F0
Spectralenvelope
TimeAlign-ment
GMM training to establishmapping rules
Single Gaussianmodel method
Calculate the meanand variance
Tone conversion
STRAIGHT synthesis
ConvertedMFCC
Convertedfundamentalfrequency F0
DTW
Source voice
STRAIGHT analysis
MFCC parameter conversion
To be converted voices
STRAIGHT analysis
MFCC parameter conversion
Conversionphase
Training phase
fundamentalfrequency F0
Spectralenvelope
Figure 12 Tone control model
52 Melody Control Model
521 Experimental Process In order to evaluate the qual-ity of the melody conversion results three Beijing Operapieces were selected for testing followed by conversionsusing Only dura dura F0 dura SP and all models andBeijing operas produced by the four synthesis methods werecompared with the original Beijing Opera Among themOnly dura uses only the duration controlmodel for synthesisdura F0 uses only the base frequency control model and theduration control model for synthesis dura SP uses only the
duration control model and the spectrum control model forsynthesis allmodels use three controlmodels simultaneouslyrsquoRealrsquo is the source Beijing Opera
So the melody control model can be summarized inFigure 16
522 Experimental Results The purpose of speech con-version is to make the converted speech sounds like thespeech of a specific target person Therefore evaluating theperformance of the speech conversion system is also based
12 Advances in Multimedia
Table 3 MOS grading
MOS gradingScore MOS Evaluation1 Uncomfortable and unbearable2 There is a sense of discomfort but it can endure3 Can detect distortion and feel uncomfortable4 Slightly perceived distortion but no discomfort5 Good sound quality no distortion
Table 4 Experimental results
Experimental results
ways MOS fractionBeiJing Opera1 BeiJing Opera2 BeiJing Opera3
Only dura 125 129 102dura F0 185 197 174dura SP 178 290 244all models 327 369 328real 5 5 5
0100020003000400050006000
Spec
trogr
amfre
quen
cy (H
Z)
05 302010 15 25 4035
Figure 13 Source speech spectrogram
on human-oriented auditory evaluation In the existing sub-jective evaluation system the MOS score test is an effectivemethod for evaluating the voice quality and the similaritytest is a test method for judging the conversion effect of thesystem
TheMOS scoring criterion divides the speech quality into5 levels see Table 3The tester listens to the converted speechand gives the score of the quality level to which the measuredspeech belongs according to these 5 levels The MOS scoreis called the communication quality at about 35 minutesAt this time the voice quality of the auditory reconstructedvoice is reduced but it does not prevent people from talkingnormally If the MOS score is lower than 30 it is calledsynthetic speech quality At this time the speech has highintelligibility but the naturalness is poor
Find 10 testers and score MOS for the above compositeresults The results are shown in Table 4
53 Synthesis of Beijing Opera Beijing Opera is mainlycomposed of words and melodies The melody is determinedby the pitch tone sound intensity sound length and otherdecisions In this experiment each word is distinguished bythe unique characteristics of words such as zero-crossing rateand energy Then the tone control model and the melody
control model are designed and do extraction for importantparameters of the fundamental frequency spectrum timeand so on using MFCC DTW GMM and other tools toanalyze the extracted characteristic conversion and finally tothe opera synthetic fragment
Compared with other algorithms the straight algorithmhas better performance in terms of the natural degree ofsynthesis and the range of parameter modification so thestraight algorithm is also selected for the synthesis of theBeijing Opera
Again let the above-mentioned 10 testers perform MOSscoring on the above composite effect The result is shown inTable 5
According to the test results it can be seen that thesubjective test results reached an average of 37 pointsindicating that the design basically completed the BeijingOpera synthesis Although the Beijing Opera obtained by thesynthesis system tends to originate in Beijing Opera it is stillacoustically different from the real Beijing Opera
6 Conclusion
In this work we have presented three novel generativemodelsfor Beijing Opera synthesis under the frame work of thestraight algorithm GMM and GAN The objective metricsand the subjective user study show that the proposed modelscan achieve the synthesis of Beijing Opera Given the recententhusiasm in machine learning inspired art we hope tocontinue our work by introducing more complex models anddata representations that effectively capture the underlyingmelodic structure Furthermorewe feel thatmorework couldbe done in developing a better evaluationmetric of the qualityof a piece only then will we be able to train models that are
Advances in Multimedia 13
Table 5 Rating results
MOS Score
score studentsstudent1 student2 student3 student4 student5
Source Opera fragment 5 5 5 5 5Synthetic Opera fragment 4 4 4 3 3
score studentsstudent6 student7 student8 student9 student10
Source Opera fragment 5 5 5 5 5Synthetic Opera fragment 4 3 4 4 4
0
100020003000400050006000
Spec
trogr
amfre
quen
cy (H
Z)
05 302010 15 25 4035 45
Figure 14 Target speech spectrogram
0
100020003000400050006000
Spec
trogr
amfre
quen
cy (H
Z)
3020 4515 25 403505 10
Figure 15 Converted speech spectrogram
Syllable fundamentalfrequency
Spectrum Envelope
Time length control modelvoice Feature extraction
Syllable duration
Note fundamental frequency
Length of note
Spectrum control model
Time length control model
F0 control model
synthesisBeijing Opera
MIDI
Figure 16 Melody control model
truly able to compose the Beijing Opera singing art workswith higher quality
Data Availability
The [wav] data of Beijing Opera used to support thefindings of this study have been deposited in the [zenodo]repository [httpdoiorg105281zenodo344932] The pre-viously reported straight algorithm used is available at
httpwwwwakayama-uacjpsimkawaharaSTRAIGHTadvindex ehtmll The code is available upon request fromkawaharasyswakayama-uacjp
Conflicts of Interest
The authors declare that they have no conflicts of interest
14 Advances in Multimedia
Acknowledgments
This work is sponsored by (1) the NSFC Key Fundingno 61631016 (2) the Cross Project ldquoResearch on 3D AudioSpace and Panoramic Interaction Based on VRrdquo no3132017XNG1750 and (3) the School Project Funding no2018XNG1857
References
[1] D Schwarz ldquoCorpus-based concatenative synthesisrdquo IEEE Sig-nal Processing Magazine vol 24 no 2 pp 92ndash104 2007
[2] J Cheng YHuang andCWu ldquoHMM-basedMandarin singingvoice synthesis using Tailored Synthesis Units and QuestionSetsrdquo Computational Linguistics and Chinese Language Process-ing vol 18 no 4 pp 63ndash80 2013
[3] L Sheng Speaker Conversion Method Research South ChinaUniversity of Technology doctoral dissertation 2014
[4] Y Yang Chinese Phonetic Transformation System [Masterrsquosthesis] Beijing Jiaotong University 2008
[5] S Hasim and etal ldquoFast and accurate recurrent neural networkacoustic models for speech recognitionrdquo httpsarxivorgabs150706947
[6] B Tang Research on Speech Conversion Technology Based onGMMModel vol 9 2017
[7] J Bonada and X Serra ldquoSynthesis of the singing voice by per-formance sampling and spectralmodelsrdquo IEEE Signal ProcessingMagazine vol 24 no 2 pp 67ndash79 2007
[8] M W Macon L Jensen-Link J Oliverio M A Clements andE B George ldquoSinging voice synthesis system based on sinu-soidal modelingrdquo in Proceedings of the 1997 IEEE InternationalConference on Acoustics Speech and Signal Processing ICASSPPart 1 (of 5) pp 435ndash438 April 1997
[9] H Gu and Z Lin ldquoMandarin singing voice synthesis usingANN vibrato parameter modelsrdquo in Proceedings of the 2008International Conference on Machine Learning and Cybernetics(ICMLC) pp 3288ndash3293 Kunming China July 2008
[10] A De Cheveigne and H Kawahara ldquoYIN a fundamentalfrequency estimator for speech and musicrdquo The Journal of theAcoustical Society of America vol 111 no 4 pp 1917ndash1930 2002
[11] C Lianhong J Hou R Liu et al ldquoSynthesis of HMMparametric singing based on pitchrdquo in Proceedings of the 5thJoint Conference on Harmonious Human-machine EnvironmentXirsquoan 2009
[12] W Wanliang and L Zhuorong ldquoAdvances in generative adver-sarial networkrdquo Journal of Communications vol 39 2018
[13] I Goodfellow J Pouget-Abadie M Mirza et al ldquoGenerativeadversarial netsrdquo in Proceedings of the Advances in neuralinformation processing systems pp 2672ndash2680 2014
[14] A Radford L Metz and S Chintala ldquoUnsupervised represen-tation learning with deep convolutional generative adversarialnetworksrdquo httpsarxivorgabs151106434
[15] Interpretable representation learning by information maximiz-ing generative adversarial nets
[16] A Nguyen et al ldquoSynthesizing the preferred inputs forneurons in neural networks via deep generator networksrdquohttpsarxivorgabs160509304
[17] I Goodfellow et al Deep Learning MIT Press CambridgeMass USA 2016
International Journal of
AerospaceEngineeringHindawiwwwhindawicom Volume 2018
RoboticsJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Active and Passive Electronic Components
VLSI Design
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Shock and Vibration
Hindawiwwwhindawicom Volume 2018
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawiwwwhindawicom
Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Control Scienceand Engineering
Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Journal ofEngineeringVolume 2018
SensorsJournal of
Hindawiwwwhindawicom Volume 2018
International Journal of
RotatingMachinery
Hindawiwwwhindawicom Volume 2018
Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Navigation and Observation
International Journal of
Hindawi
wwwhindawicom Volume 2018
Advances in
Multimedia
Submit your manuscripts atwwwhindawicom
Advances in Multimedia 11
z
z
G
Gz GzGz Gz
zzzz
z
z
Gz
z
Gz
z
Gz
z
G
Bar Generator
Chords
Style
Chords
Groove
Figure 11 Raining and testing process of the GAN
fundamentalfrequency F0
Spectralenvelope
Source voice
STRAIGHT analysis
fundamentalfrequency F0
Spectralenvelope
TimeAlign-ment
GMM training to establishmapping rules
Single Gaussianmodel method
Calculate the meanand variance
Tone conversion
STRAIGHT synthesis
ConvertedMFCC
Convertedfundamentalfrequency F0
DTW
Source voice
STRAIGHT analysis
MFCC parameter conversion
To be converted voices
STRAIGHT analysis
MFCC parameter conversion
Conversionphase
Training phase
fundamentalfrequency F0
Spectralenvelope
Figure 12 Tone control model
52 Melody Control Model
521 Experimental Process In order to evaluate the qual-ity of the melody conversion results three Beijing Operapieces were selected for testing followed by conversionsusing Only dura dura F0 dura SP and all models andBeijing operas produced by the four synthesis methods werecompared with the original Beijing Opera Among themOnly dura uses only the duration controlmodel for synthesisdura F0 uses only the base frequency control model and theduration control model for synthesis dura SP uses only the
duration control model and the spectrum control model forsynthesis allmodels use three controlmodels simultaneouslyrsquoRealrsquo is the source Beijing Opera
So the melody control model can be summarized inFigure 16
522 Experimental Results The purpose of speech con-version is to make the converted speech sounds like thespeech of a specific target person Therefore evaluating theperformance of the speech conversion system is also based
12 Advances in Multimedia
Table 3 MOS grading
MOS gradingScore MOS Evaluation1 Uncomfortable and unbearable2 There is a sense of discomfort but it can endure3 Can detect distortion and feel uncomfortable4 Slightly perceived distortion but no discomfort5 Good sound quality no distortion
Table 4 Experimental results
Experimental results
ways MOS fractionBeiJing Opera1 BeiJing Opera2 BeiJing Opera3
Only dura 125 129 102dura F0 185 197 174dura SP 178 290 244all models 327 369 328real 5 5 5
0100020003000400050006000
Spec
trogr
amfre
quen
cy (H
Z)
05 302010 15 25 4035
Figure 13 Source speech spectrogram
on human-oriented auditory evaluation. In the existing subjective evaluation system, the MOS score test is an effective method for evaluating voice quality, and the similarity test is a method for judging the conversion effect of the system.

The MOS scoring criterion divides speech quality into 5 levels; see Table 3. The tester listens to the converted speech and, according to these 5 levels, assigns the quality level to which the measured speech belongs. An MOS score of about 3.5 is called communication quality: the perceived quality of the reconstructed voice is reduced, but it does not prevent people from talking normally. An MOS score below 3.0 is called synthetic speech quality: the speech is highly intelligible, but its naturalness is poor.
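The two quality bands just described (communication quality at roughly 3.5 and above, synthetic speech quality below 3.0) can be expressed as a small helper; this is a minimal sketch of the MOS convention in the text, and the label for the in-between range is our own addition:

```python
def mos_quality_category(score: float) -> str:
    """Map a mean opinion score (MOS) to the quality band described above.

    An MOS of roughly 3.5 or above is 'communication quality': the
    reconstructed voice is degraded but normal conversation is possible.
    Below 3.0 is 'synthetic speech quality': intelligible but unnatural.
    The 'borderline' label for the gap in between is our own.
    """
    if score >= 3.5:
        return "communication quality"
    if score < 3.0:
        return "synthetic speech quality"
    return "borderline"
```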
Ten testers were recruited to give MOS scores for the above synthesis results. The results are shown in Table 4.
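For reference, the per-method averages of the Table 4 scores over the three test pieces can be computed directly; the dictionary below simply transcribes Table 4:

```python
# MOS scores from Table 4, one entry per Beijing Opera test piece.
MOS = {
    "Only dura":  [1.25, 1.29, 1.02],
    "dura F0":    [1.85, 1.97, 1.74],
    "dura SP":    [1.78, 2.90, 2.44],
    "all models": [3.27, 3.69, 3.28],
}

# Mean score per method, rounded to two decimals.
averages = {method: round(sum(s) / len(s), 2) for method, s in MOS.items()}
```

Using all three control models yields the highest average, consistent with each control model contributing to the perceived quality.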
5.3. Synthesis of Beijing Opera. Beijing Opera is mainly composed of words and melodies. The melody is determined by the pitch, tone, sound intensity, sound length, and other factors. In this experiment, each word is distinguished by characteristics unique to words, such as zero-crossing rate and energy. Then the tone control model and the melody control model are designed; the important parameters such as fundamental frequency, spectrum, and duration are extracted; and tools such as MFCC, DTW, and GMM are used to analyze and convert the extracted features, finally yielding the synthesized opera fragments.
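As a rough illustration of the word-separation step, short-time energy and zero-crossing rate can be computed per frame and thresholded. The sketch below is not the authors' implementation: it assumes 16 kHz audio with 25 ms frames and 10 ms hop, and the thresholds are illustrative.

```python
import numpy as np

def short_time_features(signal, frame_len=400, hop=160):
    """Per-frame short-time energy and zero-crossing rate (ZCR), the two
    features the text uses to distinguish words. Frame sizes assume 16 kHz
    audio (25 ms frames, 10 ms hop)."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.array([signal[s:s + frame_len] for s in starts])
    energy = np.sum(frames ** 2, axis=1)  # short-time energy per frame
    # ZCR: fraction of consecutive-sample pairs whose sign changes.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr

def word_mask(energy, zcr, e_thresh, z_thresh):
    """Crude voiced/word-frame detector: high energy and low ZCR."""
    return (energy > e_thresh) & (zcr < z_thresh)
```

A sung vowel shows high energy and low ZCR, while silence or unvoiced noise shows the opposite, which is what makes this pair of features usable for separating words before the control models are applied.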
Compared with other algorithms, the STRAIGHT algorithm performs better in terms of the naturalness of the synthesis and the range of parameter modification, so the STRAIGHT algorithm was also selected for the synthesis of the Beijing Opera.
Again, the above-mentioned 10 testers performed MOS scoring on the synthesis result. The result is shown in Table 5.

According to the test results, the subjective score reached an average of 3.7 points, indicating that the design basically accomplishes Beijing Opera synthesis. Although the Beijing Opera produced by the synthesis system approaches the original, it is still acoustically different from the real Beijing Opera.
Table 5: Rating results (MOS scores from 10 student testers)

Tester                     1   2   3   4   5   6   7   8   9   10
Source Opera fragment      5   5   5   5   5   5   5   5   5   5
Synthetic Opera fragment   4   4   4   3   3   4   3   4   4   4

Figure 14: Target speech spectrogram (spectrogram frequency 0-6000 Hz versus time).

Figure 15: Converted speech spectrogram (spectrogram frequency 0-6000 Hz versus time).

Figure 16: Melody control model. Feature extraction on the input voice yields the syllable fundamental frequency, spectrum envelope, and syllable duration, while the MIDI input provides the note fundamental frequency and note length; these feed the F0 control model, the spectrum control model, and the time-length control model, whose outputs are combined to synthesize the Beijing Opera.

6. Conclusion

In this work we have presented three novel generative models for Beijing Opera synthesis under the framework of the STRAIGHT algorithm, GMM, and GAN. The objective metrics and the subjective user study show that the proposed models can achieve the synthesis of Beijing Opera. Given the recent enthusiasm for machine-learning-inspired art, we hope to continue our work by introducing more complex models and data representations that effectively capture the underlying melodic structure. Furthermore, we feel that more work could be done on developing a better evaluation metric for the quality of a piece; only then will we be able to train models that are truly able to compose Beijing Opera singing artworks with higher quality.
Data Availability
The .wav data of Beijing Opera used to support the findings of this study have been deposited in the Zenodo repository (http://doi.org/10.5281/zenodo.344932). The previously reported STRAIGHT algorithm used is available at http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html. The code is available upon request from kawahara@sys.wakayama-u.ac.jp.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is sponsored by (1) the NSFC Key Funding, no. 61631016; (2) the Cross Project "Research on 3D Audio Space and Panoramic Interaction Based on VR," no. 3132017XNG1750; and (3) the School Project Funding, no. 2018XNG1857.