CVPR 2021 Submission #3760. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.
Supplementary Materials
1. Network Structure Details
Style-specific audio-to-animation generator G_ani. The structural details of G_ani are shown in Figure 1:
• (a)-(e): the five basic blocks in G_ani, with their kernel size, stride, and padding settings.
• (f): the mapping from {I_ref ∈ R^{3×512×512}, f_audio ∈ R^{T×15}} to f_audio^ref ∈ R^{T×256}.
• (g): the translation from f_audio^ref to {p_mou ∈ R^{T×28}, p_ebro ∈ R^{T×5}, p_hed ∈ R^{T×5}}.
• (h): the channel size settings of each block.
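As an illustration of the first basic block, here is a minimal PyTorch sketch of a Res1D block with BatchNorm, following the settings listed in Figure 1 (two Conv1D layers with kernel size 5, stride 1, padding 2, joined by a residual connection). The class name and the 256-channel default are our assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class Res1DBlockBN(nn.Module):
    """Sketch of the Res1D block (BN): Conv1D -> BN -> ReLU -> Conv1D + skip."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=5, stride=1, padding=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stride 1 with "same" padding preserves the temporal length,
        # so the residual add is shape-compatible.
        return x + self.body(x)

block = Res1DBlockBN(256)
out = block(torch.randn(1, 256, 100))
```

Because every layer preserves the temporal length, these blocks can be stacked freely, which matches the "Layer num" repetition counts given in panel (h).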
Animation discriminator. In our experiment, D_mou, D_ebro, and D_hed share the same structure. The structural details are illustrated in Figure 2.
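The per-layer settings in Figure 2 determine how the discriminators compress the temporal axis. A small sketch of the standard Conv1D output-length formula, chained over the six stride-2 layers (kernel 5, padding 2) shown there; the 1000-frame input length is the temporal training window described in Section 2:

```python
def conv1d_out_len(length: int, kernel: int, stride: int, pad: int) -> int:
    # Standard formula for the output length of a 1D convolution.
    return (length + 2 * pad - kernel) // stride + 1

length = 1000  # one temporal training window
for _ in range(6):  # six stride-2 layers (kernel 5, padding 2) in Figure 2
    length = conv1d_out_len(length, kernel=5, stride=2, pad=2)
# 1000 -> 500 -> 250 -> 125 -> 63 -> 32 -> 16
```

The kernel-3/stride-1 and kernel-1 layers that follow preserve length, so each discriminator scores a 1000-frame window at 16 temporal positions.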
3DMM. In our 3DMM, M_0 and {V_i^s}_{i=1}^{60} are preserved as in LSFM [1]. We disentangle the mouth and eyebrow movements, and sculpt 28 mouth bases and 5 eyebrow bases as our {V_j^e}_{j=1}^{33}. The details of the eyebrow and mouth bases are shown in Figure 3 and Figure 4, respectively.
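A hedged sketch of how such a linear 3DMM could be evaluated: the mesh is the mean shape plus weighted sums of the 60 identity bases and the 33 expression bases (28 mouth + 5 eyebrow). The array shapes, function name, and additive formulation below are our assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def reconstruct_mesh(M0, Vs, Ve, p_shape, p_exp):
    """M0: (N, 3) mean mesh; Vs: (60, N, 3) identity bases;
    Ve: (33, N, 3) expression bases (28 mouth + 5 eyebrow);
    p_shape: (60,) identity coefficients; p_exp: (33,) expression coefficients."""
    # Linear combination: mean shape + identity offsets + expression offsets.
    return (M0
            + np.tensordot(p_shape, Vs, axes=1)
            + np.tensordot(p_exp, Ve, axes=1))
```

With all coefficients zero, the result is the neutral mean face; setting a single expression coefficient (e.g. "Jaw open") adds that basis displacement alone.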
Flow-guided video generator G_vid. In G_vid, the structure of the Hourglass network is preserved as in the original paper [2]. Three separate layers after the Hourglass network, each with kernel size = 7, stride = 1, and padding = 3, are employed to compute M_m, g, and M_f. The structure of the encoder-decoder is shown in Figure 5.
Discriminator D_vid. The structure of D_vid is the same as in [3]. In this paper, the number of multi-scale discriminators is set to 1.
2. Implementation Details
In the training stage, G_ani and G_vid are trained separately. In the audio-to-animation module, f_audio is extracted at 100 fps. p_mou, p_ebro, and p_hed are upsampled to 100 fps with linear interpolation. Each sample is a 1000-frame temporal window, with the sliding stride set to 128. We set λ_mou = 100 and λ_ebro = λ_hed = 10. The batch size is 128. The learning rate is initially set to 0.0005; it stays fixed for the first 50 epochs and linearly decays to 0 over another 50 epochs. It takes about 26 hours to train G_ani.
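The two preprocessing steps above can be sketched with numpy, assuming (T, D) parameter arrays: linear interpolation to 100 fps, then splitting into 1000-frame windows with stride 128. The function names and timestamp convention are our assumptions:

```python
import numpy as np

def upsample_to_100fps(params: np.ndarray, src_fps: float) -> np.ndarray:
    # params: (T, D) per-frame animation parameters sampled at src_fps.
    t_src = np.arange(len(params)) / src_fps
    t_dst = np.arange(0.0, t_src[-1], 1.0 / 100.0)
    # np.interp is 1D, so interpolate each parameter dimension separately.
    return np.stack([np.interp(t_dst, t_src, params[:, d])
                     for d in range(params.shape[1])], axis=1)

def sliding_windows(seq: np.ndarray, window: int = 1000, stride: int = 128):
    # Each training sample is one `window`-frame slice of the full sequence.
    return [seq[s:s + window] for s in range(0, len(seq) - window + 1, stride)]
```

A 2000-frame sequence, for instance, yields 8 overlapping 1000-frame samples at stride 128.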
In the animation-to-video module, we set λ_perc = λ_FM = 10. The batch size is set to 2 on two 1080 Ti GPUs. In the first 75 epochs, the learning rate is fixed at 0.0004; in the last 75 epochs, it linearly decays to 0. It takes about two weeks to train G_vid. The Adam optimizer with default settings is used for both G_ani and G_vid.
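The learning-rate schedule for G_vid can be written compactly. A sketch under the assumption of 0-based epoch indexing (the paper does not specify the indexing convention):

```python
def lr_at_epoch(epoch: int, base_lr: float = 4e-4,
                hold: int = 75, decay: int = 75) -> float:
    # Held at base_lr for the first `hold` epochs,
    # then linearly decayed to 0 over the next `decay` epochs.
    if epoch < hold:
        return base_lr
    remaining = max(0, hold + decay - epoch)
    return base_lr * remaining / decay
```

The same shape (hold, then linear decay to 0) applies to G_ani with base_lr = 5e-4 and 50 + 50 epochs.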
In the inference stage, we first extract the face shape parameter p_s and the initial face animation parameters p_mou^init, p_ebro^init, and p_hed^init from I_ref. Then, we synthesize {p_mou, p_ebro, p_hed} from {I_ref, f_audio}. Finally, we synthesize the video frame-by-frame according to the generated animation parameters. Our framework produces videos at about 2 frames per second.
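The two-stage inference pipeline above can be sketched in plain Python. `G_ani` and `G_vid` here stand in for the trained generators and are passed as callables; this simplification (and the function signature) is ours:

```python
def synthesize_video(I_ref, f_audio, G_ani, G_vid):
    # Stage 1: reference image + audio -> animation parameter sequences.
    p_mou, p_ebro, p_hed = G_ani(I_ref, f_audio)
    # Stage 2: render each frame from its animation parameters (~2 fps).
    return [G_vid(I_ref, p_mou[t], p_ebro[t], p_hed[t])
            for t in range(len(p_mou))]
```

Because stage 2 is strictly frame-by-frame, the overall throughput is bounded by the per-frame cost of G_vid.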
In the experiments, to evaluate our audio-to-animation module and animation-to-video module, 5% of the YAD dataset is randomly selected for testing.
3. Demo Video
A video demo is also included in the supplementary material.
4. Ethical Consideration
We strongly advocate using our technology properly. To prevent abuse of our method, anyone who employs it to synthesize fake videos should mark them as "fake video". Fortunately, detecting synthetic and manipulated videos has received much attention and achieved significant progress. As part of our responsibility, we are happy to promote the development of detection methodologies by sharing our dataset and source code for future research.
References
[1] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. International Journal of Computer Vision, 126(2-4):233-254, 2018.
[2] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer, 2016.
[3] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Advances in Neural Information Processing Systems, pages 7137-7147, 2019.
[Figure 1: panels (a)-(e) show the five basic 1D blocks (Res1D block (BN), Res1D block, AdaIN block, Res1D block (Down), Res1D block (Up)), each built from Conv1D layers with the listed kernel/stride/padding; panel (f) shows the fusion of the ID embedding (from pre-trained VGG-Face) with the audio feature extractor; panel (g) shows the mouth branch and the eyebrow/head branch, including the long-time temporal decoder, mouth decoder, eyebrow decoder, and head pose decoder; panel (h) lists the layer counts and channel sizes of each block.]
Figure 1. The structural details of the style-specific animation generator G_ani.
[Figure 2: each discriminator is a Conv1D stack: 1 layer (C_in 43/20 -> 256, kernel 1, stride 1, pad 0) + ReLU; 6 layers (256 -> 256, kernel 5, stride 2, pad 2) + ReLU; 2 layers (256 -> 256, kernel 3, stride 1, pad 1) + ReLU; 1 layer (256 -> 28/5, kernel 1, stride 1, pad 0).]
Figure 2. The structural details of the animation discriminators D_mou, D_ebro, and D_hed.
[Figure 3: the neutral face and the 5 eyebrow bases: Eyebrow Down (left), Eyebrow Down (right), Eyebrow Up (left), Eyebrow Up (right), Eyebrow Up (center).]
Figure 3. The eyebrow bases in the 3DMM.
[Figure 4: the neutral face and the 28 mouth bases: Jaw open, Lip together, Jaw left, Jaw right, Jaw forward, Lip up (left/right), Lip down (left/right), Lip close (up/down), Mouth smile (left/right), Mouth dimple (left/right), Lip stretch (left/right), Mouth frown (left/right), Mouth press (left/right), Lip pucker, Lip funnel, Mouth left, Mouth right, Chin raise (lower/upper), Puff.]
Figure 4. The mouth bases in the 3DMM.
[Figure 5: the encoder-decoder in G_vid. Encoder: Conv2D (3 -> 64, kernel 7, stride 1, pad 3) + BN + ReLU, followed by two downsampling Conv2D stages (64 -> 128 -> 256, kernel 3, stride 2, pad 1) + BN + ReLU, combined with a matting operation. Bottleneck: 2 Conv2D layers (256 -> 256, kernel 3, stride 1, pad 1) + BN + ReLU. Decoder: two linear-upsample (×2) + Conv2D stages (256 -> 128 -> 64, kernel 3, stride 1, pad 1) + BN + ReLU, then Conv2D (64 -> 3, kernel 7, stride 1, pad 3) producing the result image.]
Figure 5. The structural details of the encoder-decoder network in G_vid.