

CVPR 2021 Submission #3760. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Supplementary Materials

1. Network Structure Details

Style-specific audio-to-animation generator Gani. The structural details of Gani are shown in Figure 1.

• (a)-(e): the five basic blocks in Gani, with their kernel size, stride and padding settings.

• (f): the mapping from {Iref ∈ R^{3×512×512}, faudio ∈ R^{T×15}} to faudioref ∈ R^{T×256}.

• (g): the translation from faudioref to {pmou ∈ R^{T×28}, pebro ∈ R^{T×5}, phed ∈ R^{T×5}}.

• (h): the channel size of each block.

Animation discriminator. In our experiments, Dmou, Debro and Dhed share the same structure. The structural details are illustrated in Figure 2.

3DMM. In our 3DMM, the mean shape M0 and the shape bases {V^s_i}_{i=1}^{60} are preserved as in LSFM [1]. We disentangle the mouth and eyebrow movements, and sculpt 28 mouth bases and 5 eyebrow bases as our expression bases {V^e_j}_{j=1}^{33}. The details of the eyebrow and mouth bases are shown in Figure 3 and Figure 4, respectively.
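Putting the preserved and sculpted bases together, the 3DMM is the usual linear combination of the mean shape with the shape and expression bases; the coefficient symbols α_i and β_j below are ours, introduced only for illustration:

```latex
M = M_0 \;+\; \sum_{i=1}^{60} \alpha_i\, V^{s}_{i} \;+\; \sum_{j=1}^{33} \beta_j\, V^{e}_{j}
```

Here the first 28 expression coefficients control the mouth and the last 5 control the eyebrows, matching the disentanglement described above.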

Flow-guided video generator Gvid. In Gvid, the structure of the Hourglass network is preserved as in the original paper [2]. Three separate layers after the Hourglass network, each with kernel size = 7, stride = 1 and padding = 3, are employed to compute Mm, g and Mf. The structure of the encoder-decoder is shown in Figure 5.
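Note that the kernel = 7, stride = 1, padding = 3 setting preserves spatial resolution, so Mm, g and Mf share the resolution of the Hourglass feature map. A quick sanity check with the standard convolution output-size formula (a sketch, not the authors' code):

```python
def conv_out(size, kernel, stride, padding):
    """Standard convolution output-size formula: floor((H + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# kernel = 7, stride = 1, padding = 3 keeps the spatial size unchanged.
print(conv_out(512, kernel=7, stride=1, padding=3))  # → 512
```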

Discriminator Dvid. The structure of Dvid is the same as in [3]. In this paper, the number of multi-scale discriminators (multi-D) is set to 1.

2. Implementation Details

In the training stage, Gani and Gvid are trained separately. In the audio-to-animation module, faudio is extracted at 100 fps. pmou, pebro and phed are up-sampled to 100 fps with linear interpolation. 1000 frames are split as one sample by the temporal window, and the sliding stride is set to 128. We set λmou = 100 and λebro = λhed = 10. The batch size is 128. The learning rate is initially set to 0.0005, stays fixed for the first 50 epochs, and then linearly decays to 0 within another 50 epochs. It takes about 26 hours to train Gani.
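The up-sampling and temporal-window splitting above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (the source frame rate of 25 fps and the helper names are ours; only the 100 fps target, the 1000-frame window and the stride of 128 come from the text):

```python
import numpy as np

def upsample_params(params, src_fps, dst_fps=100):
    """Linearly interpolate a (T, D) parameter sequence from src_fps to dst_fps."""
    T = params.shape[0]
    src_t = np.arange(T) / src_fps
    dst_t = np.arange(int(T * dst_fps / src_fps)) / dst_fps
    return np.stack([np.interp(dst_t, src_t, params[:, d])
                     for d in range(params.shape[1])], axis=1)

def split_windows(seq, window=1000, stride=128):
    """Split a (T, D) sequence into overlapping training samples."""
    return [seq[s:s + window] for s in range(0, len(seq) - window + 1, stride)]

pmou = np.random.randn(250, 28)              # e.g. 10 s of mouth parameters at 25 fps
pmou_100 = upsample_params(pmou, src_fps=25) # (1000, 28) at 100 fps
samples = split_windows(pmou_100)            # one 1000-frame sample here
```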

In the animation-to-video module, we set λperc = λFM = 10. The batch size is set to 2 on two 1080Ti GPUs. In the first 75 epochs, the learning rate is fixed at 0.0004; in the last 75 epochs, it linearly decays to 0. It takes about two weeks to train Gvid. The Adam optimizer with its default settings is used for both Gani and Gvid.
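Both training stages use the same shape of learning-rate schedule: a constant phase followed by a linear decay to zero. A sketch of the rule, parameterized so it covers both the 50+50-epoch Gani schedule and the 75+75-epoch Gvid schedule (the function name is ours):

```python
def learning_rate(epoch, base_lr, fixed_epochs, decay_epochs):
    """Constant for `fixed_epochs` epochs, then linear decay to 0 over `decay_epochs`."""
    if epoch < fixed_epochs:
        return base_lr
    frac = (epoch - fixed_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - frac)

# Gani: 0.0005 fixed for 50 epochs, then decays to 0 over another 50.
assert learning_rate(0, 0.0005, 50, 50) == 0.0005
assert learning_rate(100, 0.0005, 50, 50) == 0.0
# Gvid: 0.0004 fixed for 75 epochs, then decays to 0 over another 75.
assert learning_rate(74, 0.0004, 75, 75) == 0.0004
```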

In the inference stage, we first extract the face shape parameter ps and the initial face animation parameters pmou_init, pebro_init and phed_init from Iref. Then, we synthesize {pmou, pebro, phed} from {Iref, faudio}. Finally, we synthesize the video frame-by-frame according to the generated animation parameters. Our framework produces videos at about 2 frames per second.

In the experiments, to evaluate our audio-to-animation module and animation-to-video module, 5% of the YAD dataset is randomly selected for testing.

3. Demo Video

A video demo is also included in the supplementary material.

4. Ethical Considerations

We strongly advocate the proper use of our technology. To prevent abuse of our method, anyone who employs it to synthesize fake videos should mark them as "fake video". Fortunately, detecting synthetic and manipulated videos has received much attention and made significant progress. As part of our responsibility, we are happy to promote the development of detection methodologies by sharing our dataset and source code for future research.

References

[1] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. International Journal of Computer Vision, 126(2-4):233–254, 2018.

[2] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[3] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Advances in Neural Information Processing Systems, pages 7137–7147, 2019.


[Figure 1 appears here. It diagrams: (a) the Res1D block (BN) and (b) the Res1D block, both built from Conv1D layers (kernel 5, stride 1, padding 2) with ReLU and a residual add; (c) the AdaIN block, whose mean and std are predicted by FC+ReLU layers; (d) the Res1D block (Down), with a stride-2 Conv1D (kernel 3, padding 1); (e) the Res1D block (Up), with ×2 linear-interpolation upsampling; (f) the mapping built from the pre-trained VGG-Face ID embedding and the audio feature extractor (Conv1D, kernel 1, 15→256 channels); and (g) the mouth branch and eyebrow/head branch, with the long-time temporal decoder and the mouth (256→28), eyebrow (256→5) and head pose (256→5) decoders.]

Figure 1. The structural details of the style-specific animation generator Gani.
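From panel (c) of Figure 1, the AdaIN block modulates the input with a mean and std predicted by FC layers from the ID embedding. A minimal NumPy sketch of that modulation as we read it from the diagram (the function and variable names are ours, and the predicted statistics are stubbed with constants):

```python
import numpy as np

def adain_1d(x, mean, std, eps=1e-5):
    """Adaptive instance normalization over the temporal axis.

    x:    (C, T) feature sequence
    mean: (C, 1) style mean, predicted by FC layers from the ID embedding
    std:  (C, 1) style std, predicted likewise
    """
    x_norm = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + eps)
    return std * x_norm + mean

x = np.random.randn(256, 1000)
# Stub statistics: every channel is shifted to mean 1 and scaled to std 2.
out = adain_1d(x, mean=np.ones((256, 1)), std=2 * np.ones((256, 1)))
```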


[Figure 2 appears here. It diagrams a stack of Conv1D layers: 1 layer (C_in 43/20, C_out 256, kernel 1, stride 1, pad 0) + ReLU; 6 layers (C 256→256, kernel 5, stride 2, pad 2) + ReLU; 2 layers (C 256→256, kernel 3, stride 1, pad 1) + ReLU; and 1 output layer (C_in 256, C_out 28/5, kernel 1, stride 1, pad 0).]

Figure 2. The structural details of the animation discriminators Dmou, Debro and Dhed.
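Per Figure 2, each discriminator stacks six stride-2 Conv1D layers (kernel 5, padding 2), which downsamples a 1000-frame temporal window by roughly 2^6. A sketch of that arithmetic (only the layer count and settings come from the figure; the input channel counts 43/20 presumably concatenate the 28/5 animation parameters with the 15-channel audio feature, which is our inference):

```python
def conv1d_out(T, kernel=5, stride=2, padding=2):
    """Output length of one stride-2 Conv1D layer from Figure 2."""
    return (T + 2 * padding - kernel) // stride + 1

T = 1000                  # one temporal window
for _ in range(6):        # Layer_num = 6, stride = 2
    T = conv1d_out(T)
print(T)  # → 16
```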

[Figure 3 appears here, showing the neutral face and the five eyebrow bases: Eyebrow Down (left), Eyebrow Down (right), Eyebrow Up (left), Eyebrow Up (right), Eyebrow Up (center).]

Figure 3. The eyebrow bases in 3DMM.

[Figure 4 appears here, showing the neutral face and the 28 mouth bases: Jaw open, Lip together, Jaw left, Jaw right, Jaw forward, Lip up (left), Lip up (right), Lip down (left), Lip down (right), Lip close (up), Lip close (down), Mouth smile (left), Mouth smile (right), Mouth dimple (left), Mouth dimple (right), Lip stretch (left), Lip stretch (right), Mouth frown (left), Mouth frown (right), Mouth press (left), Mouth press (right), Lip pucker, Lip funnel, Mouth left, Mouth right, Chin raise (lower), Chin raise (upper), Puff.]

Figure 4. The mouth bases in 3DMM.



[Figure 5 appears here. It diagrams the encoder-decoder: a Conv2D layer (C 3→64, kernel 7, stride 1, pad 3) + BN+ReLU; two parallel branches of 2 Conv2D layers (C 64→128→256, kernel 3, stride 2, pad 1) + BN+ReLU, merged by a matting operation; 2 Conv2D layers (C 256→256→256, kernel 3, stride 1, pad 1) + BN+ReLU; 2 Up+Conv2D layers (linear upsample ×2, C 256→128→64, kernel 3, stride 1, pad 1) + BN+ReLU; and a final Conv2D layer (C 64→3, kernel 7, stride 1, pad 3) producing the result image.]

Figure 5. The structural details of the encoder-decoder network in Gvid.


