

Supplementary Materials

1. Network Structure Details

Style-specific audio-to-animation generator $G_{ani}$. The structural details of $G_{ani}$ are shown in Figure 1:

• (a)-(e): the five basic blocks in $G_{ani}$, with their kernel size, stride, and padding settings (a minimal sketch of one such block is given after this list).

• (f): the mapping from $\{I_{ref} \in \mathbb{R}^{3 \times 512 \times 512}, f^{audio} \in \mathbb{R}^{T \times 15}\}$ to $\hat{f}^{audio}_{ref} \in \mathbb{R}^{T \times 256}$.

• (g): the translation from $\hat{f}^{audio}_{ref}$ to $\{p^{mou} \in \mathbb{R}^{T \times 28}, p^{ebro} \in \mathbb{R}^{T \times 5}, p^{hed} \in \mathbb{R}^{T \times 5}\}$.

• (h): the channel size of each block.
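To make the block definitions concrete, here is a minimal PyTorch sketch of the Res1D block with BatchNorm used throughout $G_{ani}$. The kernel/stride/padding values (5/1/2) and the 256-channel width come from Figure 1; the exact layer ordering inside the block and the class name are our reading of the figure, not the authors' released code.

    import torch
    import torch.nn as nn

    class Res1DBlockBN(nn.Module):
        """Res1D block (BN) from Figure 1: two Conv1D layers with kernel
        size 5, stride 1, padding 2, plus BatchNorm, ReLU, and a residual
        Add. The internal layer ordering is our assumption."""

        def __init__(self, channels: int = 256):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
                nn.Conv1d(channels, channels, kernel_size=5, stride=1, padding=2),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual connection: add the block input back onto the conv output.
            return x + self.body(x)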

Animation discriminator. In our experiment, $D_{mou}$, $D_{ebro}$, and $D_{hed}$ share the same structure. The structural details are illustrated in Figure 2.
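Reading Figure 2, the shared discriminator body is a plain stack of Conv1D + ReLU layers. The sketch below reproduces those settings; the input channel count (43 for the mouth branch, 20 for the eyebrow/head branches) and output count (28 or 5) are taken from the figure, while the factory-function form is our own.

    import torch.nn as nn

    def make_animation_discriminator(in_ch: int, out_ch: int) -> nn.Sequential:
        """Conv1D stack from Figure 2, shared by D_mou (43 -> 28) and
        D_ebro / D_hed (20 -> 5). A sketch, not the authors' code."""
        layers = [nn.Conv1d(in_ch, 256, kernel_size=1, stride=1, padding=0),
                  nn.ReLU(inplace=True)]
        # Six strided layers halve the temporal resolution at each step.
        for _ in range(6):
            layers += [nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2),
                       nn.ReLU(inplace=True)]
        # Two stride-1 refinement layers.
        for _ in range(2):
            layers += [nn.Conv1d(256, 256, kernel_size=3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        # Final projection to the animation-parameter dimension, no ReLU.
        layers.append(nn.Conv1d(256, out_ch, kernel_size=1, stride=1, padding=0))
        return nn.Sequential(*layers)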

3DMM. In our 3DMM, $M_0$ and $\{V^s_i\}_{i=1}^{60}$ are preserved as in LSFM [1]. We disentangle the mouth and eyebrow movements, and sculpt 28 mouth bases and 5 eyebrow bases as our $\{V^e_j\}_{j=1}^{33}$. The eyebrow and mouth bases are shown in Figure 3 and Figure 4, respectively.
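For completeness: under the standard linear-blendshape formulation that this setup implies (the paper does not write the equation out, so this is our reading), a face mesh $M$ is the mean shape plus weighted sums of the shape and expression bases,

$$M = M_0 + \sum_{i=1}^{60} p^s_i \, V^s_i + \sum_{j=1}^{33} p^e_j \, V^e_j,$$

where $p^s \in \mathbb{R}^{60}$ is the shape parameter and $p^e \in \mathbb{R}^{33}$ concatenates the 28 mouth coefficients $p^{mou}$ and the 5 eyebrow coefficients $p^{ebro}$.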

Flow-guided video generator $G_{vid}$. In $G_{vid}$, the structure of the Hourglass network is preserved as in the original paper [2]. Three separate layers after the Hourglass network, each with kernel size = 7, stride = 1, and padding = 3, are employed to compute $M_m$, $g$, and $M_f$. The structure of the encoder-decoder is shown in Figure 5.
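A minimal sketch of the three output heads after the Hourglass network follows. The 7/1/3 convolution settings are stated in the text; the input channel count and the per-head output channels (a 2-channel flow $g$ and single-channel masks $M_m$, $M_f$) are placeholders we assume for illustration.

    import torch
    import torch.nn as nn

    class FlowHeads(nn.Module):
        """Three separate conv layers after the Hourglass network that
        compute M_m, g, and M_f. Kernel/stride/padding follow the paper;
        the channel counts are assumed."""

        def __init__(self, in_ch: int = 256):
            super().__init__()
            def head(out_ch: int) -> nn.Conv2d:
                return nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=1, padding=3)
            self.motion_mask = head(1)  # M_m: assumed 1-channel mask
            self.flow = head(2)         # g: assumed 2-channel dense flow
            self.fusion_mask = head(1)  # M_f: assumed 1-channel mask

        def forward(self, feat: torch.Tensor):
            return self.motion_mask(feat), self.flow(feat), self.fusion_mask(feat)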

Discriminator $D_{vid}$. The structure of $D_{vid}$ is the same as in [3]. In this paper, the number of discriminators in the multi-discriminator setup is set to 1.

2. Implementation Details

In the training stage, $G_{ani}$ and $G_{vid}$ are trained separately. In the audio-to-animation module, $f^{audio}$ is extracted at 100 fps. $p^{mou}$, $p^{ebro}$, and $p^{hed}$ are up-sampled to 100 fps with linear interpolation. A temporal window splits the sequences into samples of 1000 frames each, with a slide stride of 128 (sketched below). We set $\lambda_{mou} = 100$ and $\lambda_{ebro} = \lambda_{hed} = 10$. The batch size is 128. The learning rate is initially set to 0.0005; it stays fixed for the first 50 epochs and then linearly decays to 0 over another 50 epochs. It takes about 26 hours to train $G_{ani}$.
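The temporal-window sampling described above can be sketched as follows; the window length (1000 frames at 100 fps) and slide stride (128) come from the text, while the handling of the trailing partial window is our choice.

    import numpy as np

    def split_into_samples(seq: np.ndarray, window: int = 1000, stride: int = 128):
        """Cut a (T, C) parameter/feature sequence into fixed-length
        training samples with a sliding temporal window; any trailing
        frames shorter than one window are dropped."""
        return [seq[s:s + window] for s in range(0, len(seq) - window + 1, stride)]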

In the animation-to-video module, we set $\lambda_{perc} = \lambda_{FM} = 10$. The batch size is set to 2 on two 1080 Ti GPUs. In the first 75 epochs, the learning rate is fixed at 0.0004; in the last 75 epochs, it linearly decays to 0. It takes about two weeks to train $G_{vid}$. The Adam optimizer with default settings is used for both $G_{ani}$ and $G_{vid}$. Both modules thus share the same fixed-then-linear-decay schedule, sketched below.
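A minimal sketch of that schedule via PyTorch's LambdaLR: the (fixed, decay) epoch counts are (50, 50) at lr 5e-4 for $G_{ani}$ and (75, 75) at lr 4e-4 for $G_{vid}$, as stated in the text; the implementation itself is our own.

    import torch

    def make_scheduler(optimizer, fixed_epochs: int, decay_epochs: int):
        """Keep the learning rate constant for fixed_epochs, then decay it
        linearly to 0 over decay_epochs (stepped once per epoch)."""
        def factor(epoch: int) -> float:
            if epoch < fixed_epochs:
                return 1.0
            return max(0.0, 1.0 - (epoch - fixed_epochs) / decay_epochs)
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=factor)

    # e.g. for G_ani, with Adam at its default betas as in the paper:
    # opt = torch.optim.Adam(G_ani.parameters(), lr=5e-4)
    # sched = make_scheduler(opt, fixed_epochs=50, decay_epochs=50)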

In the inference stage, we first extract the face shape parameter $p^s$ and the initial face animation parameters $p^{mou}_{init}$, $p^{ebro}_{init}$, and $p^{hed}_{init}$ from $I_{ref}$. Then, we synthesize $\{\hat{p}^{mou}, \hat{p}^{ebro}, \hat{p}^{hed}\}$ from $\{I_{ref}, f^{audio}\}$. Finally, we synthesize the video frame by frame according to the generated animation parameters. Our framework produces videos at about 2 frames per second.
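The three-step inference procedure can be summarized in pseudocode; every helper name below is a hypothetical placeholder, and only the data flow is taken from the paper.

    def synthesize_video(I_ref, f_audio):
        """End-to-end inference; placeholder names, data flow per the paper."""
        # Step 1: shape and initial animation parameters from the reference image.
        p_s = extract_shape(I_ref)              # hypothetical helper
        p_init = extract_init_animation(I_ref)  # hypothetical helper

        # Step 2: animation parameter sequences from the audio and reference.
        p_mou, p_ebro, p_hed = G_ani(I_ref, f_audio)

        # Step 3: render frame by frame (about 2 fps with this framework).
        return [G_vid(I_ref, p_s, p_init, p_mou[t], p_ebro[t], p_hed[t])
                for t in range(len(p_mou))]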

In the experiments, to evaluate our audio-to-animation and animation-to-video modules, 5% of the YAD dataset is randomly selected for testing.
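As a small sketch of that evaluation split (the 5% ratio is from the paper; the seeding and the granularity of what counts as one item are our assumptions):

    import random

    def split_yad(items, test_ratio: float = 0.05, seed: int = 0):
        """Randomly hold out a test set; returns (train, test)."""
        items = list(items)
        random.Random(seed).shuffle(items)
        n_test = max(1, int(len(items) * test_ratio))
        return items[n_test:], items[:n_test]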

3. Demo Video

A video demo is also included in the supplementary material.

4. Ethical Consideration

We strongly advocate using our technology properly. To prevent abuse of our method, anyone who employs it to synthesize fake videos should mark them as "fake video". Fortunately, detecting synthetic and manipulated videos has received much attention and seen much progress. As part of our responsibility, we are happy to promote the development of detection methodologies by sharing our dataset and source code for future research.

References

[1] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. International Journal of Computer Vision, 126(2-4):233–254, 2018.

[2] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[3] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Advances in Neural Information Processing Systems, pages 7137–7147, 2019.


[Figure 1. The structural details of the style-specific animation generator $G_{ani}$. Panels (a)-(e) show the five basic blocks (Res1D block, Res1D block (BN), Res1D block (Down), Res1D block (Up), and AdaIN block) with their Conv1D kernel/stride/padding settings; panel (f) shows the ID embedding from a pre-trained VGG-Face fused with the output of the audio feature extractor; panel (g) shows the mouth branch and the eyebrow/head branch with the long-time temporal decoder and the mouth, eyebrow, and head pose decoders.]



[Figure 2. The structural details of the animation discriminators $D_{mou}$, $D_{ebro}$, and $D_{hed}$: a Conv1D stack with one layer ($C_{in}$ = 43/20, $C_{out}$ = 256, kernel 1), six layers (256→256, kernel 5, stride 2, padding 2), two layers (256→256, kernel 3, stride 1, padding 1), and one output layer (256→28/5, kernel 1); every layer except the last is followed by ReLU.]

[Figure 3. The eyebrow bases in our 3DMM, shown relative to the neutral face: Eyebrow Down (left), Eyebrow Down (right), Eyebrow Up (left), Eyebrow Up (right), and Eyebrow Up (center).]

[Figure 4. The 28 mouth bases in our 3DMM, shown relative to the neutral face: Jaw open, Lip together, Jaw left, Jaw right, Jaw forward, Lip up (left/right), Lip down (left/right), Lip close (up/down), Mouth smile (left/right), Mouth dimple (left/right), Lip stretch (left/right), Mouth frown (left/right), Mouth press (left/right), Lip pucker, Lip funnel, Mouth left, Mouth right, Chin raise (lower/upper), and Puff.]



[Figure 5. The structural details of the encoder-decoder network in $G_{vid}$: a 7×7 input Conv2D (3→64), strided 3×3 encoder Conv2D layers (64→128→256), a matting operation, 3×3 bottleneck Conv2D layers (256→256), linear-upsample + 3×3 Conv2D decoder layers (256→128→64), and a final 7×7 Conv2D (64→3) producing the result image; every layer except the last uses BN + ReLU.]
