
Animating Your Life: Real-Time Video-to-Animation Translation∗

Yang Chen, Yingwei Pan, Ting Yao, Xinmei Tian and Tao Mei
University of Science and Technology of China, Hefei, China
JD AI Research, Beijing, China
[email protected]; {panyw.ustc,tingyao.ustc}@gmail.com; [email protected]; [email protected]

ABSTRACT
We demonstrate a video-to-animation translator, which can transform real-world video into cartoon or ink-wash animation in real time. When users upload a video or record what they are seeing with a phone, the video-to-animation translator renders the live streaming video in cartoon or ink-wash animation style while maintaining the original contents. We formulate this task as a video-to-video translation problem in the absence of any paired training examples, since manually labeling such paired video-animation data is costly and even unrealistic in practice. Technically, a unified unpaired video-to-video translator is utilized to explore both appearance structure and temporal continuity in video synthesis. As such, not only the visual appearance in each frame but also the motion between consecutive frames is ensured to be realistic and consistent for video translation. Based on these technologies, our demonstration can be conducted on any videos in the wild and supports live video-to-animation translation, which engages users with the animated artistic expression of their life.

CCS CONCEPTS
• Information systems → Multimedia content creation; • Computing methodologies → Vision for robotics; Motion capture.

KEYWORDS
Video-to-Video Translation; GANs; Unsupervised Learning

ACM Reference Format:
Yang Chen, Yingwei Pan, Ting Yao, Xinmei Tian and Tao Mei. 2019. Animating Your Life: Real-Time Video-to-Animation Translation. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3343031.3350593

1 INTRODUCTION
Animation is a pervasive artistic expression in our daily life, encompassing a variety of animated types with non-realistic or semi-realistic styles. For example, cartoon animation is a general animation style for entertainment, commercial, and educational purposes.

∗This work was performed at JD AI Research.


Figure 1: The workflow of our video-to-animation translator (live streaming video → cartoon animation / ink-wash animation).

Moreover, ink-wash animation is a specific animation style which combines the traditional Chinese aesthetics of Shui-mo with modern animation techniques. Nevertheless, manually creating animation is costly and demands experts with sufficient artistic skills. Recently, researchers [4] have strived to automatically transfer the style (e.g., colors and textures) of a reference image to an input video, which is called video style transfer. However, the extension from such video style transfer to video-to-animation is not trivial, especially when the cartoon/ink-wash animation style is a generic style derived from a collection of cartoon/ink-wash animations instead of a specific style in a single image.

Inspired by unpaired image-to-image translation [6–8, 14–16], which trains image translators across domains with unpaired data, we go one step further and formulate video-to-animation translation as an unpaired video-to-video translation problem. As such, a general-purpose video translation from the real-world domain to the animation domain is enabled in the absence of paired training data. In this demo, we present such a video-to-animation translator, and Figure 1 depicts its workflow. In particular, users first capture live streaming videos via mobile devices. The recorded video is then transmitted to the client of the video-to-animation translator, where our system transforms the real-world video into cartoon or ink-wash animation. Finally, the generated cartoon or ink-wash animation is displayed to users. Our video-to-animation translator provides real-time rendering of live streaming video in cartoon or ink-wash animation style, engaging users with the animated artistic expression of what they are seeing.

Our video-to-animation translator is novel in enabling automatic translation from real-world videos into cartoon or ink-wash animation. To the best of our knowledge, this work represents the first effort towards this target in the multimedia research community. In addition, a real-time unpaired video-to-video translation framework is devised, which ensures that both the visual appearance in each frame and the motion between consecutive frames are realistic and consistent for video translation.


Figure 2: Training stage of our video-to-animation translator. A real video in the real-world domain (X) is mapped by GX to a fake video in the cartoon domain (Y) and back by GY to a reconstructed video; the discriminator DY judges real or fake frames, while optical flow and warping link consecutive frames through the temporal loss, the frame cycle consistency loss, and the motion cycle consistency loss.

2 VIDEO-TO-ANIMATION TRANSLATOR
2.1 Datasets
To substantially train and evaluate our system, we construct a new large video dataset for video-to-animation translation. The dataset consists of videos from three different domains: real-world video, cartoon animation, and ink-wash animation. For real-world video, we collect 200 minutes of videos from YouTube, which are mainly captured by mobile phones or GoPro cameras. Moreover, 200 minutes of cartoon animations (i.e., Doraemon and Your Name) and 100 minutes of ink-wash animations (i.e., Feelings of Mountains and Waters, Buffalo Boy and the Flute, Snipe-Clam Grapple, and Lu Ling) are taken as training videos in cartoon and ink-wash animation style, respectively. Note that we exclude the opening and ending of each animation to reduce training noise.
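The paper does not detail its preprocessing pipeline, so the following is only a rough sketch, under an assumed directory layout and frame resolution, of how pairs of consecutive frames could be sampled from each domain's videos with OpenCV for the training described next.

```python
# Rough sketch (assumptions: directory layout, *.mp4 files, 256x256 frames) of
# sampling consecutive-frame pairs from one domain's videos with OpenCV.
import glob
import random
import cv2

def sample_frame_pairs(video_dir, num_pairs=1000, size=(256, 256)):
    """Return up to num_pairs (frame_t, frame_t1) pairs from videos in video_dir."""
    paths = glob.glob(f"{video_dir}/*.mp4")
    pairs = []
    for _ in range(num_pairs * 3):  # bounded number of sampling attempts
        if not paths or len(pairs) >= num_pairs:
            break
        cap = cv2.VideoCapture(random.choice(paths))
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        if total >= 2:
            cap.set(cv2.CAP_PROP_POS_FRAMES, random.randint(0, total - 2))
            ok1, f1 = cap.read()
            ok2, f2 = cap.read()
            if ok1 and ok2:
                pairs.append((cv2.resize(f1, size), cv2.resize(f2, size)))
        cap.release()
    return pairs

# e.g., real_pairs = sample_frame_pairs("data/real_world")
#       cartoon_pairs = sample_frame_pairs("data/cartoon")
```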

2.2 Training Stage
Given a video taken from a live camera or uploaded by an end user, our goal is to transform it into cartoon or ink-wash animation. Inspired by the recent success of Cycle-GAN [15, 16] in unpaired image-to-image translation and of temporal coherence/dynamics exploration in video understanding [2, 9, 10, 12], we formulate our unpaired video-to-video translation model in a cyclic paradigm which enforces the learnt mappings to be cycle consistent on both frames and motion. The whole training architecture of our model is illustrated in Figure 2. Specifically, our model consists of two generators (GX, GY) to synthesize frames across domains, and two discriminators (DX, DY) which distinguish real frames from synthetic ones in each domain. Given two consecutive frames in the real-world domain, we first translate them into synthetic frames in the cartoon domain via GX, which are further transformed into reconstructed frames through the inverse mapping GY. In addition, two optical flow images, between the consecutive input frames and between the reconstructed frames, are obtained by capitalizing on FlowNet [5] to represent the motion before and after the forward cycle, respectively.
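The forward cycle described above can be sketched in a few lines of PyTorch-style code. The module names (G_X, G_Y, D_Y) follow Figure 2, but the tiny convolutional nets and the flow estimator standing in for FlowNet2 are placeholders for illustration, not the authors' actual architectures.

```python
# Sketch of the forward translation cycle (real-world X -> cartoon Y -> X).
# The tiny conv nets below are placeholders for the real generators,
# discriminator, and FlowNet2 used in the paper.
import torch
import torch.nn as nn

def tiny_net(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_ch, 3, padding=1),
    )

G_X = tiny_net(3, 3)   # real-world -> cartoon
G_Y = tiny_net(3, 3)   # cartoon -> real-world (inverse mapping)
D_Y = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2), nn.Sigmoid())  # real/fake in Y
flow_net = tiny_net(6, 2)  # placeholder for FlowNet2: two frames -> 2-channel flow

x_t, x_t1 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)  # consecutive real frames

# Forward cycle: translate both frames into the cartoon domain and back.
y_t, y_t1 = G_X(x_t), G_X(x_t1)          # synthetic cartoon frames
x_t_rec, x_t1_rec = G_Y(y_t), G_Y(y_t1)  # reconstructed real-world frames

# Motion before and after the cycle, estimated from frame pairs.
flow_in = flow_net(torch.cat([x_t, x_t1], dim=1))
flow_rec = flow_net(torch.cat([x_t_rec, x_t1_rec], dim=1))

real_or_fake = D_Y(y_t)  # discriminator score on the synthetic cartoon frame
```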

Figure 3: Inference of our video-to-animation translator (Gcartoon or Gink-wash is applied frame-by-frame to produce cartoon or ink-wash animation).

Three kinds of spatial/temporal constraints are mainly leveraged to train our video-to-animation translator: 1) Adversarial Constraint ensures that each synthetic frame is realistic in appearance through adversarial learning. As in image/video generation [3, 11, 13], the generators and discriminators are adversarially trained in a two-player minimax game. 2) Frame and Motion Cycle Consistency Constraints encourage an inverse translation on both frames and motions. Specifically, the frame cycle consistency constraint is exploited to penalize the difference between the primary input frame and its reconstructed frame. Besides, we extend such cycle consistency from a single frame to the motion between consecutive frames in the scenario of unpaired video translation. In this way, the estimated optical flow between reconstructed consecutive frames is enforced to be similar to the primary optical flow between the input consecutive frames. 3) Temporal Constraint directly warps the synthetic frame with the source motion into the subsequent frame, aiming to enforce pixel-wise temporal consistency in the cartoon domain. Note that to further enhance the cartoon style in video translation, we follow [1] and integrate an edge-promoting adversarial loss and a content loss for training. The former encourages the generation of cartoon videos with clear edges, and the latter enforces the generated cartoon videos to retain the semantic contents of the input videos.
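A rough sketch of these three constraints is given below, re-declaring placeholder tensors so it runs on its own. The least-squares GAN form, the L1 distances, the grid_sample-based warp, and the loss weights are assumptions for illustration rather than the authors' exact formulation.

```python
# Sketch of the training losses: adversarial, frame/motion cycle consistency,
# and temporal warping. Placeholder tensors stand in for the quantities
# produced by the forward cycle; forms and weights are assumptions.
import torch
import torch.nn.functional as F

B, H, W = 1, 256, 256
x_t, x_t1 = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)          # input frames
y_t, y_t1 = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)          # synthetic cartoon frames
x_t_rec, x_t1_rec = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)  # reconstructions
flow_in, flow_rec = torch.rand(B, 2, H, W), torch.rand(B, 2, H, W)  # optical flows
d_fake = torch.rand(B, 1, 127, 127)                                 # D_Y(y_t) scores

def warp(frame, flow):
    """Backward-warp `frame` by `flow` (in pixels) using grid_sample."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0) + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # normalise to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(frame, grid, align_corners=True)

loss_adv = ((d_fake - 1) ** 2).mean()               # generator fools D_Y (LSGAN form assumed)
loss_frame_cyc = F.l1_loss(x_t_rec, x_t) + F.l1_loss(x_t1_rec, x_t1)
loss_motion_cyc = F.l1_loss(flow_rec, flow_in)      # motion should survive the cycle
loss_temporal = F.l1_loss(warp(y_t, flow_in), y_t1) # warped synthetic frame vs. next frame

loss_G = loss_adv + 10.0 * loss_frame_cyc + loss_motion_cyc + loss_temporal  # weights assumed
```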

2.3 Inference and Demonstration
After training our translator on real-world videos & cartoon animations, or on real-world videos & ink-wash animations, we obtain the learnt generator GX (Gcartoon or Gink-wash) for video-to-cartoon or video-to-ink-wash animation translation, respectively. During inference, given an input video, the generator Gcartoon or Gink-wash is directly employed to convert the input video into the synthetic video frame-by-frame, as depicted in Figure 3.
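A minimal frame-by-frame inference loop might look as follows; the checkpoint path, the way the generator is loaded, the preprocessing, and the output codec are all assumptions.

```python
# Minimal sketch of frame-by-frame inference with a learned generator
# (Gcartoon or Gink-wash). Checkpoint path, preprocessing and codec are assumptions.
import cv2
import torch

generator = torch.load("g_cartoon.pth", map_location="cpu")  # assumed full-module checkpoint
generator.eval()

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("cartoon.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # uint8 HxWx3 -> float tensor 1x3xHxW in [0, 1]
        x = torch.from_numpy(frame).float().permute(2, 0, 1).unsqueeze(0) / 255.0
        y = generator(x).clamp(0, 1)
        # back to uint8 HxWx3 for writing
        out.write((y[0].permute(1, 2, 0).numpy() * 255).astype("uint8"))

cap.release()
out.release()
```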

In the demonstration, a camera keeps capturing what users are seeing. Meanwhile, the client of the video-to-animation translator is set up to receive the live streaming video and transform it into cartoon or ink-wash animation, which is finally displayed to users. The whole system currently runs on a regular PC with a 3.20 GHz CPU, a single NVIDIA GeForce GTX 1070 GPU, and 16 GB RAM. For each input frame, our video-to-animation translator takes about 0.03 seconds in total (roughly 33 frames per second), which supports video translation in real time.
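The live demo loop could be organized along the lines of the sketch below, which also measures per-frame latency; the camera index, window name, and the `translate` callable are assumptions, not part of the authors' system.

```python
# Sketch of the live demo loop: capture from a camera, translate each frame,
# display the result, and report per-frame latency. The `translate` callable
# (e.g., the generator wrapped as image-in/image-out) is an assumption.
import time
import cv2

def run_live_demo(translate, camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        start = time.time()
        animated = translate(frame)
        latency = time.time() - start
        cv2.putText(animated, f"{latency * 1000:.0f} ms/frame", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("video-to-animation", animated)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```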

3 EVALUATION
To quantitatively evaluate our video-to-animation translator, we conduct a user study by inviting 10 labelers from different education backgrounds to annotate the generated animations for 50 real-world videos. We show each input real-world video with its two translated cartoon and ink-wash animations, and ask the labelers to rate each translated animation on a five-point ordinal scale (5: Excellent; 4: Good; 3: Neutral; 2: Bad; 1: Very Bad). We treat translated animations with scores higher than 3 as satisfying. According to all labelers' feedback, the satisfying rates for cartoon and ink-wash animations are 72% and 68%, respectively.
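For concreteness, the satisfying rate is simply the fraction of score annotations above 3; the toy scores in the snippet below are made up purely to illustrate the computation.

```python
# Toy illustration of the satisfying-rate metric: the fraction of labeler
# scores greater than 3 on the 5-point scale. The scores below are made up.
def satisfying_rate(scores):
    return sum(s > 3 for s in scores) / len(scores)

toy_scores = [5, 4, 3, 4, 2, 5, 4, 4, 3, 5]  # hypothetical annotations
print(f"satisfying rate: {satisfying_rate(toy_scores):.0%}")  # -> 70%
```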


REFERENCES
[1] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. 2018. CartoonGAN: Generative adversarial networks for photo cartoonization. In CVPR.
[2] Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, and Tao Mei. 2019. Relation Distillation Networks for Video Object Detection. In ICCV.
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
[4] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. 2017. Real-time neural style transfer for videos. In CVPR.
[5] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.
[6] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In ICML.
[7] Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In NIPS.
[8] Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. 2018. Unsupervised Attention-guided Image-to-Image Translation. In NIPS.
[9] Yingwei Pan, Yehao Li, Ting Yao, Tao Mei, Houqiang Li, and Yong Rui. 2016. Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure. In IJCAI.
[10] Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. Seeing Bot. In SIGIR.
[11] Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. To create what you tell: Generating videos from captions. In ACM MM.
[12] Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In CVPR.
[13] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. In NIPS.
[14] Xuewen Yang, Dongliang Xie, and Xin Wang. 2018. Crossing-Domain Generative Adversarial Networks for Unsupervised Multi-Domain Image-to-Image Translation. In ACM MM.
[15] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV.
[16] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
