

Copyright is held by the author/owner(s). I3D 2013, Orlando, FL, March 21 – 23, 2013. © 2013 ACM 978-1-4503-1956-0/13/0003 $15.00

A Simple Method for High Quality Artist-Driven Lip Syncing

Yuyu Xu∗   Andrew W. Feng†   Ari Shapiro‡

Institute for Creative Technologies

Synchronizing the lip and mouth movements naturally along with animation is an important part of convincing 3D character performance. We present a simple, portable and editable lip-synchronization method that works for multiple languages, requires no machine learning, can be constructed by a skilled animator, runs in real time, and can be personalized for each character. Our method associates animation curves, designed by an animator on a fixed set of static facial poses, with sequential pairs of phonemes (diphones), and then stitches the diphones together to create a set of curves for the facial poses. Diphone- and triphone-based methods have been explored in various previous works [Deng et al. 2006], often requiring machine learning. However, our experiments have shown that diphones are sufficient for producing high-quality lip syncing, and that longer sequences of phonemes are not necessary. Our experiments have also shown that skilled animators can generate the data needed for good-quality results. Thus our algorithm does not need any specific rules about coarticulation, such as dominance functions [Cohen and Massaro 1993] or language rules; such rules are implicit within the artist-produced data. In order to produce a tractable set of data, our method reduces the full set of 40 English phonemes to a smaller set of 21, which are then annotated by an animator. Once the full diphone set of animations has been generated, it can be reused for multiple characters. Each additional character requires a small set of eight static poses or blendshapes. In addition, each language requires a new set of diphones, although similar phonemes among languages can share the same diphone curves. We show how to reuse our English diphone set to adapt to a Mandarin diphone set.
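To make the stitching step concrete, the following Python sketch shows one way the diphone lookup, retiming to absolute phoneme timings, and boundary smoothing could be organized. This is not the authors' implementation: the data layout (DIPHONE_CURVES), the function names, and the overlap-averaging smoothing are illustrative assumptions.

from collections import defaultdict

# Hypothetical artist-authored data: for each diphone (a pair from the reduced
# phoneme set), a curve per facial pose, keyed over normalized time [0, 1], e.g.
# DIPHONE_CURVES[("F", "Ah")]["open"] = [(0.0, 0.0), (0.5, 0.8), (1.0, 0.2)]
DIPHONE_CURVES = {}

def retime(curve, t_start, t_end):
    """Map a curve keyed on normalized time [0, 1] onto absolute time [t_start, t_end]."""
    span = t_end - t_start
    return [(t_start + u * span, w) for (u, w) in curve]

def stitch_diphones(phonemes):
    """phonemes: list of (phoneme, start_time, end_time) from TTS or audio alignment.
    Returns, per facial pose, a sorted list of (time, weight) keys driving that pose."""
    pose_keys = defaultdict(list)
    for (p0, s0, _e0), (p1, _s1, e1) in zip(phonemes, phonemes[1:]):
        curves = DIPHONE_CURVES.get((p0, p1))
        if curves is None:
            continue  # unseen diphone; a neutral fallback pose could be used here
        # Each diphone spans its two phonemes, so consecutive diphones overlap by one phoneme.
        for pose, curve in curves.items():
            pose_keys[pose].extend(retime(curve, s0, e1))
    # Simple smoothing: average the weights contributed by overlapping diphones.
    for pose, keys in list(pose_keys.items()):
        merged = defaultdict(list)
        for t, w in keys:
            merged[round(t, 3)].append(w)
        pose_keys[pose] = sorted((t, sum(ws) / len(ws)) for t, ws in merged.items())
    return pose_keys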

Our method fits well within the video game, simulation, film and machinima pipelines. Our data can be regenerated by most professional animators, our stitching and smoothing algorithms are straightforward and simple to implement, and our technique can be ported to new characters by using facial poses consisting of either joint-based faces, blendshape poses, or both. In addition, edits or changes to specific utterances can be readily identified, edited and customized on a per-character basis. By contrast, many machine learning techniques are black-box techniques that are difficult to edit in an intuitive way and require large data sets. Based on these characteristics, we believe that our method is well-suited for commercial pipelines. We openly provide our English and Mandarin data sets for download at: http://smartbody.ict.usc.edu/lipsynch/. We present a direct comparison with a popular commercial lip-syncing engine, FaceFX, using identical models and face shapes, and present a study that shows that our method received higher scores related to naturalness and accuracy.
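As an illustration of the portability point, the sketch below shows how the stitched per-pose weight curves could drive either a blendshape rig or a joint-based rig. The Character interface (has_blendshape, set_blendshape_weight, pose_joint_offsets, add_joint_offset) is hypothetical and not part of the paper or of any particular engine's API.

def sample(keys, t):
    """Linearly interpolate a sorted list of (time, weight) keys at time t."""
    if not keys:
        return 0.0
    if t <= keys[0][0]:
        return keys[0][1]
    for (t0, w0), (t1, w1) in zip(keys, keys[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
            return w0 + a * (w1 - w0)
    return keys[-1][1]

def apply_pose_curves(character, pose_keys, t):
    """Drive the character's static facial poses at time t from the stitched weight curves."""
    for pose_name, keys in pose_keys.items():
        w = sample(keys, t)
        if character.has_blendshape(pose_name):
            # Blendshape rig: the pose maps directly to a blendshape weight.
            character.set_blendshape_weight(pose_name, w)
        else:
            # Joint-based rig: scale the pose's stored joint offsets by the same weight.
            for joint, offset in character.pose_joint_offsets(pose_name):
                character.add_joint_offset(joint, offset * w)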

∗e-mail:[email protected]†e-mail:[email protected]‡e-mail:[email protected]


Figure 1: Curves for the F-Ah diphone. The animator selects three facial poses: FV, open, and wide, and constructs animation curves over normalized time. The animation can be directly played on the character, or the character can be driven using TTS or recorded audio.

Figure 2: Comparison between our method (diphone) and FaceFX using both realistic and cartoony characters. We use Amazon Mechanical Turk to collect viewer ratings from about 320 participants.

References

COHEN, M. M., AND MASSARO, D. W. 1993. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, Springer-Verlag, 139–156.

DENG, Z., CHIANG, P.-Y., FOX, P., AND NEUMANN, U. 2006. Animating blendshape faces by cross-mapping motion capture data. In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, ACM, New York, NY, USA, I3D '06, 43–48.
