
Generation of 3D Human Models and Animations Using Simple Sketches

Alican Akman*
Koç University

Yusuf Sahillioğlu†
Middle East Technical University

T. Metin Sezgin‡
Koç University

Figure 1: Our framework is capable of performing three tasks. (a) It can generate 3D models from given 2D stick figure sketches. (b) It can generate dynamic 3D models, i.e., animations, between given source and target stick figures. (c) It can further extrapolate the produced 3D model sequence by using the learned interpolation vector.

ABSTRACT

Generating 3D models from 2D images or sketches is a widely studied, important problem in computer graphics. We describe the first method to generate a 3D human model from a single sketched stick figure. In contrast to the existing human modeling techniques, our method requires neither a statistical body shape model nor a rigged 3D character model. We exploit Variational Autoencoders to develop a novel framework capable of transitioning from a simple 2D stick figure sketch to a corresponding 3D human model. Our network learns the mapping between the input sketch and the output 3D model. Furthermore, our model learns the embedding space around these models. We demonstrate that our network can generate not only 3D models, but also 3D animations through interpolation and extrapolation in the learned embedding space. Extensive experiments show that our model learns to generate reasonable 3D models and animations.

Index Terms: Computing methodologies—Neural networks; Computing methodologies—Learning latent representations; Computing methodologies—Shape modeling

*e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]

1 INTRODUCTION

3D content is still not as abundant as image and video data. One of the main reasons for this lack of abundance is the labor that goes into the creation process. Despite the increasing number of talented artists and automated tools, creating 3D content is clearly not as simple and quick as hitting the record button on a phone.

3D content is, on the other hand, as important as image and video data, since it is used in many useful pipelines ranging from 3D printing to 3D gaming and filming.

With these considerations in mind, we aim to make the important task of 3D content creation simpler and faster. To this end, we train a neural network over pairs of 2D stick figures and corresponding 3D models. Utilization of easy-to-sketch and sufficiently expressive 2D stick figures is a unique feature of our system that makes it work properly even with a moderate amount of training data, e.g., 72 distinct poses of a human model. We focus on human models only, as they come to the fore in 3D applications.

Given 2D human stick figure sketches, our algorithm is able to produce visually appealing 3D point cloud models without requiring any other input such as a rigged template model. After an easy and seamless tweak to the network, the system is also capable of producing dynamic 3D models, i.e., animations, between source and target stick figures.

Graphics Interface Conference 2020, 28-29 May. Copyright held by authors. Permission granted to CHCCS/SCDHM to publish in print and digital form, and ACM to publish electronically.


2 RELATED WORK

Thanks to their natural expressive power, sketches are a common mode of interaction for various graphics applications [30, 39].

The majority of sketch-based 3D human modeling methods deal with re-posing a rigged input model under the guidance of user sketches. [15] performs this action by transforming imaginary lines running down a character's major bone chains, whereas [32] and [16] propose incremental schemes that pose characters one limb at a time. Methods that use 2D stick figures to pose characters benefit from user annotations [9], specific priors [26], and database queries [8, 45]. Bessmeltsev et al. [5] claim that the ambiguity problems of all these methods can be alleviated by contour-based gesture drawing. The deep regression network of [17] utilizes contour drawings to allow face creation in minutes. Another system, which takes one or more contour drawings as its input, uses deep convolutional neural networks to create a variety of 3D shapes [10]. We avoid the need for a rigged model in the input specification and merely require a user sketch, which, when fed into our network, produces the 3D model quickly in the specified pose. As the network is trained with the SCAPE models [3], our resulting 3D shape looks like the SCAPE actor, i.e., a fit 30-year-old man.

There also exist sketch-based modeling methods for other specific objects such as hair [13, 37] and plants [2], as well as general-purpose methods that are not restricted to a particular object class. These generic methods, consequently, may not perform as accurately as their object-specific counterparts for those objects, but still demonstrate impressive results. 3D free-form design by the pioneering Teddy model [19] is improved in FiberMesh [29] and SmoothSketch [21] by removing the potential cusps and T-junctions with the addition of features such as topological shape reconstruction and hidden contour completion. The recent SymmSketch system [28] exploits a symmetry assumption to generate symmetric 3D free-form models from 2D sketches. In order to increase the quality of generated 3D models, [42] focuses on piecewise-smooth man-made shapes. Their deep neural network-based system infers a set of parametric surfaces that realize the drawing in 3D. Other solutions to the sketch-based generic 3D model creation problem depend on image guidance [43, 47], snapping one of the predefined primitives to the sketch by fitting its projection to the sketch lines [41], and controlled curvature variation patterns [25].

3D model generation and editing have been extended to 3D scenes as well. Dating back to 1996 [48], this line of work generally indexes 3D model repositories by their 2D projections from multiple views and retrieves the elements that best match the 2D sketch query [23, 40]. Xu et al. [46] extend this idea further by jointly processing the sketched objects, e.g., while a single sketch of a computer mouse is not easy to recognize, other input sketches, such as a computer keyboard, may provide useful cues. Sketch-based user interfaces arise in 2D image generation as well [7, 12].

Sketches also arise frequently in shape retrieval applications due to their simplicity and expressiveness. Our focus, the human stick figure sketch, has been used successfully in [35] to retrieve 3D human models from large databases. The prominent example in this domain [11], as well as the convolutional neural network based method [44], report good performance with gesture drawings when it comes to retrieving humans. These three methods, as well as many other sketch-based retrieval methods [24], are in general also successful at retrieving non-human models.

Although human body types under the same pose can be learned easily with moderate amounts of data through statistical shape modeling [1, 38], this approach requires much greater amounts of input data to learn plausible shape poses under various deformations [18, 33]. In addition to the data issue, this family of methods based on statistical analysis of human body shape operates directly on vertex positions, which brings the disadvantage that rotated body parts have a completely different representation. This issue is addressed with various heuristics, the most successful of which leads to the SMPL model [27] that enables 3D human extraction from 2D images [20, 31]. Our learning-based solution requires a moderate amount of training data, and also alleviates the rotated part issue by simply populating the input data with 17 other rotated versions of each model.

3 OVERVIEW

We have two main objectives: (i) generating 3D models from a single sketched stick figure, and (ii) creating 3D animations between two 3D models generated from 2D source and target sketches. In addition, we present an application that allows interactive modeling using our algorithm.

Our approach is powered by a Variational Autoencoder (VAE) network. We train this network with pairs of 3D and 2D points. The 3D points come from the SCAPE 3D human figure database, while the 2D points are obtained by projecting the joint positions of these models onto a 2D plane. Hence the correspondence information is preserved. Our neural network model ties the 2D and 3D representations through a latent space, which allows us to generate 3D point clouds from 2D stick figures.

The latent space that ties the 2D and 3D representations also acts as a convenient lower dimensional embedding for interpolation and extrapolation. Given a set of target key frames in the form of 2D drawings, we can map them into the lower dimensional embedding space, and then interpolate between them to obtain a number of via points in the embedding space. These via points can then be mapped to 3D through the network to obtain a smooth 3D animation sequence. Furthermore, extrapolation allows extending the animation sequence beyond the target key frames.

4 METHODOLOGY

Our method aims to generate static and dynamic 3D human models from scratch; that is, we require only the 2D input sketch and no other data such as a rigged 3D model waiting to be reposed. To make this possible, we learn a model that maps 2D stick figures to 3D models.

4.1 Training Data Generation

The original SCAPE dataset consists of 72 key 3D meshes of a human actor [3]. It also contains point-to-point correspondence information between these distinct model poses. We use a simple algorithm to extend this dataset by rotating the existing figures by different angles. First, we determine the axes and the angles of the rotation with respect to the original coordinate system shown in the inset figure. We ignore rotation with respect to the x-axis, since stick figures are less likely to be drawn from this view. Next, we rotate the models with respect to the y-axis and z-axis, in a range of -90 to 90 degrees at intervals of 30 degrees for the y-axis, and -45 to 45 degrees at intervals of 45 degrees for the z-axis. At the end of this process, we output 21 models per key model in the SCAPE dataset.
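The augmentation step above can be pictured with a short sketch. The paper does not provide an implementation, so the function names below (rotation_y, rotation_z, augment_scape_model) are ours; only the angle grid and the 21-variant count come from the text.

# Illustrative sketch of the rotation-based augmentation described above.
import numpy as np

def rotation_y(deg):
    """Rotation matrix about the y-axis (degrees)."""
    t = np.deg2rad(deg)
    return np.array([[ np.cos(t), 0.0, np.sin(t)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(t), 0.0, np.cos(t)]])

def rotation_z(deg):
    """Rotation matrix about the z-axis (degrees)."""
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def augment_scape_model(points):
    """points: (12500, 3) vertex array of one SCAPE pose.
    Returns 21 rotated copies: 7 y-angles x 3 z-angles."""
    variants = []
    for y_deg in range(-90, 91, 30):      # -90, -60, ..., 90
        for z_deg in (-45, 0, 45):        # -45, 0, 45
            R = rotation_z(z_deg) @ rotation_y(y_deg)
            variants.append(points @ R.T)
    return variants                        # list of 21 (12500, 3) arrays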

Since our network is trained with (2D joints, 3D model) pairs, we also extract 2D joints from a 3D model in a particular perspective. We designate 11 essential points that alone can describe a 3D human pose. These are the following: forehead, elbows, hands, neck, abdomen, knees, and feet. Since the dataset has the point-to-point correspondence information in itself, we select these 3D points in a pilot mesh from the dataset. We project these joints onto a 2D camera plane (x-y in our case) across our entire dataset to create 2D joint projections. In order to be independent of the coordinate system, we represent these points with relative positions (∆x, ∆y).


Figure 2: Our neural network architecture. (a) Train Network: We train this network with (3D point cloud, 2D points of stick figure) pairs during training time. It consists of a VAE, Encoder3D and Decoder consecutively, and another external encoder, Encoder2D. We use a regression loss from the output of Encoder2D to the mean vector of the VAE in addition to the standard losses of the VAE. (b) Test Network: We remove Encoder3D and the reparameterization layer from our VAE and use Encoder2D-Decoder as our network in our experiments.

Figure 3: Screenshots from our user interface. (a) 3D model generation mode. (b) 3D animation generation mode.

We determine a specific order for the 2D points, forming a sketching path with 17 points (some joints are visited twice but in reverse direction). This sketching path determines the order of the 2D points that form the input vector of our neural network. The input vector format also handles front/back ambiguity while generating 3D models. We set the first point in the sketching path as the origin, (0, 0), and then we set the remaining points with respect to their relative position to the preceding point.
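A minimal sketch of this 2D input encoding follows. The text describes a 17-point sketching path but later states that Encoder2D consumes 11 × 2 values; the sketch below follows the 11 × 2 reading and treats the path purely as the drawing order. The joint ordering and the function name encode_sketch are assumptions, not taken from the paper.

# Illustrative 2D input encoding: project the 11 joints onto the x-y camera
# plane and express them as relative (dx, dy) steps, first joint at the origin.
import numpy as np

# Assumed visiting order of the 11 designated joints (illustrative only).
JOINT_ORDER = ["forehead", "neck", "r_elbow", "r_hand", "l_elbow", "l_hand",
               "abdomen", "r_knee", "r_foot", "l_knee", "l_foot"]

def encode_sketch(joints_3d):
    """joints_3d: dict joint name -> (x, y, z) position.
    Returns 11 relative (dx, dy) pairs flattened into a length-22 vector."""
    pts = np.array([joints_3d[name][:2] for name in JOINT_ORDER])   # drop z
    deltas = np.vstack([np.zeros((1, 2)), np.diff(pts, axis=0)])    # (0, 0) first
    return deltas.flatten()                                         # shape (22,)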

4.2 Neural Network Architecture

We build upon the work of Girdhar et al. [14] while designing our neural network architecture. Girdhar et al. aim to learn a vector representation that is generative, i.e., it can generate 3D models in the form of voxels, and predictable, i.e., it can be predicted from a 2D image. We utilize variational autoencoders rather than standard autoencoders to build our neural network, as shown in Figure 2. Unlike standard autoencoders, VAEs are generative networks whose latent distribution is regularized during training to be close to a standard Gaussian distribution. This property of VAEs ensures that the latent distribution has a meaningful organization, which allows us to generate novel 3D models by sampling from this distribution. In addition to generating novel 3D models, since our framework is capable of learning the vector space around these 3D models, it enables meaningful transitions between them and extrapolations beyond them.

For our training network, we have two encoders and one decoder: Encoder3D, Encoder2D, and Decoder. Encoder3D and Decoder together serve as a VAE. Our VAE takes in a 3D point cloud as input and reconstructs the same model as the output. While our VAE learns to reconstruct 3D models, it forces the latent distribution of the dataset to approximate a normal distribution, which makes the latent space interpolatable. Meanwhile, we use our Encoder2D to predict the latent vectors of corresponding 3D models from 2D points. In order to provide our latent distribution with similarity information between 3D models, we design this partial architecture for our neural network instead of using a VAE that directly generates 3D models from 2D sketches. Thus, our Encoder3D is capable of learning relations between 3D models rather than 2D sketches while creating a regularized latent distribution. With this method, we aim to explore the latent space better and generate more meaningful transitions between 3D models.

Encoder3D-Decoder VAE Network Architecture:

Our VAE takes a 12,500 × 3 point cloud of the human 3D model as input. Encoder3D contains two fully connected layers, and its outputs are a mean vector and a deviation vector. We use ReLU as the activation in all internal layers. There is no activation in the output layers. Our Decoder takes the latent vector z as input. It consists of two fully connected layers with ReLU activations and one fully connected output layer with a tanh activation. It gives a reconstructed point cloud of the input 3D model as the output.
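The following is a minimal PyTorch sketch of the Encoder3D/Decoder pair just described. The input size (12,500 × 3 = 37,500 values), the 256-D latent space, and the activations (ReLU internally, tanh on the output, none on the mean/deviation heads) follow the text; the hidden widths are placeholders we chose for illustration, not values reported by the authors.

import torch.nn as nn

N_IN = 12_500 * 3   # flattened point cloud
LATENT = 256

class Encoder3D(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(N_IN, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, LATENT)        # mean vector (no activation)
        self.logvar = nn.Linear(hidden, LATENT)    # deviation vector, as log-variance

    def forward(self, x):                          # x: (B, 37500)
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    def __init__(self, hidden1=512, hidden2=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT, hidden1), nn.ReLU(),
                                 nn.Linear(hidden1, hidden2), nn.ReLU(),
                                 nn.Linear(hidden2, N_IN), nn.Tanh())

    def forward(self, z):                          # z: (B, 256)
        return self.net(z)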

We train our VAE with the standard KL divergence and reconstruction losses. The total loss for our VAE is given in Equation 1.

\[
\mathcal{L}_{\mathrm{VAE}} = \alpha\,\frac{1}{BN}\sum_{i=1}^{B}\sum_{j=1}^{N}\left(x_{ij}-\hat{x}_{ij}\right)^{2} + D_{\mathrm{KL}}\!\left(q(z\mid x)\,\|\,p(z)\right) \quad (1)
\]

In Equation 1, α is a parameter that balances the reconstruction loss and the KL divergence loss, B is the training batch size, N is the dimension of the vectorized point cloud (37,500 in our case), x_ij is the j-th dimension of the i-th model in the training batch, x̂_ij is the j-th dimension of model i's output from our VAE, z is the reparameterized latent vector, p(z) is the prior probability, q(z|x) is the posterior probability, and D_KL is the KL divergence.
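A hedged code sketch of Equation 1 is given below. It assumes the Gaussian posterior q(z|x) is parameterized by (mu, logvar) as in the architecture sketch above, and that the KL term is averaged over the batch; the text does not specify the latter, so treat that as an assumption.

import torch

def vae_loss(x, x_hat, mu, logvar, alpha=1e5):
    """x, x_hat: (B, N) flattened point clouds; mu, logvar: (B, Z)."""
    recon = ((x - x_hat) ** 2).mean()   # (1 / BN) * sum of squared differences
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return alpha * recon + kl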

Mapping 2D Sketch Points to Latent Vector Space:

Our Encoder2D learns to predict the latent vectors of the 3D models that correspond to 2D sketches, as discussed. It takes 11 × 2 points as input and maps them to a mean vector. It has the same structure as Encoder3D except for the dimensions of its input and internal fully connected layers. In the test case, we use Encoder2D and Decoder as a standard autoencoder. The Decoder takes the mean vector output of Encoder2D as its input and generates a 3D point cloud as the output.
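A sketch of Encoder2D following this description: same layout as Encoder3D but with an 11 × 2 = 22-dimensional input and, per the text, different internal widths; the hidden width below is a placeholder of our choosing.

import torch.nn as nn

class Encoder2D(nn.Module):
    def __init__(self, hidden=128, latent=256):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(22, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)    # predicted mean vector, no activation

    def forward(self, sketch):                 # sketch: (B, 22) relative joint coords
        return self.mu(self.hidden(sketch))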

We train our Encoder2D with a mean squared error loss to regress the 256-dimensional mean vector given by the pre-trained Encoder3D. The loss for our Encoder2D is given in Equation 2.

\[
\mathcal{L}_{\mathrm{2D}} = \frac{1}{BZ}\sum_{i=1}^{B}\sum_{j=1}^{Z}\left(\mu^{1}_{ij}-\mu^{2}_{ij}\right)^{2} \quad (2)
\]

In Equation 2, B is the training batch size, Z is the dimension of the latent space (256 in our case), μ¹_ij is the j-th dimension of the mean vector produced by Encoder3D for the i-th model in the training batch, and μ²_ij is the j-th dimension of the mean vector produced by Encoder2D for the i-th model in the training batch.

4.3 Training Details

We follow a three-stage process to train our network. (i) We train our variational autoencoder independently with the loss function in Equation (1). We run this stage for 300 epochs. (ii) We train our Encoder2D with the loss function in Equation (2) using the Encoder3D trained in (i). Specifically, we train our Encoder2D to regress the latent vector produced by the pre-trained Encoder3D for the input 3D model. We run this stage for 300 epochs. (iii) We use both losses jointly to fine-tune our framework. We run this stage for 200 epochs. It takes about two days to complete the whole training session.

For the experiments throughout this paper, we set α = 10^5. We set the prior probability over the latent variables as a standard normal distribution, p(z) = N(z; 0, I). We set the learning rate to 10^-3. We use the Adam optimizer in training.
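The three-stage schedule can be condensed into the following sketch. The epoch counts, α = 1e5, the learning rate, and the use of Adam come from the text; everything else, including the assumption that the data loader yields (3D point cloud, 2D sketch) pairs and the reuse of the vae_loss helper sketched after Equation 1, is illustrative.

import torch

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def train(enc3d, enc2d, dec, loader, device="cuda"):
    enc3d, enc2d, dec = enc3d.to(device), enc2d.to(device), dec.to(device)

    # Stage (i): train the VAE (Encoder3D + Decoder) alone for 300 epochs.
    opt = torch.optim.Adam(list(enc3d.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(300):
        for pc3d, _sketch2d in loader:
            pc3d = pc3d.to(device)
            mu, logvar = enc3d(pc3d)
            x_hat = dec(reparameterize(mu, logvar))
            loss = vae_loss(pc3d, x_hat, mu, logvar, alpha=1e5)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage (ii): freeze Encoder3D, regress its mean vector with Encoder2D (300 epochs).
    opt = torch.optim.Adam(enc2d.parameters(), lr=1e-3)
    for _ in range(300):
        for pc3d, sketch2d in loader:
            with torch.no_grad():
                mu_target, _ = enc3d(pc3d.to(device))
            mu_pred = enc2d(sketch2d.to(device))
            loss = ((mu_pred - mu_target) ** 2).mean()      # Equation 2
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage (iii): fine-tune everything jointly with both losses (200 epochs).
    params = list(enc3d.parameters()) + list(enc2d.parameters()) + list(dec.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(200):
        for pc3d, sketch2d in loader:
            pc3d, sketch2d = pc3d.to(device), sketch2d.to(device)
            mu, logvar = enc3d(pc3d)
            x_hat = dec(reparameterize(mu, logvar))
            loss = vae_loss(pc3d, x_hat, mu, logvar, alpha=1e5) \
                 + ((enc2d(sketch2d) - mu) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()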

4.4 User Interface

As explained in the prior sections, our method can be used in a variety of applications in different fields, such as character generation and making quick animations. In order to better support these applications, we propose a user interface with facilitative properties that enables users to apply our method more effectively (Figure 3). Our user interface acts as an agent that handles the input-output communication between the user and our neural network. The user can choose whether to generate 3D models from 2D stick figure sketches or to create animations between source and target stick figure sketches. While it takes about one second to process a sketch input for the generation of the corresponding 3D model, this time extends to approximately five seconds for producing an animation between (interpolation) and beyond (extrapolation) sketch inputs.

Our user interface takes a sketch input from the user via an embedded 2D canvas (right part). The collected sketch is transformed into a map of joint locations as input to our neural network by requiring users to sketch along a predetermined path. The user interface then shows the 3D model output produced by the neural network on the embedded 3D canvas (left part). Although our output is a 3D point cloud, for better visualization, our user interface utilizes the mesh information that already exists in the SCAPE dataset. Generated point clouds are consequently combined with this information to display 3D models as surfaces with appropriate rendering mechanisms such as shading and lighting.

Since we trained an abundance of neural networks until achieving the best one with optimal parameters, our user interface showed two different 3D model outputs coming from two different neural networks for comparison during the development phase. The interface in Figure 3 belongs to our release mode, where only the promoted 3D output is displayed.

We construct our user interface such that it is free of unnatural interaction tools such as buttons and menus. The generation process starts as soon as the last stroke is drawn, without forcing the user to hit a start button. We provide brief information in text that describes the canvas organization. To make the interaction more fluent, we add a simple "scribble" detector to recognize the undo operation.

5 EXPERIMENTS AND RESULTS

In this section, we first evaluate our framework qualitatively and quantitatively. We evaluate three tasks performed by our framework: (i) generating 3D models from 2D stick figure sketches, (ii) generating 3D animations between source and target stick figure sketches, and (iii) performing simple extrapolation beyond stick figures.

5.1 Framework Evaluation

Standard Autoencoder Baseline. To quantitatively justify our preference for variational autoencoders, we design a standard autoencoder (AE) baseline with similar dimensions and activation functions (Figure 4). We train this network for 300 epochs with a Euclidean loss between generated 3D models and ground truth ones. We compare the per-vertex reconstruction loss on a validation set consisting of held-out 3D models for our VAE network and the standard AE network. The results in Table 1 show that our VAE network outperforms the AE network, exploiting the latent space more efficiently to enhance the generation quality of novel 3D models.
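The exact per-vertex error formula is not spelled out in the text; one plausible reading, used in the sketch below, is the mean Euclidean distance between corresponding ground-truth and reconstructed vertices.

import torch

def per_vertex_error(pc_gt, pc_pred):
    """pc_gt, pc_pred: (12500, 3) tensors with known vertex correspondence."""
    return (pc_gt - pc_pred).norm(dim=1).mean()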

Latent Dimensions. We evaluate different latent dimensions for our VAE framework using the per-vertex reconstruction loss on the validation set. The results in Table 2 show that using 256 dimensions improves generation quality compared to lower dimensions. Higher dimensions lead to overfitting. We use 256 dimensions for the following experiments.


Figure 4: Neural network architecture for the standard autoencoder baseline.

Table 1: Per-vertex reconstruction error on validation data with different network architectures. VAE represents our method.

Method   Z (Latent Dim.)   Mean (×10^-3)   Std (×10^-3)
AE       256               2.996           1.568
VAE      256               2.118           2.598

Table 2: Per-vertex reconstruction error on validation data with different latent dimensions. 256 is our promoted latent space dimension.

Method   Z (Latent Dim.)   Mean (×10^-3)   Std (×10^-3)
VAE      128               3.887           3.802
VAE      256               2.118           2.598
VAE      512               10.830          6.888

5.2 Generating 3D models from 2D stick figure sketches

To evaluate the generation ability, we feed 2D stick figure sketches as input to our framework. Our user interface is used for the sketching activity. In Figure 5, we compare generation results for sample sketches using our VAE network and the standard AE baseline. These results show that the standard AE network generates models with anatomical flaws (the collapsed arm in Figure 5) and deficient body orientations. Our VAE network produces high-quality models that are compatible with the sketches.

We also compare the generation ability of our framework with a recent study [20] that predicts 3D shape from 2D joints. While our framework outputs 3D point clouds, the method described in [20] tries to fit a statistical body shape model, SMPL [6], to 2D joints. Their learned statistical model is a male with a different body shape than our male model, as shown in the second column of Figure 6. We take the visuals reported in their paper and draw the corresponding stick figures for a fair comparison. Despite being restricted to the poses reported in [20], our method compares favorably. The sitting pose, for instance, which is not quite captured by their statistical model, results in an inaccurate 3D sitting pose, while our 3D model sits successfully (Figure 6, top row).

We finally compare our 3D human model generation ability to the ground truth by feeding sketches that resemble 3 of the 72 SCAPE poses used during training. Two of these poses were observed during training at the same orientations as we draw them in the first two rows of Figure 7, whereas the last one is drawn with a new orientation that was not observed during training (Figure 7, last row). Consequently, we had to align the generated model with the ground-truth model via ICP alignment [4] for the last row prior to the comparison. We observe almost perfect replications of the SCAPE poses in all cases, as depicted by the colored difference maps.

5.3 Generation of Dynamic 3D Models - Interpolation

We test the capability of our network to generate 3D animations between source and target stick figures. To accomplish this, our framework takes two stick figure sketches via our user interface. The framework then encodes the stick figures into the embedding space to extract their latent variables. We create a list of latent variables by linearly interpolating between the source and target latent variables. We feed this list as input to our decoder, with each element of the list being fed one by one to produce the desired 3D model sequence. In Figure 8 (a), we compare our results with direct linear interpolation between the source and target 3D model outputs. Our results show that interpolation in the embedding space can avoid the anatomical errors which usually occur in methods using direct linear interpolation. Further interpolation results can be found in the teaser image and the accompanying video.
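The interpolation step described above reduces to a few lines once the trained Encoder2D and Decoder are available; the function name animate and the frame count are ours.

import torch

def animate(enc2d, dec, sketch_src, sketch_tgt, n_frames=20):
    """Linearly interpolate between two sketches in the latent space and
    decode each via point into one 3D point cloud frame."""
    with torch.no_grad():
        z_src = enc2d(sketch_src.unsqueeze(0))      # (1, 256)
        z_tgt = enc2d(sketch_tgt.unsqueeze(0))
        frames = []
        for t in torch.linspace(0.0, 1.0, n_frames):
            z = (1 - t) * z_src + t * z_tgt         # via point in latent space
            frames.append(dec(z).view(-1, 3))       # (12500, 3) point cloud
    return frames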

5.4 Generation of Dynamic 3D Models - Extrapolation

Our results show that the interpolation vector in the embedding space is capable of reasonably representing the motion in real-world actions. Building on this idea, we exploit the learned interpolation vector to predict future movements. We show our results for extrapolation in Figure 8 (b). This figure shows that the learned interpolation vector between two 3D shapes contains meaningful information about movement in 3D. Further extrapolation results can be found in the teaser image and the accompanying video.
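A sketch of this extrapolation idea: reuse the latent difference between the two key frames as the interpolation vector and keep stepping past the target. The step size and frame count below are arbitrary choices, not values from the paper.

import torch

def extrapolate(enc2d, dec, sketch_src, sketch_tgt, n_extra=5, step=0.2):
    with torch.no_grad():
        z_src = enc2d(sketch_src.unsqueeze(0))
        z_tgt = enc2d(sketch_tgt.unsqueeze(0))
        direction = z_tgt - z_src                   # learned interpolation vector
        frames = []
        for k in range(1, n_extra + 1):
            z = z_tgt + k * step * direction        # continue past the target frame
            frames.append(dec(z).view(-1, 3))
    return frames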

5.5 Timing

We finish our evaluations with timing information. The closest work to our framework reports 10 minutes of production time for an amateur user to create 3D faces from 2D sketches [17]. In our system, on the other hand, it takes an amateur user 10 seconds of stick figure drawing and 1 second of algorithm processing to create 3D bodies from 2D sketches.


Figure 5: Generated 3D models in front and side views for given sketches using our VAE network and the standard AE network. Flaws are highlighted with red circles.

Figure 6: Qualitative comparison of our method (columns 3 and 4) with [20] (columns 1 and 2).

Figure 7: Generation results for (a) two sketches observed in training and (b) an unobserved sketch. The generated models (last column) are colored with respect to the distance map to the ground truth (first column). Distance values are normalized between 0 and 1.


Figure 8: Qualitative comparison with linear interpolation. (a) Produced 3D model sequences for given sketches using our network are better than linear interpolation results. (b) Extrapolation results for 3D model sequences are given as red models.

There are two main reasons for this remarkable performance advantage of our framework: (i) Human bodies can be naturally represented with easy-to-draw stick figures, whereas faces cannot. The simplicity and expressiveness make the learning easier and more efficient. (ii) Our deep regression network is significantly less complex than the one employed in [17].

Our application runs on a PC with 8 GB RAM and an i7 2.80 GHz CPU. Training of our model was done on a Tesla K40m GPU and took about 2 days (800 epochs in total).

6 CONCLUSIONS

In this paper, we presented a deep learning based framework that is capable of generating 3D models from 2D stick figure sketches and producing dynamic 3D models between (interpolation) and beyond (extrapolation) two given stick figure sketches. Unlike existing methods, our method requires neither a statistical body shape model nor a rigged 3D character model. We demonstrated that our framework not only gives reasonable results on generation, but also compares favorably with existing approaches. We further supported our framework with a well-designed user interface to make it practical for a variety of applications.

7 LIMITATIONS

The proposed system has several limitations that are listed as follows:

• Training of our network is dependent on the existing 3D shapes in the dataset. Our network cannot learn shapes vastly different from the existing ones: it produces incompatible 3D models for sketch inputs that are not closely represented in the dataset. For example, our network does not correctly capture the arm orientation for the right model in Figure 9.

• The system can only generate human shapes because of the content of the dataset.

• The system is limited to articulated shape changes. Although it can twist and bend the human body in a reasonable way, it cannot, for instance, stretch or resize a body part.

• The system benefits from the one-to-one correspondence information of the dataset. Thus, the quality of the results depends on this information.

Figure 9: Failure cases of our framework. Our framework generates anatomically unnatural results if the input sketch has disproportionate body parts (left) or is significantly different from the ones used during training (right).

• Since our network takes its input in a specific order, our user interface constrains users to sketch in that order. Users cannot sketch stick figures in an arbitrary order.

8 FUTURE WORK

Potential future work directions that align with our proposed system are described as follows. The human stick figures used in this paper can be generalized to any other shape using their skeletons. Consequently, an automated skeleton extraction algorithm would enable further training of our network, which in turn would extend our solution to non-human objects. Voxelization of our input data would spare us from the one-to-one correspondence information requirement, which in turn would enable our interpolation scheme to morph between different object classes that do not often have this type of information, e.g., from cat to giraffe. Automatic one-to-one correspondence computation [22, 34, 36] can also be considered to avoid voxelization. The latent space can be exploited in a better manner to obtain a more sophisticated extrapolation algorithm than the basic one we introduced in this paper. New sketching cues can be designed and incorporated into our network to be able to produce body types different than the one used during training, e.g., training with the fit SCAPE actor and producing an obese actress.

ACKNOWLEDGMENTS

• This work was supported by TUBITAK under the project EEEAG-115E471.

• This work was supported by the European Commission ERA-NET Program, The IMOTION project.

REFERENCES

[1] B. Allen, B. Curless, and Z. Popovic. The space of human body shapes: reconstruction and parameterization from range scans. ACM Transactions on Graphics (TOG), 22(3):587–594, 2003.

[2] F. Anastacio, M. C. Sousa, F. Samavati, and J. A. Jorge. Modeling plant structures using concept sketches. In Proceedings of the 4th International Symposium on Non-Photorealistic Animation and Rendering, pp. 105–113. ACM, 2006.

[3] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: Shape completion and animation of people. ACM Trans. Graph., 24(3):408–416, July 2005. doi: 10.1145/1073204.1073207

[4] P. Besl and N. McKay. Method for registration of 3-D shapes. Sensor Fusion IV: Control Paradigms and Data Structures, 1611, 1992.

[5] M. Bessmeltsev, N. Vining, and A. Sheffer. Gesture3D: posing 3D characters via gesture drawings. ACM Transactions on Graphics (TOG), 35(6):165, 2016.

[6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science. Springer International Publishing, Oct. 2016.

[7] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009.

[8] M. G. Choi, K. Yang, T. Igarashi, J. Mitani, and J. Lee. Retrieval and visualization of human motion data via stick figures. In Computer Graphics Forum, vol. 31, pp. 2057–2065. Wiley Online Library, 2012.

[9] J. Davis, M. Agrawala, E. Chuang, Z. Popovic, and D. Salesin. A sketching interface for articulated figure animation. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 320–328. Eurographics Association, 2003.

[10] J. Delanoy, M. Aubry, P. Isola, A. Efros, and A. Bousseau. 3D sketching using multi-view deep volumetric prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(21), May 2018.

[11] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM Trans. Graph., 31(4):31–1, 2012.

[12] M. Eitz, R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa. Photosketcher: interactive sketch-based image synthesis. IEEE Computer Graphics and Applications, 31(6):56–66, 2011.

[13] H. Fu, Y. Wei, C.-L. Tai, and L. Quan. Sketching hairstyles. In Proceedings of the 4th Eurographics Workshop on Sketch-Based Interfaces and Modeling, pp. 31–36. ACM, 2007.

[14] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. CoRR, abs/1603.08637, 2016.

[15] M. Guay, M.-P. Cani, and R. Ronfard. The line of action: an intuitive interface for expressive character posing. ACM Transactions on Graphics (TOG), 32(6):205, 2013.

[16] F. Hahn, F. Mutzel, S. Coros, B. Thomaszewski, M. Nitti, M. Gross, and R. W. Sumner. Sketch abstractions for character posing. In Proceedings of the 14th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 185–191. ACM, 2015.

[17] X. Han, C. Gao, and Y. Yu. DeepSketch2Face: A deep learning based sketching system for 3D face and caricature modeling. ACM Transactions on Graphics, 36(4):126, 2017.

[18] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel. A statistical model of human pose and body shape. Computer Graphics Forum, 28(2):337–346, 2009.

[19] T. Igarashi, S. Matsuoka, and H. Tanaka. Teddy: a sketching interface for 3D freeform design. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 409–416. ACM Press/Addison-Wesley Publishing Co., 1999.

[20] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[21] O. A. Karpenko and J. F. Hughes. SmoothSketch: 3D free-form shapes from complex sketches. ACM Transactions on Graphics (TOG), 25(3):589–598, 2006.

[22] V. G. Kim, Y. Lipman, and T. Funkhouser. Blended intrinsic maps. ACM Transactions on Graphics (TOG), 30(4):79, 2011.

[23] J. Lee and T. A. Funkhouser. Sketch-based search and composition of 3D models. In SBM, pp. 97–104, 2008.

[24] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, H. Fu, T. Furuya, H. Johan, et al. SHREC'14 track: Extended large scale sketch-based 3D shape retrieval. In Eurographics Workshop on 3D Object Retrieval, vol. 2014, 2014.

[25] C. Li, H. Pan, Y. Liu, X. Tong, A. Sheffer, and W. Wang. BendSketch: modeling freeform surfaces through 2D sketching. ACM Transactions on Graphics, 36(4):125, 2017.

[26] J. Lin, T. Igarashi, J. Mitani, M. Liao, and Y. He. A sketching interface for sitting pose design in the virtual environment. IEEE Transactions on Visualization and Computer Graphics, 18(11):1979–1991, 2012.

[27] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.

[28] Y. Miao, F. Hu, X. Zhang, J. Chen, and R. Pajarola. SymmSketch: Creating symmetric 3D free-form shapes from 2D sketches. Computational Visual Media, 1(1):3–16, 2015.

[29] A. Nealen, T. Igarashi, O. Sorkine, and M. Alexa. FiberMesh: designing freeform surfaces with 3D curves. ACM Transactions on Graphics (TOG), 26(3):41, 2007.

[30] L. Olsen, F. F. Samavati, M. C. Sousa, and J. A. Jorge. Sketch-based modeling: A survey. Computers & Graphics, 33(1):85–103, 2009.

[31] M. Omran, C. Lassner, G. Pons-Moll, P. V. Gehler, and B. Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision, 3DV 2018, Verona, Italy, September 5-8, 2018, pp. 484–494, 2018. doi: 10.1109/3DV.2018.00062

[32] A. C. Oztireli, I. Baran, T. Popa, B. Dalstein, R. W. Sumner, and M. Gross. Differential blending for expressive sketch-based posing. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 155–164. ACM, 2013.

[33] L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele. Building statistical shape spaces for 3D human modeling. Pattern Recognition, 67:276–286, 2017.

[34] Y. Sahillioglu. A genetic isometric shape correspondence algorithm with adaptive sampling. ACM Transactions on Graphics (TOG), 37(5):175, 2018.

[35] Y. Sahillioglu and M. Sezgin. Sketch-based articulated 3D shape retrieval. IEEE Computer Graphics and Applications, 37(6):88–101, 2017.

[36] Y. Sahillioglu and Y. Yemez. Coarse-to-fine isometric shape correspondence by tracking symmetric flips. Computer Graphics Forum, 32(1):177–189, 2013.

[37] S. Seki and T. Igarashi. Sketch-based 3D hair posing by contour drawings. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, p. 29. ACM, 2017.

[38] H. Seo and N. Magnenat-Thalmann. An example-based approach to human body manipulation. Graphical Models, 66(1):1–23, 2004.

[39] T. M. Sezgin, T. Stahovich, and R. Davis. Sketch based interfaces: early processing for sketch understanding. p. 22, 2006.

[40] H. Shin and T. Igarashi. Magic canvas: interactive design of a 3-D scene prototype from freehand sketches. In Proceedings of Graphics Interface 2007, pp. 63–70. ACM, 2007.

[41] A. Shtof, A. Agathos, Y. Gingold, A. Shamir, and D. Cohen-Or. Geosemantic snapping for sketch-based modeling. Computer Graphics Forum, 32(2pt2):245–253, 2013.

[42] D. Smirnov, M. Bessmeltsev, and J. Solomon. Deep sketch-based modeling of man-made shapes. ArXiv, abs/1906.12337, 2019.


[43] S. Tsang, R. Balakrishnan, K. Singh, and A. Ranjan. A suggestive interface for image guided 3D sketching. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 591–598. ACM, 2004.

[44] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1883, 2015.

[45] X. K. Wei, J. Chai, et al. Intuitive interactive human-character posing with millions of example poses. IEEE Computer Graphics and Applications, 31(4):78–88, 2011.

[46] K. Xu, K. Chen, H. Fu, W.-L. Sun, and S.-M. Hu. Sketch2Scene: sketch-based co-retrieval and co-placement of 3D models. ACM Transactions on Graphics (TOG), 32(4):123, 2013.

[47] C. Yang, D. Sharon, and M. van de Panne. Sketch-based modeling of parameterized objects. In SIGGRAPH Sketches, p. 89. Citeseer, 2005.

[48] R. Zeleznik, K. Herndon, and J. Hughes. SKETCH: an interface for sketching 3D scenes. In Proceedings of SIGGRAPH, 1996.