Visual Feedback for Core Training with 3D Human Shape and Pose (xie/publication/2019-NICO-training.pdf)

Visual Feedback for Core Training with 3D Human Shape and Pose

Haoran Xie, Atsushi Watatani and Kazunori Miyata
Japan Advanced Institute of Science and Technology

Ishikawa, Japan

Figure 1: We propose a visual feedback system for core training. The proposed system can guide the user in achieving the target pose from an input image. The system only requires a standard web camera to capture the current pose in 3D and estimates the 3D human shape for further guidance.

Abstract—In this paper, we propose a visual feedback system for core training using a monocular camera image. To support the user in maintaining correct postures from target poses, we adopt 3D human shape estimation for both the target image and the input camera video. Because capturing the human pose with depth cameras or multiple cameras is expensive in conventional approaches, we employ the skinned multi-person linear model of human shape to recover the 3D human pose from 2D images using pose estimation and human mesh recovery methods. We propose a user interface that provides visual guidance based on the estimated target and current human shapes. To clarify the differences between the target and current postures of the 3D models, we adopt markers for visualization at ten body parts with color changes. User studies show that the proposed visual feedback system is effective and convenient for core training.

Keywords—Pose Estimation; User Interface; Core Training; 3D Human Shape

I. INTRODUCTION

In the fields of computer vision and computer graphics, pose estimation from a single image has been explored intensively. In particular, recent advances in deep learning approaches such as OpenPose [1] and DensePose [2] have greatly accelerated research on this topic. Existing training support systems, such as Fujitsu's training system using 3D sensing technology, are too expensive for common users to acquire. Furthermore, the existing systems are developed for professional sports training, and their user interfaces are usually complex and difficult to control. To address these issues, we aim to provide a user interface for visual feedback in sports training using state-of-the-art pose estimation approaches. In contrast to 3D sensing using laser sensors or depth cameras, the proposed system only requires a single image from a standard web camera.

In this paper, we particularly focus on supporting core training, a training style popular among young people. The human core is the central part of the body and plays an important role in body movement and balance. Training of the trunk muscles is practiced and emphasized by professional athletes and has now been accepted by the general public. According to the Worldwide Survey of Fitness Trends, core training has been the most popular training style practiced in the world since 2010 [3].

Core training is recognized as a practical and simple training method that can be completed indoors on an individual basis every day. It is effective for training the body muscles around the trunk. To maximize the training effect, it is very important to maintain correct poses [4]; training in an incorrect pose can lead to injury. However, it is difficult for an individual to achieve the correct pose in training. The motivation for this study is therefore to guide the user from the current pose, estimated from a single camera image, to a target pose, as shown in Figure 1. To this end, we propose a straightforward and simple user interface for an individual performing core training.

In the training user interface, we adopt a skinned multi-person linear (SMPL) model of human shape to show the differences between the target and current poses, as shown in


Figure 2: Proposed user interface for providing visual feedback in core training.

Figure 2. Previous work on training support systems usually utilized a depth sensor or multiple cameras to capture the user's pose [5]–[7]. In contrast to the expensive devices used in previous systems, we estimate the human pose from a normal web camera using deep-learning-based approaches. Additionally, existing training systems mainly used the human skeleton for visual feedback, which makes it difficult to understand the differences between the target and current poses because the human body is a volumetric shape. Therefore, we present an SMPL-based 3D human shape in the proposed user interface to solve this issue.

In this paper, we propose a visual feedback system to support core training for common users in their daily lives. First, we estimate the 3D human shape and pose from a single image, which can also be a frame from a web camera, using OpenPose and the human mesh recovery approach. Afterward, we propose user interfaces for loading target images and providing visual feedback to the user during training. As Figure 2 shows, we employ two viewpoints of the human shapes to make the pose differences easy to understand [8].

The main contributions of this work are as follows.

• We propose a visual feedback system that supports core training for common users using a single web camera.

• We provide a user interface for training guidance on 3D human shapes and poses using deep learning approaches.

We believe that the proposed system can also be applied to other sports in an inexpensive and straightforward way.

II. RELATED WORK

Pose Estimation. Human pose estimation is an important research topic in computer vision, and numerous approaches have been proposed in recent years, particularly deep-learning-based estimation approaches. OpenPose provides real-time 2D pose estimation using a nonparametric representation to learn the body parts in an image dataset [1]. DensePose applies fully convolutional neural networks and region-based models to map a human RGB image to the 3D surface of the human body [2]. The human mesh recovery framework was introduced to construct 3D human meshes from a single image [9] by minimizing the reprojection loss of key points in the image. In this study, we employ these state-of-the-art approaches for training support and provide a user-friendly interface for visual feedback.

Training Support System. Depth sensors have been used in various training support systems. In a training system for abdominal muscle exercises (sit-ups), Kinect was used to simultaneously acquire depth information and color images to detect the user's training behaviors [5]. The target pose is represented using a skeleton image, and the system additionally displays the estimated calorie consumption. A similar support system was proposed for hurdle practice to judge the appropriate motion and correct pose during the training process [6]. A yoga self-training system was proposed to instruct the user to perform correct poses using Kinect [10], and a similar system has been developed for gym fitness using real-time guidance and action assessment [11]. RGB cameras are also used in some training systems. A training system for archery practice was proposed for form improvement using a head-mounted display with two PlayStation Eye cameras [7]. A computer-vision-based yoga training system analyzes the user's pose from both front and side views using two cameras and displays the body contour, skeleton, dominant axes, and feature points [12]. Existing training systems usually require a depth sensor or multiple cameras with careful registration. In this study, we employ a web camera, which is inexpensive and ubiquitous in our daily lives, as the system input.

Training in Virtual Environment. Virtual environments for ball sports require accurate physical models, real-time simulations, and user interfaces built from both software and hardware components [13]. A previous study introduced a table tennis simulation using a projection system and a tracking system [14], and a recent study developed a virtual environment to train rugby ball passing skills [15]. Training systems in virtual environments aim to reproduce a training experience similar to normal real life; however, guidance toward a target pose was absent in these systems. In this study, we aim to help common users carry out sports training under visual guidance.

III. SYSTEM OVERVIEW

Figure 3 presents the framework of the proposed core training system. The framework consists of two main components: pose estimation and visual feedback.

For pose estimation, we employ the OpenPose method to obtain the bounding box of the human region in the input image and the human mesh recovery approach to generate the 3D human shape from the input. With the correction by the


Figure 3: Framework of the proposed core training system. The system input is the user-provided target image and the camera image in each frame. At runtime, the user interface for visual feedback is designed to guide the user during the training process and to prevent injury that could result from incorrect poses.

bounding box after the normalization process, a human mesh based on the SMPL model can be generated from a single input image.

For visual feedback, a runtime user interface is proposed to guide the user on the differences between the target pose and the current pose in the captured frame. Both target and current poses are estimated from a single image using the pose estimation method discussed above.
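The two-stage data flow (OpenPose for the bounding box, human mesh recovery for the mesh) can be sketched as follows. The function names are ours and the two stages are stubbed, since the actual system calls the released OpenPose and HMR implementations; the sketch shows only how the stages connect:

```python
import numpy as np

def detect_bounding_box(frame):
    """Stage 1 (OpenPose stand-in): joint detection -> bounding box (x, y, w, h).
    Stubbed here to return the whole frame."""
    h, w = frame.shape[:2]
    return (0, 0, w, h)

def recover_mesh(crop):
    """Stage 2 (HMR stand-in): regress SMPL pose/shape -> mesh vertices (N, 3).
    Stubbed here to return a zero placeholder mesh of SMPL's 6890 vertices."""
    return np.zeros((6890, 3))

def estimate_3d_shape(frame):
    """Full per-frame pipeline: crop to the person region, then recover the mesh."""
    x, y, w, h = detect_bounding_box(frame)
    crop = frame[y:y + h, x:x + w]  # normalization/crop step (Sec. IV-A)
    return recover_mesh(crop)
```

The same function is applied once to the target image and then to every captured camera frame, so both poses come from identical estimation machinery.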

IV. 3D SHAPE AND POSE ESTIMATION

To obtain the 3D human shape and pose from a single image, the proposed system adopts OpenPose-based pose estimation to obtain the bounding box of the human region and employs human mesh recovery to estimate the 3D human pose. For the generation of the 3D human mesh, an SMPL model is used.

In detail, OpenPose can detect the poses of multiple persons in an input image using a multi-stage CNN. The SMPL model can represent various human postures and shapes [16], defined by triangular meshes M(θ, β) ∈ ℝ^(3×N), where N = 6,890 denotes the number of mesh vertices. In an SMPL model, the human pose θ is defined by a skeleton rig with K = 23 joints and θ ∈ ℝ^(3K+3). Therefore, pose θ has 3K + 3 = 3 × 23 + 3 = 72 parameters, including 3 for each joint and 3 for the root orientation. β ∈ ℝ^10 denotes the shape parameters of the 3D human mesh model in a 10-dimensional shape space obtained using principal component analysis. Figure 4 presents 3D meshes of the SMPL model for different θ and β values.
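The SMPL parameterization above can be summarized in a short sketch. This shows parameter shapes only; producing an actual mesh requires a full SMPL implementation:

```python
import numpy as np

K = 23    # joints in the SMPL skeleton rig
N = 6890  # mesh vertices in the released SMPL template

theta = np.zeros(3 * K + 3)  # pose: 3 axis-angle values per joint + 3 for root orientation
beta = np.zeros(10)          # shape: 10 PCA coefficients of the shape space

# A full SMPL implementation maps (theta, beta) to a mesh M(theta, beta) of shape (N, 3).
```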

Figure 4: SMPL model of human mesh in different poses.

The human mesh recovery (HMR) model is a deep-learning-based approach for reconstructing a 3D human model from a single RGB image. The SMPL human mesh can be generated from the 9-DoF camera pose (translation, rotation, and scaling) together with the SMPL pose θ and shape β parameters, which are evaluated using a 3D regression module. A discriminator network distinguishes natural human shapes from unnatural results, such as implausible joint bending or body shapes that are too thin.

A. Normalization

In this study, we define an image normalization step before conducting pose estimation on the image. Image normalization ensures that the input satisfies the following two conditions, under which the learning dataset of HMR used in pose estimation was unified.

• The diagonal size of the bounding box is around 150 pixels.
• The image size is 224 × 224 pixels.

Because the bounding box of the human region in the input image is necessary for the implementation, we adopt OpenPose to generate the bounding box from the positions of all detected joints. Finally, we apply the HMR model for pose estimation on the normalized images.
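The first condition amounts to a simple rescale of the frame; a minimal sketch, with the function name and the worked example being ours rather than from the paper's code:

```python
import math

def normalization_scale(bbox_w, bbox_h, target_diag=150.0):
    """Scale factor that brings the person bounding box diagonal to ~150 px,
    matching the first normalization condition above."""
    diag = math.hypot(bbox_w, bbox_h)  # diagonal length of the bounding box
    return target_diag / diag

# Hypothetical example: a 300 x 400 px OpenPose bounding box has a 500 px
# diagonal, so the frame is rescaled by 0.3 before cropping to 224 x 224.
s = normalization_scale(300, 400)
```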

V. VISUAL FEEDBACK

In this section, we discuss the user interfaces and the training guidance of the visual feedback used for core training.

A. User Interfaces

In the proposed system, the user can predefine easy-to-see viewpoints on the pre-training user interface, as shown in Figure 5. The user first selects a target image with a standard pose to be used for core training. Afterward, the system generates a 3D human model in two view windows using the proposed pose estimation approach. For each view window, the user can select a good viewpoint in his/her


judgment. Finally, the system saves the predefined set of viewpoints.

Figure 5: User interface for target shape generation.

Figure 6 shows the user interface of the system for performing core training. The system inputs, the target image and the camera image, are located in the middle part of the interface. In the two view windows, 3D human models are generated from the images using pose estimation and rendered in the predefined viewpoints. The 3D human model in blue denotes the target 3D pose, while the 3D human model in red represents the current pose from the camera image. In the user interface, the current pose model and the target pose model are displayed in a superimposed manner. Markers of different colors indicate the differences between the current and target models for training guidance. In our implementation, we used red, orange, yellow, and green-yellow to represent distances ranging from far to close.

Figure 6: User interface for core training.

B. Training Guidance

Herein, we present details of the training guidance implementation for the proposed training user interface. To display the target and current 3D human models in a superimposed manner, we render the two models simultaneously using OpenGL. In the implementation, we simply align the target and current models by making the origins of the two 3D human models coincide at the waist position. Although a more accurate alignment algorithm such as the iterative closest point method may be helpful, we chose simplicity as a trade-off between accuracy and computation cost.
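The waist-based alignment is just a translation of each mesh; a minimal sketch, assuming each model exposes its waist position:

```python
import numpy as np

def align_at_waist(vertices, waist):
    """Translate mesh vertices so the waist point sits at the origin; applying
    this to both models makes their waist positions coincide."""
    return vertices - waist
```

Both the target and current meshes are shifted this way before being rendered superimposed, which is why differences at the hips remain small by construction.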

To represent the differences between the target and current 3D models, we adopt color visualization at 10 predefined marker positions on the human body. We set the marker positions at both hands, elbows, shoulders, knees, and ankles because the differences between the two models are more apparent at the end parts than at the trunk. We did not choose the hip joints because the differences at the hip points are not apparent owing to the coincident waist positions. We used the shoulder, elbow, and hand positions for a good representation of the upper-body posture, which is the most important part in core training.

To obtain the marker positions on the SMPL-based human model, we superimpose the aforementioned joint positions on the SMPL mesh model to determine the start and end points of each marker, as shown in Figure 7.

Figure 7: Start and end points in the image coordinate.

Figure 8: Mesh vertices for each marker position.

To obtain the mesh vertices of the markers, we substitute the transformation matrix of the perspective projection from the 3D model to the output image. The relation in the perspective projection can be expressed as follows:

x_image = 500 × X_model / (Z_model + 2) + 332.5
y_image = 500 × Y_model / (Z_model + 2) + 325    (1)


Figure 9: Core training tasks used in our user study with target pose images (first row), skeleton information (second row), and our proposed 3D human shapes in two viewpoints (third row).

where 500 denotes the focal length, and 332.5 and 325 denote the optical center in the x and y directions, respectively. (x_image, y_image) denotes the pixel position in the output image, and (X_model, Y_model, Z_model) represents the 3D vertex position on the human model. We search the vertex positions for all markers using the following constraints.

X_model > (x_s − 332.5) / 500 × (Z_model + 2)    (2a)
X_model < (x_e − 332.5) / 500 × (Z_model + 2)    (2b)
Y_model > (y_s − 325) / 500 × (Z_model + 2)      (2c)
Y_model < (y_e − 325) / 500 × (Z_model + 2)      (2d)

where (x_s, y_s) denotes the start point position and (x_e, y_e) denotes the end point position in the image coordinates, as shown in Figure 7. Figure 8 shows the mesh vertex positions of all markers on the SMPL human mesh used in the implementation.
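Equations (1) and (2a)-(2d) together select the mesh vertices whose projections fall inside a marker's start/end rectangle. Because the constraints are the algebraic inverse of the projection, an equivalent sketch can project first and compare in image coordinates (function names are ours):

```python
import numpy as np

F, CX, CY = 500.0, 332.5, 325.0  # focal length and optical center from Eq. (1)

def project(vertices):
    """Perspective projection of model vertices (N, 3) to image pixels, Eq. (1)."""
    x = F * vertices[:, 0] / (vertices[:, 2] + 2) + CX
    y = F * vertices[:, 1] / (vertices[:, 2] + 2) + CY
    return x, y

def marker_vertex_indices(vertices, xs, ys, xe, ye):
    """Indices of vertices whose projection lies inside the marker rectangle,
    i.e. the vertices satisfying constraints (2a)-(2d)."""
    x, y = project(vertices)
    return np.where((x > xs) & (x < xe) & (y > ys) & (y < ye))[0]
```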

Finally, we calculate the difference d_e between corresponding vertex positions and visualize it with four colors. The color of a marker point is set to red if d_e ∈ [0.5, ∞); orange if d_e ∈ [0.25, 0.5); yellow if d_e ∈ [0.1, 0.25); and green-yellow if d_e ∈ [0, 0.1).
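The four distance bands map directly to a lookup; a minimal sketch of the marker coloring:

```python
def marker_color(d_e):
    """Map the vertex-position difference d_e of a marker to its guidance color:
    red [0.5, inf), orange [0.25, 0.5), yellow [0.1, 0.25), green-yellow [0, 0.1)."""
    if d_e >= 0.5:
        return "red"
    if d_e >= 0.25:
        return "orange"
    if d_e >= 0.1:
        return "yellow"
    return "green-yellow"
```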

VI. USER STUDY

To verify the effectiveness and feasibility of the proposed visual feedback system, we compared it with the conventional visual feedback of skeleton information, as shown in Figure 10. Skeleton guidance is usually employed in training support systems with a depth sensor such as Kinect [5], [6], [10], [11]. Because of the camera view of a depth sensor, only one viewpoint is available in skeleton guidance, whereas the proposed guidance approach provides two viewpoints, benefiting from the use of the 3D human shape.

To avoid the effects of different hardware, we used the proposed pose estimation method to obtain the skeleton information in the user study. In the skeleton guidance, we adopted the 18 joints of the COCO dataset used in OpenPose [17]: ears, eyes, nose, neck, shoulders, elbows, hands, hips, knees, and ankles.

A. Training Tasks

We had eight participants in the user study, all of whom were male graduate students around 25 years old. The participants were asked to perform core training with the two visual guidance methods. In the user study, we selected four core training tasks with different target poses, including a lateral pose, a supine pose, and two standing poses, as common training tasks [18]. Figure 9 shows the target pose image


Figure 10: Two types of training guidance used in the user study: skeleton guidance (1) and the proposed guidance approach (2).

in 2D, the skeleton guidance obtained using the proposed pose estimation method, and the 3D human shapes and poses in two viewpoints obtained using the proposed visual feedback system. Note that the arms of the estimated human shapes happen to be curved owing to the limitations of the state-of-the-art shape estimation algorithm.

We asked the participants to complete the core training tasks four times with different combinations of training guidances and tasks. The task combinations are presented in Figure 11; they correspond to the two training guidances shown in Figure 10 and the four assigned core training tasks shown in Figure 9.

Figure 11: The training guidance and tasks for each participant across the four sessions. For instance, in the third session (2-4), participant C was asked to complete core training task 4 (Figure 9) under guide 2, the proposed guidance (Figure 10).

B. Training Evaluation

To quantitatively evaluate the two training guidances, we define two metrics: training accuracy and training time. For training accuracy, we adopt the root mean square error (RMSE) to evaluate the difference between the target and current poses. The RMSE_i at time t_i is formulated as follows:

RMSE_i = √( (1/N) Σ_{k=1}^{N} ‖E_k(t_i) − M_k(t_i)‖² )    (3)

where N = 5,628 denotes the number of model mesh vertices excluding the head part. Because the user performs pose correction while looking at the display screen, we excluded the mesh vertices of the head from the RMSE calculation owing to the possibility of inappropriate head movement. E_k and M_k are the k-th vertex positions on the current and target human models at time t_i. We define the training accuracy R as follows:

R = (RMSE_0 − RMSE_min) / RMSE_0 × 100    (4)

where RMSE_0 denotes the initial value of RMSE and RMSE_min denotes the minimum value of RMSE, reached at time t_min. Herein, we use t_min as the training time when comparing the training guidances.
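Equations (3) and (4) can be computed directly from the vertex arrays; a minimal sketch, assuming the head vertices have already been excluded:

```python
import numpy as np

def rmse(current, target):
    """Eq. (3): RMSE over the N mesh vertices, with E_k and M_k the k-th
    vertex positions on the current and target models (arrays of shape (N, 3))."""
    sq_dist = np.sum((current - target) ** 2, axis=1)  # squared distance per vertex
    return np.sqrt(sq_dist.mean())

def training_accuracy(rmse_over_time):
    """Eq. (4): improvement of the best RMSE over the initial RMSE, in percent."""
    r0 = rmse_over_time[0]
    rmin = min(rmse_over_time)
    return (r0 - rmin) / r0 * 100.0
```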

C. Questionnaire

The participants were asked to spend around 25 seconds on each training task and then answer a questionnaire based on their usage experience. The questions were designed to assess the proposed visual feedback system from the aspects of different viewpoints, the use of the 3D human shape, marker-based guidance, and system functions.

Figure 12: Questions listed in the questionnaire presented to the users after the user study.

VII. RESULTS

Figure 13 shows a user performing core training with the proposed visual feedback from the training user interface. The results of both the objective and subjective evaluations are discussed below.


Figure 13: Core training using the proposed visual feedback system.

A. Objective Evaluation

From the analysis of the results obtained from the user study, we discovered that training tasks 3 and 4 can easily be achieved without large differences between the current and target poses. The average training accuracy R is 33.88 with skeleton guidance and 40.18 with the proposed guidance. Figure 14 shows boxplots for both guidances. These results show that the proposed system is more effective in correcting postures when performing core training.

Figure 14: Comparison of training accuracy between skeleton guidance (blue) and the proposed guidance (orange).

Figure 15 shows a comparison of valid training time between skeleton guidance and the proposed system. This result shows that the proposed system requires less time to achieve the correct posture.

B. Subjective Evaluation

Figure 16 shows the results of the questionnaires from our user study. All the participants agreed that the proposed

Figure 15: Comparison of training time between skeleton guidance (blue) and the proposed guidance (orange).

guidance using the 3D human model is easier for understanding the target posture (Question 2). Most of the participants recognized that it became easier to understand the target posture with changing viewpoints (Question 4, average score 4.38) and the 3D human shape (Question 5, average score 4.75).

The color-changing markers were considered useful for performing core training. However, only one participant noticed the changes in the markers during training, possibly owing to the small marker sizes. We observed that the participants preferred checking the whole human mesh for pose correction once they became familiar with the proposed system. We think that color change in the visual feedback on the whole body mesh may better support the training process. Additionally, the estimated human model somewhat mismatched the users' perception (Question 8, average score 3.88) because of the limitations of the proposed pose estimation method. This issue could be solved by allowing the user to correct the generated human model interactively in the pre-training user interface (Figure 5).

VIII. CONCLUSION

In this work, we proposed a visual feedback system for performing core training using a standard web camera. In the pose estimation step, we applied deep-learning-based approaches to 3D human shape and pose estimation by combining the OpenPose and human mesh recovery methods. During the runtime process, we estimated the 3D human models from the target image and the current camera frame using the proposed pose estimation strategy. Finally, we proposed a user interface that supports core training by guiding the user on the differences at marker positions on the human model. We evaluated the performance of the proposed system through a user study, conducted


Figure 16: Questionnaire results. For questions 1 and 7, ○ denotes YES and × denotes NO; for question 2, 1 denotes the 2D skeleton and 2 denotes the 3D model.

by comparing its performance with that of the conventionalskeleton-based training guidance method.

A limitation of this study is that the current execution time of the proposed visual feedback system is about 2 seconds per frame. Although this may be acceptable for core training, because the user has to maintain a pose over a long period, it would be a bottleneck for other sports such as gymnastics and ball sports, which involve high-speed movement. We believe this issue can be mitigated using parallel computing on high-specification GPUs. The visual feedback for high-speed sports could include not only the static pose but also high-level guidance on velocity or momentum. Additionally, the standard SMPL model was used in this study, so the actual human body cannot be captured in real time; it would be interesting to adapt the training guidance to different users considering their body types. In the current implementation, the body shape alignment was based on the root point; aligning at the foot joints would also be helpful.

In future work, we plan to use a head-mounted display so that the user can have good viewpoints in a virtual environment. We also plan to apply the proposed visual feedback system to other training and sports in the near future.

REFERENCES

[1] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh: OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, arXiv preprint arXiv:1812.08008, (2018).

[2] R. A. Güler, N. Neverova, I. Kokkinos: DensePose: Dense Human Pose Estimation in the Wild, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

[3] W. R. Thompson: Worldwide Survey of Fitness Trends for 2018: The CREP Edition, ACSM's Health and Fitness Journal, 21(6), pp. 10-19, (2017).

[4] J. Stephenson, A. M. Swank: Core Training: Designing a Program for Anyone, Strength and Conditioning Journal, 26(6), pp. 34-37, (2004).

[5] D. Takaku, K. Nakajima: Strength Support Training System by Kinect, 77th Annual IPSJ Conference Proceedings, Vol. 2015, No. 1, pp. 427-428, (2015).

[6] M. Okamoto, T. Isomura, Y. Matsubara: Real-time Training Support Environment using Posture Estimation Method, The 30th Annual Conference of the Japanese Society for Artificial Intelligence, (2016).

[7] K. Matsumura, T. Koike: Form Improvement System by Third Person Views on Head Mounted Display, 78th Annual IPSJ Conference Proceedings, Vol. 2016, No. 4, pp. 357-358, (2016).

[8] K. Tarukawa, T. Inoue, K. Okada: Collaborative Workspace Using Multiple Perspectives for Supporting Soccer Strategy Meetings, Transactions of Information Processing Society of Japan, Vol. 1, No. 1, pp. 19-26, (2013).

[9] A. Kanazawa, M. J. Black, D. W. Jacobs, J. Malik: End-to-end Recovery of Human Shape and Pose, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

[10] H. T. Chen, Y. Z. He, C. L. Chou, S. Y. Lee, B. S. P. Lin, J. Y. Yu: Computer-assisted Self-training System for Sports Exercise using Kinects, IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1-4, (2013).

[11] X. Jin, Y. Yao, Q. Jiang, X. Huang, J. Zhang, X. Zhang, K. Zhang: Virtual Personal Trainer via the Kinect Sensor, IEEE 16th International Conference on Communication Technology (ICCT), pp. 460-463, (2015).

[12] H. T. Chen, Y. Z. He, C. C. Hsu: Computer-assisted Yoga Training System, Multimedia Tools and Applications, Vol. 77, No. 18, pp. 23969-23991, (2018).

[13] H. C. Miles, S. R. Pop, S. J. Watt, G. P. Lawrence, N. W. John: A Review of Virtual Environments for Training in Ball Sports, Computers and Graphics, Vol. 36, No. 6, pp. 714-726, (2012).

[14] G. Brunnett, S. Rusdorf, M. Lorenz: V-Pong: An Immersive Table Tennis Simulation, IEEE Computer Graphics and Applications, Vol. 26, No. 4, pp. 10-13, (2006).

[15] H. C. Miles, S. R. Pop, S. J. Watt, G. P. Lawrence, N. W. John, V. Perrot, P. Mallet, D. R. Mestre: Investigation of a Virtual Environment for Rugby Skills Training, 2013 International Conference on Cyberworlds, pp. 56-63, (2013).

[16] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, M. J. Black: SMPL: A Skinned Multi-Person Linear Model, ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6), pp. 248:1-248:16, (2015).

[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick: Microsoft COCO: Common Objects in Context, European Conference on Computer Vision (ECCV), pp. 740-755, (2014).

[18] K. Kiba: Core Training to Improve Core Strength, Seibido Shuppan, (2011).