
Use of Kinect in a Multicamera setup for action recognition applications

Omar Kayal
Department of Electrical and Computer Engineering
University of Western Ontario
London, Ontario N6A 3K7
Email: [email protected]

Jagath Samarabandu
Department of Electrical and Computer Engineering
University of Western Ontario
London, Ontario N6A 3K7
Email: [email protected]

Abstract—Conventional human action recognition methods use a single light camera to extract all the information needed to perform the recognition. However, the use of a single light camera poses limitations that cannot be addressed without a hardware change. In this paper, we propose a novel hardware setup that helps overcome many of these limitations without changing the available human action recognition algorithms. The setup uses the depth information from the Microsoft Kinect camera for the Xbox 360 gaming console together with a secondary light camera in a stereo camera configuration. We explain how the configuration is set up and how it is used to extract skeletal data from the Kinect camera and project this skeletal data onto the secondary camera's image. The accuracy of the extraction is tested, and potential applications are discussed.

I. INTRODUCTION

Human action recognition (HAR) in computer vision is in increasingly high demand due to its vast application base, ranging from surveillance, health care, sports, virtual reality and games to much more, and the need for reliable, markerless systems has never been so critical. There are many approaches to the problem, with new algorithms, new hardware and all sorts of tools under development. However, developing a method that can cope with a broad range of actions, especially in complex scenarios, is still a challenge. Recent advances in computer vision and pattern recognition have made it possible to recognize more complex actions, making for more reliable systems.

In recent years, many publications have tried to address different scenarios for action recognition by implementing different types of feature extraction. Cheema et al. [1] demonstrated the use of binary silhouettes to extract key poses for action learning and recognition. Dalal and Triggs [2] use histograms of oriented gradients as a feature set for human detection. Gehrig et al. [3] demonstrate the use of optical flow motion gradient histograms. Yilmaz and Shah [4] build 3D volumes to describe actions by exploiting contour-point tracking of people. This related work, however, only addresses single camera scenarios. Many actions are hard to recognize from single views, and such systems are generally not capable of handling occlusions. The shape and motion information representing a specific action can also vary greatly, since this information is highly dependent on viewpoint [5].

To address this issue, different solutions have been proposed. One popular approach uses multiple cameras, each with a slightly different view of the same scene. In [6], Ayazoglu et al. developed a method to track objects (humans and other objects) even in the presence of occlusions by utilizing a multicamera setup. In the event that a tracked object gets heavily occluded in one camera, the system keeps track of the object by using the other camera in which the object is still visible. Furthermore, Ayazoglu et al. improve occlusion handling by exploiting geometric and dynamic constraints between the two cameras: the trajectory of the target, captured from different viewpoints, can be predicted from the tracking data. A similar approach has been demonstrated by Calderara et al. in [7]. Instead of tracking an object as a whole, Calderara et al. track a number of automatically segmented relevant areas of the human silhouette, which describe the motion for action recognition. Their approach employs a multicamera setup together with a Mixture of Gaussians for segmentation.

A multicamera setup aims at improving results in the action recognition field, especially where occlusions are present. However, such a setup lacks any spatial data and relies on 2D data only. Several authors have therefore used stereo cameras to acquire depth data. In [8] and [9], Harville and Li and Darrell et al., respectively, used a stereo camera to extract depth information in an attempt to better track multiple people in a frame. Both papers use the depth information to better segment the foreground from the background and to extract silhouettes to work with. Although Cheung and Woo [10] have successfully dealt with occlusions in a stereo camera setup, stereo setups are generally not as robust to occlusions as a multi-view camera setup. Extracting 3D data is possible in a multi-view setup, but the computational cost is much higher than with a stereo camera setup and might not be practical in real-world applications.

The Microsoft Kinect camera (Kinect will be used interchangeably with Microsoft Kinect throughout this paper) is a Microsoft product originally built for the purpose of in-game gesture recognition at an affordable price. However, developers


quickly realized its potential and its capabilities, which can be matched to other state-of-the-art depth sensing cameras but with a much lower price tag. According to [11], the Kinect achieves comparable results to continuous-wave amplitude Time of Flight (TOF) cameras when extracting depth information. Many publications have already been made under the computer vision banner. In [12], Xia et al. develop a method to detect a human in different poses using depth information from the Kinect. In [13], Xia et al. use the method of Shotton et al. [14] to extract the 3D skeletal joint locations from the Kinect's depth image, and use histograms of 3D joint locations as features for their action recognition algorithm. The complementary nature of the Kinect's depth and color data opens up many opportunities to solve problems in computer vision [15].

With Kinect’s skeletal tracking through Shotton et al.’smethod [14], developers can focused on the application andleave the pose recognition hard work to the Kinect. The Kinectshowed promising results in health care applications. In a con-trolled body pose, Kinect’s joint estimation is comparable tomarker based motion capture, making it a low cost alternativeto similar rehabilitation based equipment. In [16], Huang etal. attempted to use the Kinect in their Kinerehab system withpromising results. Roy et al. [17] also attempted a similarapproach using kinect in a very low cost system with goodresults.

The Kinect, however, is still a single device, and its ability to handle occlusions is limited. Obdrzalek et al. [18] test the reliability and accuracy of the Kinect's human pose estimation based on [14], comparing it to more established pose estimation techniques used in motion capture. The depth accuracy of the Kinect depth sensor ranges from 1 to 4 cm at a range of 1 to 4 m. Their results show that the Kinect pose estimation fails in the presence of occlusions; even self-occlusion by other limbs or facing away from the camera can result in inaccurately inferred joints. However, when the subject fully faces the camera, the Kinect achieves results comparable to motion capture. Since the Kinect was originally built for the Xbox 360 gaming console, it was assumed that the user would constantly be facing the camera.

In an effort to address some of these drawbacks, we propose a novel approach that utilizes the strengths of the Kinect in a multicamera setup. By using the multi-view capabilities of a multicamera setup, the system is able to deal with occlusions better, while depth and skeletal data can be extracted through the Kinect. In addition, we propose a way to project the skeletal data from the Kinect onto the secondary camera in the setup. This provides additional capabilities which can be utilized in different scenarios. The use of a dual Kinect setup was considered; however, based on [19], the infrared dot projections used by the Kinect's IR emitter for depth measurement can interfere with the second Kinect's IR emitter, leading to undesired results.

The remainder of the paper is organized as follows: Section II gives an overview of the setup, and Section III covers how stereo calibration is implemented. Section IV demonstrates the skeletal projection from the Kinect to the secondary camera. Section V shows how to fix projection errors due to inaccurate stereo calibration. Section VI presents experimentation and results, Section VII discusses future work, and Section VIII concludes the paper.

II. OVERVIEW

In our approach, the Kinect is the base camera and the other camera is the secondary camera. The base camera acts as the center of origin and as the main central reference point. Once the cameras are set up in suitable positions, their relative distance and Euler angle orientation have to be found. This information is used to project the skeletal information extracted from the Kinect's depth data using the method in [14]. Each camera's view is set up to partially overlap the other camera's view, at a much steeper horizontal angle, as shown in figure 1. This provides a much wider combined view than could be achieved with a single camera.

Fig. 1: Stereo Camera Setup

The distance and Euler angle data can be extracted by calibrating both cameras to find the intrinsic and extrinsic camera parameters. Upon a successful calibration, each camera's geometric data with respect to world coordinates is established. This data is then used to find the relative distance and orientation of one camera with respect to the other.

Once the relative geometric data is found, it is used to project the extracted skeletal data onto the other camera's image plane using geometric transformation techniques. To improve results, we implemented a feedback system that checks for projection errors on the image plane of the secondary camera by matching the human silhouette extracted through background subtraction against the skeletal data projected from the Kinect. Should the match fail, the silhouette is segmented into five parts (head, body, right hand, left hand and lower body), and this information is used to incrementally translate and rotate the projection until the skeletal and silhouette data match. Upon a successful match, the new relative distance data is updated and the setup is ready for use. Figure 2 summarizes the process flow.

2

Page 3: [IEEE 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE) - Toronto, ON, Canada (2014.5.4-2014.5.7)] 2014 IEEE 27th Canadian Conference on Electrical

1 2 3 4 5 6 7 8 91011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556576061

Fig. 2: Process layout

III. STEREO CALIBRATION

In order to achieve a proper skeletal projection, the relative 3D distance and Euler orientation between the two cameras have to be known. This data can be extracted using calibration methods.

A. Calibration

Camera calibration is an important part of computer vision and is used extensively in many applications, especially for distortion correction. The main purpose of calibration is usually to estimate the important parameters of a pin-hole camera model. Calibration yields the intrinsic and extrinsic parameters of a camera, where the intrinsic parameters are camera properties and the extrinsic parameters are the translation and rotation transformation between the camera's local axes and the world axes. Equations (1) and (2) show the camera matrix for the intrinsic parameters and the rotation-translation matrix for the extrinsic parameters, respectively. Both sets of parameters are estimated by relating a chessboard pattern of known dimensions to its captured 2D image. In equation (1), $c_x$ and $c_y$ are the principal point coordinates, usually the image center, while $f_x$ and $f_y$ are the camera's focal lengths. In equation (2), $R$ is the rotation matrix, while the $T$ part of the matrix contains the x, y and z components of the translation vector to world coordinates.

$$A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

$$[R|T] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \qquad (2)$$

In our approach, we use OpenCV (Intel's open source computer vision library). OpenCV implements its calibration based on Zhang's method [20]. Upon a successful calibration, the resulting camera matrices from both cameras are used in the stereo calibration process to extract the relative 3D data between the two cameras.
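For reference, the following is a minimal sketch of this single-camera chessboard calibration step using OpenCV's Python bindings. The board dimensions, square size and image file pattern are illustrative assumptions, not values from the paper.

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners per row/column (assumed board)
SQUARE_SIZE = 0.025       # square size in metres (assumed)

# 3D coordinates of the corners in the board's own frame (Z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for fname in glob.glob("calib_images/*.png"):   # illustrative file pattern
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# A is the intrinsic matrix of equation (1); rvecs/tvecs are the extrinsics
# of equation (2) for each pattern view.
rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", rms)
```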

B. Stereo Calibration

By calibrating the two cameras at the same time using the same chessboard pattern, the geometric relation of each camera is expressed relative to the same world axes. From this, the relative geometric data can be extracted. OpenCV implements this stereo calibration using [21] by Hirschmuller. The method uses the predetermined camera matrix parameters extracted from the individual calibration of each camera, and outputs the rotation-translation matrix, in addition to other parameters which are used in our approach.
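A hedged sketch of this step with OpenCV's generic stereo calibration routine is shown below; the intrinsics, distortion vectors and per-view corner lists are assumed to come from the individual calibrations above, and the image size and flags are illustrative.

```python
import cv2

image_size = (640, 480)           # resolution of the calibration images (assumed)
flags = cv2.CALIB_FIX_INTRINSIC   # keep the precomputed intrinsics fixed
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-5)

rms, A_kinect, d_kinect, A_cam, d_cam, R, T, E, F = cv2.stereoCalibrate(
    obj_points, img_points_kinect, img_points_cam,
    A_kinect, d_kinect, A_cam, d_cam,
    image_size, criteria=criteria, flags=flags)

# R (3x3) and T (3x1) form the rotation-translation matrix of equation (2):
# the pose of the secondary camera relative to the Kinect.
print("R =\n", R)
print("T =", T.ravel())
```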

IV. SKELETAL PROJECTION

In this section, the method for projecting the skeletal data extracted from the Kinect camera is explained. A successful calibration produces the camera matrix and the rotation-translation matrix, as shown in equations (1) and (2). For the purpose of skeletal projection, only the rotation-translation matrix is required.

A. Euler angle

The stereo calibration yields the rotation matrix of the secondary camera's orientation with respect to the Kinect. The rotation matrix is a generalized matrix of the form $R = R_z(\phi)R_y(\theta)R_x(\psi)$, composed of three Euler rotation matrices. Slabaugh [22] provides an extensive approach to calculating the Euler angles from the rotation matrix. The Kinect's default geometric axes are set up with the Y axis pointing upward, the X axis to the left of the Kinect and the Z axis pointing in front of the Kinect, as shown in figure 3a. To preserve relativity, the same axis system is used on the secondary camera. The three resulting Euler angles are shown in figure 3a. In a typical setup, both cameras are placed on flat surfaces, so a camera might be tilted vertically with respect to the ground (around the X axis) or horizontally with respect to the Kinect (around the Y axis). However, it is unlikely to be tilted to the left or right around the Z axis. For that reason, only $R_x$ and $R_y$ are used in the projection process.
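The decomposition of $R = R_z(\phi)R_y(\theta)R_x(\psi)$ can be sketched as follows, following the formulas described in [22]; the function name is ours.

```python
import numpy as np

def euler_from_rotation(R):
    """Decompose R = Rz(phi) * Ry(theta) * Rx(psi) into Euler angles.
    Returns (psi, theta, phi) in radians; only psi (about X) and
    theta (about Y) are used in the projection step."""
    if abs(R[2, 0]) < 1.0:                     # non-degenerate case
        theta = -np.arcsin(R[2, 0])            # R[2,0] = -sin(theta)
        cos_t = np.cos(theta)
        psi = np.arctan2(R[2, 1] / cos_t, R[2, 2] / cos_t)
        phi = np.arctan2(R[1, 0] / cos_t, R[0, 0] / cos_t)
    else:                                      # gimbal lock: theta = +/- 90 deg
        phi = 0.0                              # phi can be chosen freely
        if R[2, 0] <= -1.0:
            theta = np.pi / 2
            psi = phi + np.arctan2(R[0, 1], R[0, 2])
        else:
            theta = -np.pi / 2
            psi = -phi + np.arctan2(-R[0, 1], -R[0, 2])
    return psi, theta, phi
```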

B. Projection

By using [14], the Kinect is capable of extracting 20 skeletal points in 3D space with respect to the Kinect local axes defined in figure 3a, as shown in figure 3b. Each of these points has to be projected so as to match the secondary camera's image plane. The process is broken down as follows:

1) Raw skeletal data extraction: In this step we calculate the three components of each joint point in 3D space with respect to the Kinect axes from the depth data using [14].

3

Page 4: [IEEE 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE) - Toronto, ON, Canada (2014.5.4-2014.5.7)] 2014 IEEE 27th Canadian Conference on Electrical

1 2 3 4 5 6 7 8 91011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556576061

Fig. 3: Kinect [23]. (a) Kinect axis and Euler orientation. (b) Kinect skeletal data.

2) Projection: The projection is a simple matter of axis transformation. The aim of the projection is to view the skeletal data from the secondary camera's perspective. To achieve this, each skeletal 3D point is first translated using the $T$ vector from the rotation-translation matrix (2). The result is then rotated horizontally around the Y axis using the Euler matrix $R_y$ with angle $\theta$, and lastly rotated vertically around the X axis using the Euler matrix $R_x$ with angle $\psi$. The transformations are shown in equations (3) and (4), where $P_x(N)$, $P_y(N)$ and $P_z(N)$ are the Nth intermediate projected points of the skeletal structure, $S_x(N)$, $S_y(N)$ and $S_z(N)$ are the Nth skeletal points from the Kinect, $T_x$, $T_y$ and $T_z$ are the components of the translation vector, and $P'$ is the final rotated point.

$$\begin{bmatrix} P_x(N) \\ P_y(N) \\ P_z(N) \end{bmatrix} = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \left( \begin{bmatrix} S_x(N) \\ S_y(N) \\ S_z(N) \end{bmatrix} - \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix} \right) \qquad (3)$$

$$\begin{bmatrix} P'_x(N) \\ P'_y(N) \\ P'_z(N) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{bmatrix} \begin{bmatrix} P_x(N) \\ P_y(N) \\ P_z(N) \end{bmatrix} \qquad (4)$$

3) 3D to 2D projection: The last step is to project the 3D points onto the 2D image plane. The points are projected using perspective projection and converted to pixel coordinates for viewing on the secondary camera's image plane.
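Putting equations (3), (4) and the perspective projection together, a possible implementation looks like the sketch below. The helper name, the use of the secondary camera's intrinsic matrix $A$ for the pixel conversion, and the axis sign conventions are assumptions rather than details given in the paper.

```python
import numpy as np

def project_skeleton(joints_3d, T, theta, psi, A):
    """Project Kinect skeletal joints (N x 3, metres, Kinect axes) onto the
    secondary camera's image plane using equations (3) and (4).
    T, theta, psi come from the stereo calibration and Euler decomposition;
    A is the secondary camera's intrinsic matrix (equation (1))."""
    Ry = np.array([[ np.cos(theta), 0, np.sin(theta)],
                   [ 0,             1, 0            ],
                   [-np.sin(theta), 0, np.cos(theta)]])
    Rx = np.array([[1, 0,            0           ],
                   [0, np.cos(psi), -np.sin(psi)],
                   [0, np.sin(psi),  np.cos(psi)]])

    # Equation (3): translate by T, then rotate about Y; equation (4): rotate about X.
    P = Ry @ (joints_3d - T.reshape(1, 3)).T
    P = Rx @ P                                   # 3 x N, secondary camera frame

    # Step 3: perspective projection to pixel coordinates.  Depending on the
    # physical setup, the Kinect axes (X left, Y up) may need sign flips to
    # match image axes (x right, y down).
    uvw = A @ P
    return (uvw[:2] / uvw[2]).T                  # N x 2 array of (u, v) pixels
```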

V. SILHOUETTE BASED CALIBRATION REFINEMENT

The stereo calibration process is designed for cameras in a stereo configuration, or at least in close proximity. When the cameras are not close, the calibration may be off, resulting in an inaccurate skeletal projection. To correct this, we propose a feedback system that incrementally corrects the translation and rotation offsets by comparing the segmented area centers of the extracted silhouette with the locations of the projected skeletal data.

Upon an unsuccessful match during the checking process, the system asks the user to stand upright facing the secondary camera with both arms visible and lifted to shoulder level, as shown in figure 4a. The legs do not have to be fully visible. The pose is captured and a silhouette is extracted for processing.

A. Process

1) Contour extraction: The first step is to find the contour of the background-subtracted binary silhouette of the human body. The background subtraction is performed using the Mixture of Gaussians method to find the silhouette. The contour of the silhouette is then extracted using weighted active snake contours, which wrap around the silhouette's exterior, defining the required contour. A benefit of using active snakes is that they are reasonably robust against distortions and noise produced by the background subtraction method.
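A rough sketch of this step is given below, assuming OpenCV's MOG2 background subtractor (OpenCV 4 API); cv2.findContours is used here as a simplified stand-in for the weighted active-snake contour described above.

```python
import cv2
import numpy as np

# Mixture-of-Gaussians background model (parameters are illustrative).
bg_model = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def extract_silhouette_contour(frame):
    """Return the foreground mask and the largest exterior contour (N x 2)."""
    mask = bg_model.apply(frame)
    # Clean small noise from the foreground mask.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    if not contours:
        return mask, None
    # Keep the largest contour, assumed to be the person.
    body = max(contours, key=cv2.contourArea)
    return mask, body.reshape(-1, 2)
```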

2) Silhouette center point: The center point $S_{center}$ of the binary silhouette needs to be calculated, by finding the center of mass of the silhouette, for use in the contour distance map. The point $S_{center}$ is then moved up by half the distance between $S_{center}$ and the topmost point of the head, with contour point value $C[1]$, using (5). The new point $S'_{center}$ ensures proper point identification; the required points are shown in figure 4b.

$$S'_{Y center} = S_{Y center} + \frac{C[1]_y - S_{Y center}}{2} \qquad (5)$$

3) Distance map measurement and graphing: The distance map, containing the measured distance between each contour point and the point $S'_{center}$, is calculated. The distance values are then plotted against the contour order number to obtain a result such as the one in figure 4c. The more contour points used, the smoother the resulting graph. The graph is then smoothed with a Gaussian filter to remove unwanted noise.

4) Body segmentation: By finding the local minima and maxima on the graph, we can identify the points shown in figures 4b and 4c. These points can then be linked in the image, as shown in figure 4b, to identify the five body sections. The raised center point $S'_{center}$ ensures that the local minima fall on the required points. Had $S_{center}$ been used, the local minima would have represented unwanted points, such as the sides of the abdomen.

5) Distance correction: The distance $D_c(x, y)$ between the point $S_{center}$ from the camera and the projected HIP_CENTER joint (figure 3b) from the Kinect is measured. The value of $D_c(x, y)$ determines the amount by which the HIP_CENTER point should be moved, which equates to $D_c(x)/2$ for the X axis and $D_c(y)/2$ for the Y axis. The process is repeated until it converges below a threshold. The sign of $D_c$ determines the direction of motion.
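A minimal sketch of one distance-correction iteration follows; representing the correction as an accumulated pixel offset applied to the projection, and the threshold value, are assumptions.

```python
import numpy as np

def distance_correction_step(offset_px, s_center_px, hip_center_px,
                             threshold_px=3.0):
    """One iteration of step 5: shift the projected skeleton by half the
    offset D_c between the silhouette centre point and the projected
    HIP_CENTER, until the offset falls below a threshold."""
    d_c = s_center_px - (hip_center_px + offset_px)   # D_c(x, y)
    if np.linalg.norm(d_c) < threshold_px:
        return offset_px, True                        # converged
    return offset_px + d_c / 2.0, False               # move by D_c / 2
```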

6) Angle correction: The angle correction relies on the difference $D_{diff}$ between the Euclidean distances $D_{kd}$ (figure 4d) and $D_{RL}$, where $D_{kd}$ is the hand separation as seen from the camera and $D_{RL}$ is the hand separation of the projected skeletal points from the Kinect. The projection angle $\theta$ is then iterated between 0° and 90° until the value of $D_{diff}$ falls below a threshold. Angles above 90° are not considered, since facing both cameras at such an angle is not practical.

Fig. 4: (a) Capture position. (b) Contour distance measurement and area segmentation. (c) Contour distance graph. (d) Distance.

Fig. 5: (a) Calibration distance vs. actual distance. (b) Calibration angle vs. actual angle. (c) Distance correction vs. actual distance. (d) Angle correction vs. actual angle.
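The angle-correction loop can be sketched as below, reusing the project_skeleton sketch from Section IV; the step size, pixel threshold and joint indices (assumed to follow the Kinect SDK's 20-joint ordering) are illustrative.

```python
import numpy as np

HAND_LEFT, HAND_RIGHT = 7, 11   # assumed indices into the 20-joint skeleton

def correct_angle(hand_dist_camera, kinect_joints, T, psi, A,
                  threshold=5.0, step_deg=1.0):
    """Sweep theta from 0 to 90 degrees and keep the value whose projected
    hand separation (D_RL) best matches the separation measured in the
    secondary camera image (D_kd), stopping once D_diff falls below the
    threshold."""
    best_theta, best_diff = 0.0, np.inf
    for theta_deg in np.arange(0.0, 90.0 + step_deg, step_deg):
        pix = project_skeleton(kinect_joints, T, np.radians(theta_deg), psi, A)
        d_rl = np.linalg.norm(pix[HAND_RIGHT] - pix[HAND_LEFT])
        diff = abs(hand_dist_camera - d_rl)          # D_diff
        if diff < best_diff:
            best_theta, best_diff = theta_deg, diff
        if diff < threshold:                         # converged
            break
    return np.radians(best_theta)
```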

VI. TESTING AND RESULTS

A. Calibration

The first experiment tests how accurate the calibration results are. In the first test, the relative camera distance is measured using eight pattern views, with the angle kept constant, and the results are plotted against the actual distance. In the second test, the orientation between the cameras is varied while the relative distance is kept constant and calibrated, and the measured angle is plotted against the actual angle. The calibration method is assumed to be successful.

B. Calibration results

The results are shown in figures 5a and 5b. They indicate that the calibration is accurate for both angle and distance; however, the distance measurement accuracy starts to fall beyond 60 cm of Euclidean distance between the two cameras. This is attributed to the fact that the calibration pattern appears increasingly smaller the further apart the cameras are, leading to inaccurate calibrations. The angle calibration faces no such problem and maintains high accuracy throughout the test. One problem was that it was hard to calibrate beyond 70°, since the pattern was no longer completely visible to both cameras.

C. Projection

The skeletal projection is tested to see how well the algorithm performs. Due to the lack of motion capture equipment, and possible errors resulting from the Kinect skeletal capture itself, the results are assessed subjectively by visual inspection.

D. Projection results

The projection is highly dependent on the calibration; as long as the calibration is accurate, the projection produces accurate results.

E. Calibration refinement

In this test, after calibration and projection, the camera is purposely moved and rotated by a known amount, and we test how accurately the calibration refinement algorithm can correct for the change.

F. Calibration refinement results

Based on the results in figure 5c, the recalibration is able to measure the new translation almost accurately. Its angle recalibration accuracy, however, falls short initially, as shown in figure 5d, although it improves to more acceptable results at higher angles. Because of the way the algorithm is implemented, it does not take spatial data into account, leading to rough estimates. The person must also be fully visible to the Kinect for the approach to work, which limits the angle measurement to only 50° at a distance of 50 cm. However, the distance over which the method can measure is far greater than what the chessboard calibration can achieve, giving accurate results at distances up to 1.2 m, compared with 60 cm for the calibration.

VII. FUTURE WORK

The next step is to further improve the results and to fully automate the process. In addition, research is needed to determine what kind of action recognition works best with all of the available features, such as a combination of RGB and skeletal data, silhouette and depth, etc.

Further development of the recalibration step can lead to a fully automatic calibration process that would eliminate the need for the chessboard calibration approach.

Our approach can also be used for applications not related to action recognition. These could include improving the skeletal tracking of the Kinect by using the view from the other camera: the secondary camera can implement some form of limb tracking, and this information can be fed to the skeletal tracking algorithm in the event that an object occludes the Kinect's view.

VIII. CONCLUSION

In this paper, we have built and tested a multicamera system that combines the strength of the Microsoft Kinect's depth sensor with the robustness and occlusion handling of a multi-view setup. The setup uses Shotton et al.'s method [14] to extract skeletal data from the Microsoft Kinect's depth information, which is then projected onto the secondary camera's image plane. The projection relies on the geometric data extracted by Hirschmuller's stereo calibration method [21]. Our approach also features a feedback system to correct stereo calibration errors. The correction approach is also meant to simplify the calibration procedure, and it produced promising results in its initial stages of development. However, its accuracy falls short compared to the pattern-based calibration, leaving room for improvement. On the upside, the correction method is able to measure over larger distances than the calibration method can, giving an incentive for future development.

REFERENCES

[1] S. Cheema, A. Eweiwi, C. Thurau, and C. Bauckhage, "Action recognition by learning discriminative key poses," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, 2011, pp. 1302–1309.

[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, 2005, pp. 886–893.

[3] D. Gehrig, H. Kuehne, A. Woerner, and T. Schultz, "HMM-based human motion recognition with optical flow data," in Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International Conference on, 2009, pp. 425–430.

[4] A. Yilmaz and M. Shah, "Actions sketch: a novel action representation," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, 2005, pp. 984–989.

[5] M.-C. Roh, H.-K. Shin, and S.-W. Lee, "View-independent human action recognition based on a stereo camera," in Pattern Recognition, 2009. CCPR 2009. Chinese Conference on, 2009, pp. 1–5.

[6] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. Camps, "Dynamic subspace-based coordinated multicamera tracking," in Computer Vision (ICCV), 2011 IEEE International Conference on, 2011, pp. 2462–2469.

[7] S. Calderara, A. Prati, and R. Cucchiara, "A markerless approach for consistent action recognition in a multi-camera system," in Distributed Smart Cameras, 2008. ICDSC 2008. Second ACM/IEEE International Conference on, 2008, pp. 1–8.

[8] M. Harville and D. Li, "Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera," in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2, 2004, pp. II-398–II-405.

[9] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, "Integrated person tracking using stereo, color, and pattern detection," in Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, 1998, pp. 601–608.

[10] P.-M. Cheung and K.-T. Woo, "MCMC-based human tracking with stereo cameras under frequent interaction and occlusion," in Computational Intelligence for Security and Defence Applications (CISDA), 2012 IEEE Symposium on, 2012, pp. 1–8.

[11] B. Langmann, K. Hartmann, and O. Loffeld, "Depth camera technology comparison and performance evaluation," in Center for Sensor Systems, 2012, pp. 438–444.

[12] L. Xia, C.-C. Chen, and J. Aggarwal, "Human detection using depth information by Kinect," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, 2011, pp. 15–22.

[13] ——, "View invariant human action recognition using histograms of 3D joints," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, 2012, pp. 20–27.

[14] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 1297–1304.

[15] J. Han, L. Shao, D. Xu, and J. Shotton, "Enhanced computer vision with Microsoft Kinect sensor: A review," Cybernetics, IEEE Transactions on, vol. 43, no. 5, pp. 1318–1334, 2013.

[16] Y.-J. Chang, S.-F. Chen, and J.-D. Huang, "Kinerehab: A Kinect-based system for physical rehabilitation: A pilot study for young adults with motor disabilities," in Proceedings of ASSETS '11, Dundee, Scotland, UK, 2011.

[17] A. Roy, Y. Soni, and S. Dubey, "Enhancing effectiveness of motor rehabilitation using Kinect motion sensing technology," in Global Humanitarian Technology Conference: South Asia Satellite (GHTC-SAS), 2013 IEEE, 2013, pp. 298–304.

[18] S. Obdrzalek, G. Kurillo, F. Ofli, R. Bajcsy, E. Seto, H. Jimison, and M. Pavel, "Accuracy and robustness of Kinect pose estimation in the context of coaching of elderly population," in Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, 2012, pp. 1188–1193.

[19] Microsoft. Skeletal tracking. [Online]. Available: http://msdn.microsoft.com/en-us/library/hh973074.aspx

[20] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, Nov. 2000. [Online]. Available: http://dx.doi.org/10.1109/34.888718

[21] H. Hirschmuller, "Stereo processing by semiglobal matching and mutual information," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 2, pp. 328–341, 2008.

[22] G. G. Slabaugh, "Computing Euler angles from a rotation matrix," technical note, 1999, retrieved August 6, 2000.

[23] Microsoft. Kinect sensor. [Online]. Available: http://msdn.microsoft.com/en-us/library/hh438998.aspx
