

Face and hands detection and tracking applied to the monitoring of medication intake

Soufiane Ammouri and Guillaume-Alexandre Bilodeau
Département de génie informatique et génie logiciel, École Polytechnique de Montréal,
P.O. Box 6079, Station Centre-ville, Montréal (Québec) Canada, H3C 3A7
{soufiane.ammouri, guillaume-alexandre.bilodeau}@polymtl.ca

Abstract

This paper presents detection and tracking methods for a user's body parts in video sequences. They are applied to the detection of human activities, in this work the detection of medication intake. We use a technique based on color and shape to detect the body parts and the medication bottles. Color is used for skin detection, and shape is used to distinguish the face from the hands and to differentiate the medication bottles. To track these objects, we use methods based on color histograms, Hu moments and edges. For the recognition of medication intake, we use a Petri network and event recognition. Our method has an accuracy of more than 97% and allows the detection of medication intake in various scenarios.

Keywords: face and hands detection, face tracking, hand tracking, medication, Petri network.

1. Introduction

Detection and tracking of body parts allow the detection of human activities remotely. Knowing that technology can significantly improve the quality of life of the elderly and of persons with mental deficiencies, we are interested in the problem of monitoring medication intake, an activity involving the detection and tracking of the face and hands. Remote monitoring of patients allows them to stay in their homes longer and offers them flexible and efficient medical monitoring. The field of video surveillance has attracted a lot of interest, but little of it has been directed toward monitoring medication intake. Recently, some papers ([1], [2]) addressed this topic. In the first paper [1], the authors use the algorithm described in [3] to track the head. They model the head with an ellipse whose size can vary from one frame to the next. For each image, a local search determines the best-fitting ellipse, based on the gradient magnitude around its perimeter and the likelihood of skin color inside it. In the same article, the hands are localized based on regions with high skin color likelihood. These regions are extracted from the skin color likelihood map created during the head tracking process. In [1], several simplifying assumptions were used, such as supposing that the patient is wearing a long-sleeved shirt. This avoids having to identify the hand within the arm region, but it makes the system less flexible. In the second paper [2], the authors combine feature localization and template matching with few assumptions to detect and track the face. In the same article, an approach based on grayscale sharpness is developed to localize and track the hand. The authors in [2] determine the orientation of the fingers to recognize the opening and closing of medication bottles. In this case, the fingers must always be visible, which is not always true. Also, medication intake in that article is detected simply if a sequence of actions occurs in a series of frames, without analyzing the duration of these actions, which may increase the rate of false detections. Much research has been done on object tracking. In [4], the authors proposed a robust appearance model which combines color distributions and particle filters to track non-rigid objects such as faces, cars and football players. In [5], the authors use a special mask with an isotropic kernel and employ the mean-shift procedure to perform the tracking. Rui and Chen [6] proposed to track the face contour with an unscented particle filter. No matter which object we want to track in a video sequence, we need a model to describe it. This model can contain a priori information about the object as well as information extracted from previous frames. The model can include descriptors of color, texture, shape or any other type of primitive. In our work, we combine these descriptors to locate and track the face and hands. We initially use color to detect the skin regions contained in each frame. Thereafter, a shape descriptor


is used to locate and track the face. Hand tracking is done by exploiting the edge and centroid properties of the regions. For modeling medication intake, we use a Petri network; tokens move from one place to another only if the corresponding events last for a specific number of frames. The detection and tracking of medication bottles are done by combining color histograms and a shape descriptor, which are presented later. The remainder of this document is structured as follows. In section 2, we present the method used to detect the skin regions contained in each frame of the video sequence. Section 3 presents the technique adopted for the detection and management of occlusions between the body parts. Sections 4 and 5 describe the methods used for detection and tracking of the face and hands. In section 6, we explain how the medication bottles are located and tracked. Section 7 shows the Petri network developed for the detection of the human activity. In section 8, we provide the results and performance of our implementation, and finally section 9 concludes the paper and presents some ideas for future work.

2. Detection of skin regions

In our study, we use the color feature to detect skin regions. This method has important advantages such as speed and efficiency for skin region detection [7]. After thresholding and segmentation of the skin regions, we can detect the face and hands in the source image. This step is important, since a bad localization of the face and hands decreases the possibility of recognizing the human activity. To use this simple method, we assume that the background and the people's clothes have colors different from skin when capturing the video sequences. Human skin is often represented by a portion of a particular color space, and it is therefore possible to extract pixels whose color is similar to that of human skin. We begin by converting each frame of our input sequence to a color image. Subsequently, we transform the image to a color space better adapted for skin detection. We selected the HSV color space. The advantage of this space is that the luminance is isolated in the V (Value) channel, so that color is expressed independently of variations in luminosity. Some authors [8] have proposed appropriate thresholds for this type of operation. In our work, we used the thresholds shown in Table 1 for the detection of skin pixels. For a pixel to be labeled as part of the skin, it must fall within all the intervals of Table 1.

Channel      Lower threshold   Upper threshold
Hue          0 (0°)            0.1 (36°)
Saturation   0.2               1
Value        0.2               0.8

Table 1: Thresholds used for skin detection in HSV color space.

Once the classification of skin pixels is made, connected regions are found and small regions are removed. Each blob (region) contains information relating to a cluster of pixels, such as its bounding box, pixel list and area. The area property is used to reject blobs which are too small. Figure 1 shows the result after skin detection with the method mentioned above.

Fig. 1: Example of skin regions detection. A), B), C), D), E) Frames of the original video sequences. F), G), H), I), J) Detection of the skin regions contained in each frame.

We can notice false detections for hair (Fig. 1E), as well as regions that are not detected (e.g., a small part of the forehead in Fig. 1B). In Fig. 1E, the hair and the skin have similar colors and the hair is therefore classified as a skin region. In the case of non-detected regions, some regions reflect more light and thus appear more illuminated. This change in the intensity of light modifies the color of the region in the HSV space. The authors of [3] showed that when the saturation S is low and the value V is high, or vice versa, the color tends toward white, thus escaping the detection thresholds. To limit the number of pixels that are not detectable due to reflections, we reduced the classification interval of the Value channel, as shown in Table 1. In general, the chosen thresholds limit the conflict with colors other than those of the skin. The method can detect different skin types under different illumination conditions, and it can detect the face even when the user wears glasses or a cap (Fig. 1D).
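As a rough illustration of this step, the sketch below (not the authors' code; written with OpenCV and NumPy, with an arbitrary minimum blob area) applies the Table 1 thresholds to an HSV-converted frame and keeps only the connected skin regions that are large enough.

```python
# Minimal sketch of the skin-segmentation step (Table 1 thresholds).
# Function and parameter names (e.g. min_area) are illustrative only.
import cv2
import numpy as np

def detect_skin_blobs(frame_bgr, min_area=200):
    """Return a binary skin mask and the bounding boxes of large skin blobs."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h = hsv[:, :, 0] / 179.0          # OpenCV hue range is [0, 179]
    s = hsv[:, :, 1] / 255.0
    v = hsv[:, :, 2] / 255.0

    # Thresholds of Table 1: H in [0, 0.1], S in [0.2, 1.0], V in [0.2, 0.8]
    mask = ((h >= 0.0) & (h <= 0.1) &
            (s >= 0.2) & (s <= 1.0) &
            (v >= 0.2) & (v <= 0.8)).astype(np.uint8)

    # Connected components; reject blobs too small to be a face or a hand
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = [stats[i, :4] for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return mask, boxes
```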


3. Occlusion between the skin regions

For handling occlusions, we assume that the parts of the body (hands and face) are visible and not in occlusion in the initial frame. Thereafter, we examine the number of skin regions extracted in each frame to manage occlusions in the input sequence. In other words, we compare the number of skin regions detected in the current frame Ft with the number detected in the previous frame Ft-1. This comparison allows us to identify the presence of occlusions and the number of occluded regions, as shown in Fig. 2.

Fig. 2: Occlusion handling using the extracted skin regions.
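The bookkeeping described above, together with the closest-centroid step described in the next paragraph, can be sketched as follows; the helper below uses hypothetical names and flags a merge when the blob count drops, returning the two closest previous-frame centroids as the pair presumed to be in occlusion.

```python
# Rough sketch of the occlusion detection: a drop in the number of skin blobs
# between F(t-1) and F(t) is interpreted as a merge of the two closest blobs
# of F(t-1).  All names are illustrative assumptions.
import itertools
import numpy as np

def detect_merge(prev_centroids, curr_centroids):
    """Return the pair of previous-frame blob indices believed to be in occlusion, or None."""
    if len(prev_centroids) < 2 or len(curr_centroids) >= len(prev_centroids):
        return None                      # no merge detected
    pairs = itertools.combinations(range(len(prev_centroids)), 2)
    # The two closest centroids in F(t-1) are the most likely merged pair in F(t)
    return min(pairs, key=lambda p: np.linalg.norm(
        np.asarray(prev_centroids[p[0]]) - np.asarray(prev_centroids[p[1]])))
```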

It is important to mention that in this work, we do not use 3-D positions. When we mention occlusion, collisions are included; without three-dimensional positioning, we cannot distinguish between the two. So, once an occlusion is detected, we compare the distances between the skin region centroids of the previous frame to find the two closest regions in Ft-1, which are the ones in occlusion in Ft. The techniques for detection and tracking of the skin regions are presented in the following sections. The problem of occlusions is one of the most widespread and difficult to handle when tracking multiple objects. In [9], the authors define the different problems encountered when processing occlusions. Among these problems are the detection of the occlusion event itself and the correct tracking of the occluded and occluding objects during the occlusion. To mitigate this last problem, it is necessary to have a feature that discriminates between the tracked objects. In [10], the author proposes to use color as a feature with a particle filter to track several people. In our case, the tracked objects are parts of the body and do not have discriminating colors. In order to minimize false detections and tracking errors, the tracking of skin regions is suspended for the duration of an occlusion. When the objects separate, tracking resumes.

4. Face detection and tracking

Once the segmentation of the image into regions is done, we can isolate pixels that potentially belong to the same object (skin regions). The purpose of this step is to detect and track the human face separately from the other body parts, in this case the hands. To do this, we suppose that a region R in our initial frame represents a face if:

• The ratio between the width and the height of the region is smaller than 2.25. That is,

  $\mathrm{MajorAxisLength}_R / \mathrm{MinorAxisLength}_R < 2.25$.  (1)

• The ratio between the surface and the square of the perimeter of the region is larger than 0.02. That is,

  $\mathrm{Area}_R / (\mathrm{Perimeter}_R)^2 > 0.02$.  (2)

The first test rejects regions which are too long and narrow to be a face, such as a forearm. The second one is a measure of circularity, which identifies elliptical regions like the head. The authors of [2] assumed that the face orientation varies between 45° and 135°. This hypothesis is plausible if a person's head remains straight during an activity such as taking medication. We do not make this assumption because we want our detection and tracking algorithms to be applicable to other human activities.
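A minimal sketch of these two tests is given below, assuming blob measurements such as those returned by a region-properties routine (e.g. skimage.measure.regionprops); the function name is illustrative.

```python
# Sketch of the two face tests (Eqs. 1 and 2) applied to blob measurements.
def is_face_candidate(major_axis, minor_axis, area, perimeter):
    elongation_ok = (major_axis / minor_axis) < 2.25      # Eq. (1): not too elongated
    circularity_ok = (area / (perimeter ** 2)) > 0.02     # Eq. (2): roughly elliptical
    return elongation_ok and circularity_ok

# Example: a circle has area / perimeter^2 = 1 / (4*pi) ~ 0.08, well above 0.02
print(is_face_candidate(major_axis=120, minor_axis=90, area=8400, perimeter=340))
```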

Once the face is localized in the initial frame of the sequence, we calculate the second order Hu moment of the corresponding blob. A Hu moment [11] is an invariant shape descriptor obtained from sums over all pixels of the image, weighted by polynomials of the pixel positions. Let us consider our image I(i,j) after the extraction of the skin regions. The moment of order (p + q) of each blob is defined as

  $m_{pq} = \sum_i \sum_j i^p j^q I(i,j)$  (3)

and the central moment of order (p + q) as

  $\mu_{pq} = \sum_i \sum_j (i - \bar{i})^p (j - \bar{j})^q I(i,j)$  (4)


with

  $\bar{i} = m_{10} / m_{00}$  (5)  and  $\bar{j} = m_{01} / m_{00}$.  (6)

The normalized central moment of order (p + q) is defined as

  $\eta_{pq} = \mu_{pq} / (\mu_{00})^{\lambda}$  (7)

with

  $\lambda = (p + q)/2 + 1$.  (8)

These moments are invariant to translation and scaling. Hu moments [11] are calculated from the normalized moments and are invariant to translation, rotation and scaling. We use the first and second order Hu moments:

  $\phi_1 = \eta_{20} + \eta_{02}$  (9)

  $\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$.  (10)

For the following frames, we calculate the Hu moments of the skin regions after the segmentation process. We then compare them with the moment of the face region in the previous frame. The use of Hu moments is justified by analyzing a sequence of frames in which the face changes orientation, as shown in fig. 3.
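The computation of φ1 and φ2 (Eqs. 3-10) and the frame-to-frame matching can be sketched as below; the helper names are ours, and the matching criterion (smallest absolute difference of φ2) is an assumption consistent with the description above rather than the authors' exact rule.

```python
# Illustrative computation of the first two Hu moments for a binary blob mask,
# and of a face-matching rule based on phi2.
import numpy as np

def hu_phi1_phi2(mask):
    """mask: 2-D array, nonzero inside the blob."""
    I = (mask > 0).astype(np.float64)
    i, j = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    m00 = I.sum()
    ibar, jbar = (i * I).sum() / m00, (j * I).sum() / m00           # Eqs. (5), (6)

    def eta(p, q):                                                   # Eqs. (4), (7), (8)
        mu = (((i - ibar) ** p) * ((j - jbar) ** q) * I).sum()
        return mu / m00 ** ((p + q) / 2.0 + 1.0)

    phi1 = eta(2, 0) + eta(0, 2)                                     # Eq. (9)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4.0 * eta(1, 1) ** 2       # Eq. (10)
    return phi1, phi2

def pick_face(blob_masks, prev_face_phi2):
    """Among the current skin blobs, return the index whose phi2 best matches the face."""
    phi2s = [hu_phi1_phi2(m)[1] for m in blob_masks]
    return int(np.argmin([abs(p - prev_face_phi2) for p in phi2s]))
```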

Fig. 3: Tracking of the skin regions. A), B), C), D) Frames 5, 8, 11, 14, 17 and 20 of the test video sequence. E), F), G), H) Detection of the skin regions contained in each frame.

We examined the evolution of the Hu moments of order 1 (φ1) and order 2 (φ2) of the blobs corresponding to the face in this same sequence.

Fig. 4: Comparison of the evolution of the Hu moments of order 1 and 2 of the face for the sequence of figure 3.

Fig. 4 shows that the Hu moment of order 1 is more sensitive to changes of face orientation than the moment of order 2. The order 2 moment gave us good results, and consequently it is the one used for detection and tracking of the face, as previously mentioned. The idea of using a shape descriptor for tracking the face comes from the fact that the face's shape is different from that of the hands and varies less during human activity. This consideration is verified by fig. 5. Face tracking is thus performed by comparing this shape descriptor from frame to frame.

Fig. 5: Comparison of the evolution of the Hu moments of order 2 of the hand and the face for a sequence of frames.

5. Hands detection and tracking

After localizing the blob which represents the face, the remaining blobs, which represent other skin regions, are assumed to correspond to the hands. To differentiate the two hands, for each frame, we suppose that the left hand is placed to the left of the right one, which is generally true. An analysis of the blobs according to their horizontal positions then allows us to differentiate between the two hands. Detecting and tracking a hand becomes difficult when the person wears a short-sleeved pullover. In this case, the hand must be located within the arm region. We developed an algorithm based on contours to detect the hand. We use a Sobel edge detector followed by a dilation to link edges and obtain a better definition of the contours of the regions which contain the person's hands. The result obtained with contour detection using the Sobel method followed by dilation is presented in fig. 6.

Fig. 6: Detection of contours of skin regions. A), B), C): Frames 25, 125 and 325 of the test video sequence. D), E), F): Detection of contours with the Sobel method followed by dilation.

Once the contours of the skin regions are found, we define our model for each skin region which does not represent the face. Our rectangular model of the hand, for each frame of our video sequences, is defined as

  $H = [B, L, C, F]$,  (11)

where B represents the smallest rectangle that includes the arm region (bounding box), C is the center of the rectangle representing the hand, L represents the width of the hand and F is the density of the hand's edges. For each frame, we suppose that the hand's width is approximately 4/5 of the face's width. The hand's width is automatically adjusted if the user changes his position relative to the camera, because the size of the face region also changes. Thus, for each skin region which does not represent the face, we start by localizing both extremities of the region. This localization is performed using the size of the bounding box and of the hand. Fig. 7 shows an example of candidate hand regions.

Fig. 7: Detection of the regions which can contain the hand in the right arm. A), B) Frames 25 and 425 of the test video sequence. C), D) Hand candidates are the red rectangles.

The best candidate hand region is the one which contains more edges, because of the fingers. For both rectangles, we calculate the edge density as defined in [12] with

  $F = \frac{1}{N} \sum_{(i,j) \in R} I_c(i,j)$,  (12)

where N is the number of pixels in each rectangle and Ic is the image containing the edges of the skin regions (Ic(i,j) = 1 on edge pixels). Finally, we complete our hand model by calculating the center of the rectangle which covers the hand. The tracking for the five following frames is done by comparing the centers of the two extracted rectangles, which should include the hand in the current frame, with the center of the previous model. We suppose that over 5 frames the hand does not change shape enough to invalidate the tracking by comparison of the model centers. In the event of an error, the next frame in which the edge density is computed allows the system to catch up and relocate the correct hand position.
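The following sketch illustrates the hand-candidate selection of Eq. (12) under stated assumptions: OpenCV's Sobel and dilation stand in for the contour step, the gradient threshold is arbitrary, and the rectangle extraction at the two extremities of the arm bounding box is a simplification of the procedure described above.

```python
# Hedged sketch: two rectangles at the extremities of an arm blob are compared
# by edge density (Eq. 12); the denser one (more finger edges) is kept as the hand.
import cv2
import numpy as np

def edge_map(gray, skin_mask):
    """Sobel edges of the grayscale frame, restricted to skin regions, then dilated."""
    gx = cv2.Sobel(gray.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)
    # 100.0 is an arbitrary gradient threshold chosen for this sketch
    edges = ((np.hypot(gx, gy) > 100.0) & (skin_mask > 0)).astype(np.uint8)
    return cv2.dilate(edges, np.ones((3, 3), np.uint8))

def edge_density(edges, box):
    """F of Eq. (12): fraction of edge pixels inside the rectangle (x, y, w, h)."""
    x, y, w, h = box
    roi = edges[y:y + h, x:x + w]
    return roi.sum() / float(roi.size)

def pick_hand(edges, arm_box, hand_width):
    """Compare the two extremities of the arm bounding box and keep the denser one."""
    x, y, w, h = arm_box
    if w >= h:   # roughly horizontal arm: candidates at left and right ends
        candidates = [(x, y, hand_width, h), (x + w - hand_width, y, hand_width, h)]
    else:        # roughly vertical arm: candidates at top and bottom ends
        candidates = [(x, y, w, hand_width), (x, y + h - hand_width, w, hand_width)]
    return max(candidates, key=lambda b: edge_density(edges, b))
```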

6. Medication bottle detection and tracking

To validate our algorithm for skin region detection and our methods for face and hands tracking, we have chosen the activity of medication intake. To test our system, we used medication bottles with different shapes and colors. To create the model of the colors and shapes of the medication bottles, a group of images containing regions of each bottle is used for training. The color histograms and the Hu moments of these regions are calculated. The color thresholds of the table where the bottles are put are also learned. This avoids multi-scale searches and false detections of the bottles. We assume that at the beginning of the video sequence, the bottles are not in occlusion. This allows us to know the initial number of bottles present in the video sequence. For all the objects on the table, we calculate their color histograms and their second order Hu moments. We thus detect the objects which have the same characteristics (color and shape) as those of the learned medication bottles. The color histograms are treated as vectors, and the distance between them is the minimum cost of switching from one distribution to the other, computed with the minimum distance of pair assignments (MDPA):

  $D(h(I), h(M)) = \sum_{j=0}^{K-1} \left| \sum_{k=0}^{j} \big( h(I)[k] - h(M)[k] \big) \right|$.  (13)
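For illustration, Eq. (13) can be computed as the sum of absolute cumulative differences between the two histograms; the small example below uses NumPy and hypothetical bin values.

```python
# Small sketch of the MDPA histogram distance of Eq. (13).
import numpy as np

def mdpa(h_i, h_m):
    """Minimum distance of pair assignments between two 1-D histograms of equal length."""
    diff = np.asarray(h_i, dtype=np.float64) - np.asarray(h_m, dtype=np.float64)
    return np.abs(np.cumsum(diff)).sum()

# Example: identical histograms give 0; shifting one unit of mass by one bin gives 1
print(mdpa([4, 0, 1], [4, 0, 1]))   # 0.0
print(mdpa([1, 0, 0], [0, 1, 0]))   # 1.0
```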

Fig. 8 shows an example of medication bottle detection.

Fig. 8: Example of skin regions and medication bottles detection. A), B), C) Frames of the original video sequences. D), E), F) Detection and tracking of the objects concerned.

Given that the bottles are rigid objects, we track them by comparing the centroids of the objects detected on the table in the current frame with the centroids of the medication bottles detected in the preceding frame. For each frame, we compute the number of bottles detected. If this number is lower than in the previous frame, we infer that a bottle is in occlusion or collision with the user's hand. We then compute the distances between the centroids of the two hands and the centroid of the taken bottle to know which hand handles the bottle. In order to limit errors in the detection and tracking of the bottles, the tracking of a medication bottle is suspended for the duration of an occlusion. When the object is put back on the table, the tracking resumes.
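A simplified sketch of this centroid bookkeeping is given below; all names are ours, and the nearest-centroid association is a plain minimum-distance match rather than the authors' exact procedure.

```python
# Sketch of centroid-based bottle tracking and of the "which hand holds the bottle" test.
import numpy as np

def match_bottles(prev_bottle_centroids, curr_object_centroids):
    """Associate each previously known bottle with the nearest object on the table."""
    matches = {}
    for b, pc in enumerate(prev_bottle_centroids):
        d = [np.linalg.norm(np.asarray(pc) - np.asarray(c)) for c in curr_object_centroids]
        matches[b] = int(np.argmin(d)) if d else None   # None: bottle no longer on the table
    return matches

def holding_hand(bottle_centroid, left_hand_centroid, right_hand_centroid):
    """Return which hand is closest to a bottle that left the table."""
    dl = np.linalg.norm(np.asarray(bottle_centroid) - np.asarray(left_hand_centroid))
    dr = np.linalg.norm(np.asarray(bottle_centroid) - np.asarray(right_hand_centroid))
    return "left" if dl < dr else "right"
```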

7. Human activity recognition

For human activity detection, which in our work is medication intake, we use a Petri network. A Petri net is an abstract model of the flow of information in a system [13]; it is represented by a graph with two types of nodes (places and transitions) connected by arcs. A Petri net evolves when a transition fires: tokens are taken from the input places of the transition and sent to its output places. A scenario is thus recognized when a token reaches the final place of the Petri net. The authors of [14] discuss the advantages of using Petri networks for the representation and recognition of events; among them is the ability to represent events in a sequential, simultaneous and synchronized way. Our Petri net, designed for representing and detecting the medication intake, has seven places and eleven events, as shown in fig. 9.

Fig. 9: Petri network used for the medication intake recognition.

At first, a token is put in the initial place Pi. If one of the events E1, E2 or E3 occurs, we move the token to place P2. We use the logical and temporal relations defined in [14] for the construction of our Petri network. The medication intake activity is recognized when the token reaches place Pf. For the E4 event, we suppose that the bottle opening occurs when the user is in possession of the bottle and there is occlusion or contact between the two hands. In [2], no duration constraint is imposed on the transitions. In our work, to validate the transitions and avoid false detections of events, a transition is validated only if the duration of its event exceeds a certain number of frames. The minimum number of frames for the duration (MNFD)


of each event was determined experimentally and is presented in Table 2.

Events   E1 E2 E3 E4 E5 E6 E10 E11 E7 E8 E9
MNFD     5  10  10  1

Table 2: The minimum number of frames for the duration (MNFD) of each event used in our work.
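The duration constraint can be illustrated with the toy class below; it is not the authors' Petri net implementation, and the event name and MNFD value used in the example are placeholders.

```python
# Toy illustration of the duration check used to validate Petri net transitions:
# an event only fires once it has been observed for its minimum number of
# consecutive frames (MNFD).
class DurationValidatedEvent:
    def __init__(self, name, mnfd):
        self.name, self.mnfd, self.run = name, mnfd, 0

    def update(self, observed_this_frame):
        """Return True on the frame where the event becomes valid."""
        self.run = self.run + 1 if observed_this_frame else 0
        return self.run == self.mnfd

# Example: an event with MNFD = 5 fires only after 5 consecutive observations
e1 = DurationValidatedEvent("hand-bottle contact", mnfd=5)
fired = [e1.update(obs) for obs in [1, 1, 1, 0, 1, 1, 1, 1, 1]]
print(fired.index(True))   # 8: the interrupted run at frame 3 is discarded
```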

8. Results

In this section, we present the results obtained on video sequences with a resolution of 320x240, taken with a Sony DFW-SX910 camera. The application is programmed in Matlab. We start by evaluating the performance of our algorithms for face and hands tracking. To do so, we label, for each frame, the body parts detected and tracked by our system. We compare these labels with the ground truth, which is the actual position of the face and hands. The system was tested on video sequences with different scenarios and 3 users. The results of these algorithms for the different video sequences are presented in Table 3.

Sequence              Number of frames   Frames with correct face tracking   Frames with correct hands tracking
Sequence François          110                 107                                100
Sequence Soufiane1         154                 150                                152
Sequence Soufiane2         694                 691                                674
Sequence Atousa            145                 143                                144
Total                     1103                1091                               1070
Effectiveness                                  99%                                97%

Effectiveness = (total of frames with correct tracking / total number of frames) × 100.

Table 3: Results of detection and tracking of the body parts for 4 video sequences.

Table 3 shows that the proposed method allows the localization and tracking of the body parts with an effectiveness of 99% for the face and 97% for the hands (1091/1103 and 1070/1103 frames, respectively). In special cases in a video sequence, a hand may have a shape similar to that of the face. In this case, the system cannot identify the correct face region and consequently commits an error in the detection of the hands. This explains the lower accuracy of the system for the hands compared to the face. When the person wears short sleeves, the hand may sometimes have an edge density lower than that of the other extremity of the arm. This is another source of errors for the system. Concerning the medication bottles, the system was able to detect and track them correctly at all times. To further validate the detection and tracking part, we evaluated the performance of the detection of the human activity, which in our case is medication intake. Fig. 10 shows three graphs (for three of the video sequences) which represent the periods when the principal events take place, as well as the series of frames recorded by the system representing the duration of each event.

Fig. 10: Plot of results. A), B), C) System states as a function of the frame number for three test sequences; the plotted states are "occlusion or collision between hands", "occlusion or collision between hand and face" and "occlusion or collision between hands and bottle".




The state "occlusion or collision between hands and bottle" is equivalent to the events E1, E2 or E3 of our Petri network. The presence of this state in parallel with the state "occlusion or collision between hands" means the occurrence of the event E4. The E6 event occurs when the state "occlusion or collision between hand and face" is detected. The end of the state "occlusion or collision between hands and bottle" represents the event E9 of our Petri network. Then the token moves to place Pf and medication intake is detected. The system successfully detected the events in the tested video sequences. More tests need to be performed on a larger set of medication intake scenarios to evaluate the success rate.

9. Conclusion

The proposed method for face and hands tracking uses color and Hu moments. It requires neither a background extraction step nor specific training for the user. Regarding the final results, our method is effective for detecting the gestures involved in taking medication, and it could be extended to other human activities at home or at work in the future.

For future work, 3D positioning, using stereoscopy for example, could reduce the number of wrongly detected contacts between objects, thus allowing a more efficient management of collisions and occlusions. A localization algorithm for medication bottles that allows effective tracking regardless of occlusions and contact with the hand would increase the confidence in the Petri net state transitions. Finally, a face recognition algorithm could be added to our system to determine who is doing the activity.

10. References

[1] M. Valin, J. Meunier, A. St-Arnaud and J. Rousseau, "Video Surveillance of Medication Intake", Int. Conf. of the IEEE Engineering in Medicine and Biology Society, New York City, USA, Aug. 2006.

[2] D. Batz, M. Batz, N. da Vitoria Lobo and M. Shah, "A computer vision system for monitoring medication intake", in Proc. IEEE 2nd Canadian Conf. on Computer and Robot Vision, Victoria, BC, Canada, 2005, pp. 362-369.

[3] S. Birchfield, "Elliptical head tracking using intensity gradients and color histograms", in Proc. IEEE Computer Vision and Pattern Recognition, Santa Barbara, CA, USA, 1998, pp. 232-237.

[4] A. Jacquot, P. Sturm and O. Ruch, "Adaptive Tracking of Non-Rigid Objects Based on Color Histograms and Automatic Parameter Selection", IEEE Workshop on Motion and Video Computing (MOTION 2005), Breckenridge, Colorado, January 2005.

[5] D. Comaniciu, V. Ramesh and P. Meer, "Kernel-Based Object Tracking", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, 2003.

[6] Y. Rui and Y. Chen, "Better proposal distributions: Object tracking using unscented particle filter", in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Kauai, Hawaii, vol. II, 2001, pp. 786-793.

[7] A. Lemieux, "Système d'identification de personnes par vision numérique", Faculté des sciences et de génie, Université Laval, December 2003, pp. 30-36.

[8] K. Sobottka and I. Pitas, "Face localization and facial feature extraction based on shape and color information", in Proc. IEEE Int. Conf. on Image Processing (ICIP), August 1996.

[9] S. Carbini, J. E. Viallet and O. Bernier, "Simultaneous Body Parts Statistical Tracking for Bi-Manual Interactions", ORASIS, Fournol, France, May 24-27, 2005.

[10] O. Lanz, "Occlusion Robust Tracking of Multiple Objects", International Conference on Computer Vision and Graphics (ICCVG), Warsaw, Poland, 2004.

[11] A. Choksuriwong, H. Laurent and B. Emile, "A Comparative Study of Objects Invariant Descriptors", ENSI de Bourges - Université d'Orléans, Laboratoire Vision et Robotique, UPRES EA 2078, 10 boulevard Lahitolle, 18020 Bourges Cedex, France.

[12] L. G. Shapiro and G. C. Stockman, Computer Vision, Prentice Hall, Upper Saddle River, New Jersey, pp. 215-216.

[13] J. L. Peterson, "Petri Nets", ACM Computing Surveys, vol. 9, pp. 223-252, 1977.

[14] N. Ghanem et al., "Representation and recognition of events in surveillance video using Petri nets", in Event Detection and Recognition Workshop at ICCV, IEEE, June 2004.
