
Vision-Based Hand Hygiene Monitoring in Hospitals

CS229 Fall 2015 Project Report
Zelun Luo, Boya Peng, Zuozhen Liu

Abstract

Hand hygiene has been shown in many studies to be an effective intervention to reduce transmission and infections. This project focuses on interpreting visual clinical data for hand hygiene monitoring. We propose two distinct deep learning approaches to detect the hand hygiene action on manually collected and labeled data. Specifically, we investigate a fixed-window and a pose-based hand detector using Convolutional Neural Networks (CNNs). We show that both approaches achieve high accuracy and outperform our baseline model, a linear Support Vector Machine (SVM) classifier.

1 Introduction

With the recent success of deep learning in computer vision, visual clinical data can be exploited to improve the understanding of patient experience and environment during hospital stays. Such data can contain rich information about patient condition, such as the appearance of distress, which has been described as the 6th vital sign [1], as well as details about the occurrence and nature of clinical activities ranging from patient care to bundle compliance and hand hygiene. However, visual clinical data remains an under-explored source of information in healthcare settings.

In this project, we want to make use of valuable visual clinical data in the hand hygiene setting, where the objective is to detect when a person performs the hand hygiene action using various computer vision and machine learning methods. This action is defined as a person placing his or her hand under a hand hygiene dispenser and receiving soap. Therefore, this problem can be formulated as a supervised binary classification task with raw visual image inputs.

Given that deep learning methods have achieved state-of-the-art performance in various computer vision tasks in recent years, we want to apply CNNs to this classification task. We are also interested in exploring pose-based approaches that may be easily extended to monitoring other clinical activities in healthcare settings.

2 Dataset

For an initial pilot study, we collect both depth and RGB data from a depth sensor mounted in a lab environment. We collect a dataset of 2 hours of depth and RGB signals, from which we extract 25,465 frames containing 630 positive hand hygiene frames. Examples of depth images are shown in Fig. 1. These frames are then used to train and evaluate our SVM baseline model, our CNN-based hand hygiene detection model, and our pose-based model. Since the dataset is highly imbalanced, with far more negative frames than positive ones, we use cross validation to tune the ratio of positive to negative frames in the training set.

For our pose-based approach, we use the hand dataset from the Oxford Visual Geometry Group's repository [2]. We use 11,700 hand images together with 12,800 synthesized negative examples to train the hand classifier.
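To make the ratio-tuning step above concrete, the following is a minimal sketch of the negative subsampling it implies; the helper name and the candidate ratios are our illustrative assumptions, as the report only states that the ratio is chosen by cross validation.

```python
import numpy as np

def subsample_training_set(X, y, neg_per_pos, seed=0):
    """Keep all positive frames and a random subset of negatives so the
    training set has `neg_per_pos` negatives per positive (hypothetical
    helper; the report does not describe its implementation)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_neg = min(len(neg_idx), neg_per_pos * len(pos_idx))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, n_neg, replace=False)])
    return X[keep], y[keep]

# Candidate ratios would then be scored with cross validation, e.g.:
# for r in (1, 2, 5, 10): evaluate(subsample_training_set(X, y, r))
```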


Figure 1: Examples of depth images from our dataset. From left to right, the first two are positive instances of hand hygiene, and the last two are challenging negative instances.

3 Approach

3.1 SVM baseline model

For each image, we first crop a 64×64 region that contains the dispenser. This region encodes rich information about the hand hygiene action, so we can train a classifier on features based directly on the raw pixel values. Our classifier is a linear SVM with the hinge loss function. To combat overfitting, we also incorporate L1 regularization (the slack penalty) into our SVM model, which is equivalent to the optimization problem expressed below.

\[
\min_{\gamma, w, b} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\]
\[
\text{s.t.} \;\; y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m
\]
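As a concrete illustration, here is a minimal sketch of this baseline using scikit-learn, whose hinge-loss linear SVM matches the objective above. The crop coordinates and helper names are placeholders, not values from the report, which does not specify an implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def crop_dispenser_region(frame, top_left=(80, 60), size=64):
    """Crop the fixed 64x64 window around the dispenser.
    The coordinates here are placeholders, not those used in the report."""
    y, x = top_left
    return frame[y:y + size, x:x + size]

def train_baseline(frames, labels, C=1.0):
    # Raw pixel values of the cropped region are used directly as features.
    X = np.stack([crop_dispenser_region(f).ravel() for f in frames])
    # Hinge loss with slack penalty C, matching the optimization above.
    clf = LinearSVC(C=C, loss="hinge")
    clf.fit(X, labels)
    return clf
```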

3.2 Fixed-window hand hygiene detection

In this approach, we train a CNN model to detect whether the hand hygiene action occurs in a video frame. We select the fixed window to be a cropped region containing the dispenser (a 64×64 pixel region near the dispenser).

The network architecture consists of two convolutional layers, each followed by a max pooling layer, and two fully connected layers. Since the input images have relatively small dimensions, we decide that two convolution steps are enough to extract the high-level feature representations needed for classification. The output of the final fully connected layer is a binary classification of whether the hand hygiene action is occurring, and we optimize a logistic loss function using stochastic gradient descent. We then use cross validation to tune all hyperparameters and determine the final architecture of our CNN-based model, shown in Fig. 2.
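The report does not name a framework or the tuned layer sizes; the following is a minimal sketch of the described architecture in PyTorch (our choice), with illustrative filter counts rather than the values in Fig. 2.

```python
import torch
import torch.nn as nn

class HandHygieneCNN(nn.Module):
    def __init__(self, in_channels=1):  # 1 for depth, 3 for RGB
        super().__init__()
        self.features = nn.Sequential(
            # Two conv layers, each followed by max pooling, as described.
            nn.Conv2d(in_channels, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            # Two fully connected layers ending in a single logit.
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):                    # x: (N, C, 64, 64)
        return self.classifier(self.features(x))

# Logistic loss optimized with stochastic gradient descent, as in the text.
model = HandHygieneCNN()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
```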

In our experiments, we compare this model against a similar CNN model trained on the full frame image (320×240 pixels). In addition, we also compare the accuracy of models trained on RGB images and on depth images, respectively.

Figure 2: CNN-based hand hygiene detection model architecture

3.3 Pose-based approach

We train a CNN-based detector for the hand joint in each detected human region, since the hand is what performs the hand hygiene action. We then consider hand hygiene to be performed if a hand is detected in the physical space immediately under the hand hygiene dispenser; in such a scenario, the hand placement is most likely to trigger the soap to be dispensed. In this approach, we first train a CNN-based hand detector that detects hands in a 32×32 pixel region. The model consists of two convolutional layers, each followed by a max pooling layer, and the training procedure is almost identical to that of the first CNN model.

In the classification phase, for each input image, we use the sliding-window method to extract regions of size 32×32 pixels with a stride of 4 pixels. We then feed each extracted region into our hand detector and predict whether a hand is detected. Hand hygiene is considered to be performed if the distance between the closest hand and the dispenser is smaller than a fixed threshold. Fig. 3 shows the architecture of our pose-based model.
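To make the classification phase concrete, here is a minimal sketch under stated assumptions: the detector callable, the dispenser coordinates, and the threshold value are placeholders not given in the report; the window size and stride follow the text.

```python
import numpy as np

WINDOW, STRIDE = 32, 4

def detect_hands(frame, hand_detector):
    """Slide a 32x32 window with stride 4 and return the centers of
    windows the CNN detector classifies as containing a hand."""
    H, W = frame.shape[:2]
    centers = []
    for y in range(0, H - WINDOW + 1, STRIDE):
        for x in range(0, W - WINDOW + 1, STRIDE):
            patch = frame[y:y + WINDOW, x:x + WINDOW]
            if hand_detector(patch):        # binary hand/no-hand prediction
                centers.append((y + WINDOW // 2, x + WINDOW // 2))
    return centers

def hand_hygiene_performed(frame, hand_detector, dispenser_yx, threshold=20.0):
    """Positive if the closest detected hand is within a fixed distance of
    the dispenser (the pixel threshold here is an assumption)."""
    centers = detect_hands(frame, hand_detector)
    if not centers:
        return False
    dists = [np.hypot(cy - dispenser_yx[0], cx - dispenser_yx[1])
             for cy, cx in centers]
    return min(dists) < threshold
```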

Figure 3: Pose-based model architecture

4 Experiments

Using the dataset described in Section 2, we evaluate all of the approaches discussed in the previous section on both RGB and depth images. Taking the target class imbalance into account, we use Mean Average Precision (MAP) as our metric to reflect the performance of our different detectors. The results for these detectors are displayed in Table 1. For privacy reasons, depth images are preferred in realistic settings, and we want to make sure that our models are able to adapt to this challenge. Note that the hand detector in our pose-based approach is trained only on RGB images; therefore, this approach is currently limited to RGB images.
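For concreteness, average precision can be computed from detector confidence scores as follows; this is a toy illustration using scikit-learn (our choice; the report does not specify how the metric was computed).

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1])                 # toy ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # detector confidences
print(f"average precision: {average_precision_score(y_true, y_score):.3f}")
```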

From Table 1, we observe that both approaches outperform the linear SVM baseline on RGB images. The intuition is that the linear SVM baseline imposes a strong bias on the hypothesis class in RGB space, and the model underfits due to a lack of complexity. One promising solution is to train the SVM with various kernels to learn non-linear boundaries in the RGB space, as sketched below. However, the SVM baseline performs surprisingly well on depth images. We believe that the reduction in dimension from depth images gives rise to a simpler decision boundary that can be well represented by a linear one.
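The suggested kernelized extension could look like the following sketch; the kernel choice and hyperparameters are illustrative, as this direction is not pursued in the report.

```python
from sklearn.svm import SVC

# RBF-kernel SVM for non-linear decision boundaries on the same
# raw-pixel features used by the linear baseline above.
kernel_clf = SVC(kernel="rbf", C=1.0, gamma="scale")
# kernel_clf.fit(X_train, y_train)  # same training data as the baseline
```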

The CNN model over the dispenser region achieves strong performance on both RGB and depth images. The results also validate the success of CNNs in achieving state-of-the-art performance in vision tasks. However, the results obtained from the CNN over the entire image are less satisfactory due to undesired noise in the rest of the image. Fig. 4 shows detection results on cropped depth images. We observe that the CNN model over the dispenser region correctly detects the hand hygiene action when an arm is stretching out to the dispenser. However, as the false negative results in the second row show, this model fails to detect the action when a person comes too close to the dispenser. This limitation is caused by partial occlusions due to the top-down viewing angle of the depth sensor.

The pose-based approach performs worse than the fixed-window CNN. One challenge is that people's hands are often occluded by the dispenser while they are performing hand hygiene, and this method depends on hand detection to make its final prediction. Such input frames are likely to be classified as negative because the hand detector is unable to detect any hands. On the other hand, false positives sometimes occur when people's hands are within the distance threshold of the dispenser but are not performing the hand hygiene action. However, this method has the potential benefit of being able to tie the action to its performer. Overall, this model achieves decent performance considering its simplicity. More complex pose-based models can be explored to resolve the challenges mentioned above. Fig. 5 shows different detection results using the pose-based approach.

                                     RGB     Depth
SVM baseline over dispenser region   0.561   0.889
CNN over full image                  0.695   0.450
CNN over dispenser region            0.956   0.937
Pose-based approach                  0.807   –

Table 1: Average precision of hand hygiene action detection.


Figure 4: Examples of detection results using the CNN approach over the dispenser region. Green for correct labelings; red for incorrect labelings.

Figure 5: Examples of detection results using the pose-based approach. Green for correct labelings; red for incorrect labelings.

5 Conclusion

We believe that the recent success of using machine learning techniques over depth signals to perceive the world could have an unprecedented impact in health care. We have shown that it is possible to detect hand hygiene compliance, an important component of reducing the costs associated with hospital-acquired infections.

For future work, we plan to work on a pose-estimation model that is invariant to different viewpoints and can handle self-occlusion. A pose-estimation model allows more general applications such as person identification and activity recognition and characterization, which can be broadly applied in healthcare settings. Although there has been various work on 2D/3D pose estimation from RGB images [4][5], work on 3D pose estimation from depth images is relatively scarce. Shotton et al. [3] appear to represent the state of the art for pose estimation from depth images, but they still fail to address the challenges of handling different camera viewpoints (e.g., top view) and occlusions.

References

[1] D. Howell and K. Olsen. Distress, the 6th vital sign. Current Oncology, 18(5):208, 2011.

[2] A. Mittal, A. Zisserman, and P. Torr. Hand detection using multiple proposals.

[3] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. In Proc. CVPR. IEEE, 2011.

[4] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation.

[5] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3D human poses from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
