
Figure-Ground Segmentation Improves Handled Object Recognition in Egocentric Video

Xiaofeng Ren
Intel Labs Seattle
1100 NE 45th Street, Seattle, WA 98105
[email protected]

Chunhui Gu
University of California at Berkeley
Berkeley, CA 94720
[email protected]

Abstract

Identifying handled objects, i.e. objects being manipulated by a user, is essential for recognizing the person's activities. An egocentric camera as worn on the body enjoys many advantages such as having a natural first-person view and not needing to instrument the environment. It is also a challenging setting, where background clutter is known to be a major source of problems and is difficult to handle with the camera constantly and arbitrarily moving.

In this work we develop a bottom-up motion-based approach to robustly segment out foreground objects in egocentric video and show that it greatly improves object recognition accuracy. Our key insight is that egocentric video of object manipulation is a special domain and many domain-specific cues can readily help. We compute dense optical flow and fit it into multiple affine layers. We then use a max-margin classifier to combine motion with empirical knowledge of object location and background movement as well as temporal cues of support region and color appearance. We evaluate our segmentation algorithm on the large Intel Egocentric Object Recognition dataset with 42 objects and 100K frames. We show that, when combined with temporal integration, figure-ground segmentation improves the accuracy of a SIFT-based recognition system from 33% to 60%, and that of a latent-HOG system from 64% to 86%.

1. Introduction

We interact with objects throughout our day-to-day lives. Our daily activities are largely defined by what objects we use and the way we handle them. The capability of identifying objects being manipulated in a person's hands is essential to any automatic system that aims to be of use and assistance to the user. Many studies have shown that recognizing handled objects provides rich context information and is the basis for user activity recognition [15, 26].

Figure 1. (a) Our goal is to recognize what objects a user interacts with, as viewed from a constantly moving wearable/egocentric camera. (b) Distractions from a cluttered background are a major challenge, making it hard to focus on the foreground object. (c) We use motion and domain-specific cues to robustly segment out the foreground. (d) We show that figure-ground segmentation greatly improves the recognition of handled objects.

While traditional approaches to the problem use environmental cameras fixed to walls, advances in camera miniaturization and mobile computing have led to an interesting emerging trend of employing a wearable camera [14], attached to a person's body, to infer the status of the user. A wearable or "egocentric" setting enjoys many advantages, such as viewing the activities from a close and natural first-person view, and not needing to instrument the world with wall cameras. It also has its unique challenges, mainly from using a small camera that constantly moves. One such challenge is background clutter. A minor issue for environmental cameras, where background subtraction applies, background clutter becomes a serious problem for wearable cameras. Recent empirical studies (e.g. [18]) have confirmed that background clutter indeed severely impairs object recognition in the egocentric setting.

In this work we address the background clutter problem in egocentric video head-on. We have two objectives: (1) to use motion to robustly separate the moving hands and the objects-in-hand from the background; (2) to show that such a figure-ground segmentation improves the accuracy of state-of-the-art object recognition systems. Our key insight is that we do not need to solve motion segmentation in full: egocentric videos of object manipulation are a special domain, and there are domain-specific constraints that make figure-ground segmentation feasible to solve. We base our studies on the recently established Intel Egocentric Object Recognition dataset [18], which contains 42 objects in 100K frames and presents many interesting challenges for both segmentation and recognition, such as 3D rotations, occlusions, lighting changes and poor video quality.

In Section 3, we present our approach to "background subtraction" in egocentric video. We compute dense optical flow and fit the flow into layers. We combine the motion layers with empirical priors of object location and background movement as well as support mask and color cues that are transferred frame-to-frame. We show that such a domain-specific solution produces figure-ground segmentations that are robust to fast hand movement, slow frame rate and variations in object scale and appearance.

In Section 4, we combine bottom-up segmentation results with two state-of-the-art object recognition algorithms, one using exemplar-based SIFT matching [10, 18] and one using latent HOG models [6, 7]. We show that figure-ground segmentation improves the recognition accuracy of both systems. The improvements become much larger when we integrate per-frame recognition results over time, attesting to the intuition that segmentation suppresses consistent distractions from the background and makes errors more random. We improve the recognition accuracy of SIFT matching from 33% to 60% (for 42 objects), and that of latent HOG from 64% to 86%.

2. Related Work

Background subtraction is a standard problem for fixed-location environmental cameras. Many techniques have been developed, such as the adaptive mixture-of-Gaussians model [23]. For a moving camera, the problem is much harder and typically requires sophisticated motion analysis.

Work on motion analysis generally follows two lines of approach, working with either sparse features or dense optical flow. Sparse features are easier to track or match, and feature tracks are often used in motion segmentation (e.g. [9, 24, 28]). Dense optical flow combines motion information at both corners and edges to compute motion vectors at every pixel. Modern optical flow algorithms can handle large motions and featureless regions (e.g. [4]) at a computational cost, and motion layers can be extracted from the flow (classical works include [25, 1]). Many image segmentation algorithms have been applied to video, such as Normalized Cuts [21] and Graph Cut [27].

Figure-ground segmentation is a special case of segmentation as it targets two well-defined layers, foreground and background. Three recent works are particularly relevant: [17] combines figure-ground segmentation with tracking and shows robust segmentation results in sports video. [20] builds a sparse representation for background feature trajectories and subtracts it from dynamic scenes. [8] uses two rectangles as the prior to highlight foreground regions of people in movies. In comparison, we study segmentation at a much larger scale (100K frames) and aim at using segmentation to improve generic object recognition.

How to combine segmentation and recognition is a prevailing topic in vision. Many have shown that recognition can help segmentation by providing object shape and/or location (e.g. [2, 5]). How to use segmentation for recognition is harder to answer; existing approaches include using superpixels to represent objects [12] and using a top-down segmentation for verification [16]. [13] shows an example of object segmentation and recognition in video, by grouping sparse features, as applied to street scenes.

We focus on the problem of egocentric object recognition, i.e. recognizing objects in manipulation video as seen from a wearable camera. Early studies of wearable vision can be found in [14, 19]. Recently, with advances in hardware technology, there has been an increasing interest in using wearable cameras for life assistance. Large datasets and benchmarks are being established for both object and activity recognition from a wearable camera (e.g. [22]). In particular, we base our studies on the Intel egocentric object dataset [18], which contains many challenging scenarios for both segmentation and recognition.

3. Foreground-Background Segmentation in Egocentric Video

We want to solve the figure-ground segmentation problem in egocentric video from the bottom up, i.e. in an object-independent way, such that figure-ground serves as a pre-processing step and can be used in later stages of object and activity recognition. Our basic assumption is that the foreground moves, however arbitrarily, against a background that is static in the world frame. For most applications of object manipulation, and for a typical close-up first-person view which looks down at the front of the body, it is reasonable to assume that the moving foreground mostly consists of the hands and the objects in hand.

Figure-ground segmentation in the egocentric setting is no easy problem. The camera moves in arbitrary and uncontrolled ways. The foreground, both the objects and the hands, moves rapidly with changing appearance. There are large variations in object size and appearance in daily life. There is no clearly defined background: the foreground object could occupy as much as 70% of the entire frame. Egocentric videos also tend to be poor in quality (with motion blur and sensor noise) and/or slow in frame rate due to the small form factor of the camera and limited indoor lighting.

How can we solve the figure-ground problem in egocentric video without having to solve motion segmentation in general? We observe that there are many regularities in egocentric video, two of which are particularly useful for figure-ground separation: hands and objects tend to appear near the center of the view, and body motion tends to be small and horizontal. We show that these weak domain-specific constraints, when combined with a layered motion analysis, are sufficient to segment out the moving foreground in egocentric video, robust to the challenges listed above. An illustration of our segmentation pipeline is shown in Figure 2.

Figure 2. An illustration of our segmentation pipeline. We normalize optical flow using estimated average background motion, fit the normalized flow into affine layers, and feed the combined cues into Graph Cut. The resulting segmentation is used to update appearance models and is transferred to the next frame using flow.

3.1. Normalized Optical Flow

We start our figure-ground analysis by taking a video frame and computing the dense optical flow v to the adjacent frames. We use the coarse-to-fine variational optical flow algorithm of [4]. As seen in Figure 4, dense optical flow is generally good on the background with or without texture, but it usually does poorly on the foreground, whose appearance changes quickly (especially for specular objects). As pointed out in [18], background motion is much more consistent than foreground motion for such object manipulation actions, and that is what we focus on.

One major problem we have to deal with in egocentric video is that the apparent motion generated by body movement can be large, such as that generated by a sudden body rotation, and it changes quickly. We normalize the optical flow in the following way to compensate for large motion: first, supposing we know the support mask for the background, we compute the average motion of the background v0 and subtract v0 from the flow v; second, we compute the average speed s of the flow ‖v‖ on the background and normalize v to v/max(s, 1).

This is of course a chicken-and-egg problem, as it assumes we already have a background segmentation. We have found that, as we only ask for a mean motion at this point, it is relatively easy to obtain in a video using temporal cues and domain priors. We will discuss the details in Section 3.3.
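As a concrete illustration, the normalization amounts to a few array operations. Below is a minimal NumPy sketch, assuming a dense flow field of shape (H, W, 2) and a boolean background mask estimated as described in Section 3.3; the function name is ours, and applying the speed normalization to the motion-compensated flow is one reasonable reading of the procedure above.

```python
import numpy as np

def normalize_flow(flow, bg_mask):
    """Normalize dense optical flow against body/camera motion.

    flow:    (H, W, 2) dense optical flow to an adjacent frame.
    bg_mask: (H, W) boolean mask of the (estimated) background.
    """
    # 1. Subtract the average background motion v0 from the flow.
    v0 = flow[bg_mask].mean(axis=0)
    flow = flow - v0
    # 2. Divide by the average background speed s, clamped at 1.
    s = np.linalg.norm(flow[bg_mask], axis=1).mean()
    return flow / max(s, 1.0)
```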

3.2. Cues for Figure-Ground Separation

Given the normalized flow v, we fit the motion into affine layers R using RANSAC. For each RANSAC iteration, we randomly pick 3 pixel locations, estimate a 6-D affine motion model from their motions, and count the number of pixels consistent with it. To make the affine fitting robust and to encourage it to return constant motion, we solve an approximate version of regularized least squares, where we regularize the linear parts of the affine model: let P be the matrix of the homogeneous coordinates of the points, v the motion, and a the 6-D affine model; the unregularized solution solves v = Pa. We estimate the regularized affine model a as a = (P′P + λQ)⁻¹P′v, where Q has 1 on the four diagonal entries for the linear coefficients and 0 elsewhere. The affine motion fitting works reasonably well on the background. We do not expect it to be perfect and fit the whole background into one layer. Instead, we use the affine layers to "tie together" a large number of pixels so that we compute the domain-specific cues, defined below, on whole groups.
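The regularized fit and the RANSAC loop are straightforward to write down. The following sketch uses helper names, an inlier tolerance and an iteration count of our own choosing, since the paper does not report them:

```python
import numpy as np

def fit_affine_regularized(pts, flow_vecs, lam=1.0):
    """Regularized least-squares fit of a 6-D affine motion model.

    pts:       (N, 2) pixel coordinates (x, y).
    flow_vecs: (N, 2) normalized flow vectors at those pixels.
    Solves a = (P'P + lam*Q)^-1 P'v, shrinking the linear part toward 0,
    i.e. toward a constant (translational) motion.
    """
    N = len(pts)
    P = np.zeros((2 * N, 6))
    x, y = pts[:, 0], pts[:, 1]
    P[0::2, 0], P[0::2, 1], P[0::2, 2] = x, y, 1.0   # u = a0*x + a1*y + a2
    P[1::2, 3], P[1::2, 4], P[1::2, 5] = x, y, 1.0   # w = a3*x + a4*y + a5
    v = flow_vecs.reshape(-1)
    Q = np.diag([1.0, 1.0, 0.0, 1.0, 1.0, 0.0])      # regularize linear terms only
    return np.linalg.solve(P.T @ P + lam * Q, P.T @ v)

def ransac_affine_layer(pts, flow_vecs, n_iters=200, tol=0.05, rng=None):
    """One affine layer via RANSAC: sample 3 points, fit, count inliers."""
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(pts), size=3, replace=False)
        a = fit_affine_regularized(pts[idx], flow_vecs[idx])
        pred = np.stack([pts[:, 0] * a[0] + pts[:, 1] * a[1] + a[2],
                         pts[:, 0] * a[3] + pts[:, 1] * a[4] + a[5]], axis=1)
        inliers = np.linalg.norm(pred - flow_vecs, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```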

Figure 3. Non-parametric prior distributions estimated from groundtruth. (a) The prior on spatial location, dark meaning high probability. Objects and hands tend to stay in the lower/center region. (b) The covariance prior on motion, shown as iso-probability elliptical contours. Things near the top tend to move horizontally and things to the right up-down, an interesting regularity arising from the mount point of the camera being on the left shoulder.

We employ two figure-ground cues that are specific to the egocentric domain: a location prior and a motion prior, both defined as non-parametric distributions estimated from empirical data.

For the location prior F_l, we want an estimate of how likely each pixel in the image frame belongs to the foreground, a priori, given its location. We build this distribution by averaging groundtruth segmentation masks available from the egocentric object dataset. Let {M^i} be the set of groundtruth binary segmentation masks in the training data, and let M^i_σ be the mask smoothed by a Gaussian kernel. We compute the log average mask F_l(p) = log(Σ_i M^i_σ(p)/N), where N is the number of masks, and normalize it to lie between −1 and 1. We use a large σ (10% of the image diagonal) to ensure that F_l is smooth. This empirical distribution is shown in Figure 3(a).
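A minimal sketch of this computation follows; the small epsilon and the min-max rescaling to [−1, 1] are our choices, since the paper does not spell out the normalization.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_prior(gt_masks, sigma_px):
    """Log of the average Gaussian-smoothed groundtruth mask, rescaled to [-1, 1].

    gt_masks: list of (H, W) binary groundtruth segmentation masks.
    sigma_px: Gaussian kernel width in pixels (~10% of the image diagonal).
    """
    smoothed = [gaussian_filter(m.astype(float), sigma_px) for m in gt_masks]
    Fl = np.log(np.mean(smoothed, axis=0) + 1e-6)   # avoid log(0)
    return 2.0 * (Fl - Fl.min()) / (Fl.max() - Fl.min()) - 1.0
```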

For the motion prior F_m, we want an estimate of how likely a motion vector v(p) at a pixel location p is to have been generated by background motion. The egocentric object videos contain a fair percentage of frames which do not contain any objects or hands, i.e. all pixels belong to the background. We take a random subset of these frames and compute the covariance of motion vectors, Σ_m(p), at each position p. The motion prior F_m(p, v(p)) is computed as log(v(p)′ Σ_m(p)⁻¹ v(p)).
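Evaluating this prior per pixel is a small computation once the inverse covariances have been precomputed from the object-free frames; a minimal sketch (the epsilon is ours):

```python
import numpy as np

def motion_prior(flow, cov_inv, eps=1e-6):
    """Per-pixel motion prior F_m(p, v(p)) = log( v(p)' Sigma_m(p)^{-1} v(p) ).

    flow:    (H, W, 2) normalized optical flow.
    cov_inv: (H, W, 2, 2) inverse covariance of background motion at each position.
    """
    quad = np.einsum('hwi,hwij,hwj->hw', flow, cov_inv, flow)
    return np.log(quad + eps)
```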

In addition to the prior cues F_l and F_m defined at each pixel, we also compute their averages over the motion layers. For each pixel p, if it belongs to a motion layer, it is assigned an average location prior F̄_l and an average motion prior F̄_m, computed as the averages of the prior cues over the layer; otherwise, both are set to 0. We also add an indicator variable b_R(p), 1 if p belongs to a layer and 0 if not, for balance.

Note that the motion-based cues {F_m, F̄_l, F̄_m, b_R} depend on a single flow field. In practice, we compute multiple flows from the current frame I_0 to its surrounding frames I_k, and compute the average of these cues weighted by 1/|k|. We use a window size of 3: k ∈ {−3, ..., 3}.

In addition to the motion and prior cues, we also use two temporal cues that come from segmentations of previous frames. We maintain two appearance models, in the form of RGB color histograms, one for the foreground and one for the background. Let H_f and H_b be the two histogram models; for a given color c, we compute the appearance cue as the log-likelihood ratio F_C(c) = log(H_f(c)/H_b(c)).
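A minimal sketch of this online color cue, assuming the two appearance models are kept as histograms over quantized joint RGB bins; the bin count and smoothing constants are illustrative choices of ours:

```python
import numpy as np

N_BINS = 16  # bins per RGB channel (illustrative choice)

def color_bins(img):
    """Quantize an (H, W, 3) uint8 image into joint RGB histogram bin indices."""
    q = (img.astype(int) * N_BINS) // 256
    return q[..., 0] * N_BINS * N_BINS + q[..., 1] * N_BINS + q[..., 2]

def update_histogram(hist, img, mask):
    """Accumulate pixel counts from the masked region into a color histogram."""
    np.add.at(hist, color_bins(img)[mask], 1)
    return hist

def appearance_cue(img, hist_fg, hist_bg, eps=1e-6):
    """Per-pixel log-likelihood ratio F_C(c) = log(H_f(c) / H_b(c))."""
    pf = hist_fg / (hist_fg.sum() + eps)
    pb = hist_bg / (hist_bg.sum() + eps)
    bins = color_bins(img)
    return np.log((pf[bins] + eps) / (pb[bins] + eps))
```

Here `hist_fg` and `hist_bg` would simply be `np.zeros(N_BINS ** 3)` arrays updated from each new segmentation, in the spirit of the online appearance update shown in Figure 2.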

We use optical flow to transfer figure-ground masks from previous frames: let S be the segmentation mask of the previous frame I_{−1}, 1 if foreground and −1 if background; let v_{0,−1} be the (backward) flow from the current frame I_0 to I_{−1}; for each pixel p in the current frame, we find the corresponding pixel p′ = v_{0,−1}(p) in the previous frame and set the transferred mask as F_S(p) = S(p′).

Because of occlusions as well as inaccuracies in the flow, oftentimes more than one pixel p is mapped to the same pixel p′ in I_{−1}. To resolve the ambiguities, we use a second (forward) flow v_{−1,0} and check how far p is from itself when projected through the forward and backward flows, ‖p − v_{−1,0}(v_{0,−1}(p))‖. We assign p′ to the pixel p with the smallest error. For pixels that do not find a corresponding pixel p′, i.e. are undefined, we assign F_S(p) = −2. This is based on the intuition that we want more suppression on background pixels near an occlusion boundary.
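A minimal sketch of the mask transfer with a forward-backward consistency check, assuming backward flow (current to previous) and forward flow (previous to current), both given as (H, W, 2) pixel displacements; the tolerance is ours, and the paper's exact tie-breaking between pixels that map to the same target is omitted:

```python
import numpy as np

def transfer_mask(prev_seg, flow_bwd, flow_fwd, fb_tol=1.0):
    """Transfer the previous figure-ground mask (+1/-1) to the current frame.

    prev_seg: (H, W) previous segmentation, +1 foreground / -1 background.
    flow_bwd: (H, W, 2) flow from the current frame to the previous frame (x, y).
    flow_fwd: (H, W, 2) flow from the previous frame to the current frame (x, y).
    Pixels failing the forward-backward check get the value -2, giving extra
    suppression near occlusion boundaries.
    """
    H, W = prev_seg.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Correspondence in the previous frame via the backward flow.
    px = np.clip(np.round(xs + flow_bwd[..., 0]).astype(int), 0, W - 1)
    py = np.clip(np.round(ys + flow_bwd[..., 1]).astype(int), 0, H - 1)
    # Forward-backward error: project back with the forward flow.
    err = np.hypot(xs - (px + flow_fwd[py, px, 0]),
                   ys - (py + flow_fwd[py, px, 1]))
    FS = prev_seg[py, px].astype(float)
    FS[err > fb_tol] = -2.0
    return FS
```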

We also use two modified temporal cues as follows. First, for the appearance cue F_C, we restrict it to be effective only near the foreground object, so that it does not pick up random patches far away in the background. We take the transferred foreground mask F_S > 0, smooth it with a Gaussian, and compute the restricted appearance cue F̃_C as the product of F_C and the smoothed mask.

For the transferred mask, we find that it is more useful at places where the flow is small; if the flow is large at a pixel p, the correspondences tend to be unreliable, and there is usually enough information in the motion already to decide on figure-ground. Empirically, we can use the groundtruth segmentations to measure how reliable the mask transfer process is, by computing P+(s), the probability that the mask transfer is successful (i.e. same figure-ground label before and after) as a function of the speed s of the normalized flow, and P−(s), the probability that it fails. The ratio P+/P− is high when the speed s is small and falls off as s increases. We fit a parametric form g(s) = a(s + b)^c + d to P+/P−, and define the gated transferred mask as F̃_S = g(s)F_S.

3.3. Figure-Ground Segmentation

We have defined a set of figure-ground features in the previous section: {F_l, F_m, F̄_l, F̄_m, b_R, F_C, F̃_C, F_S, F̃_S}. These features come from several sources, including domain priors of location and motion, layers of optical flow, and temporal cues of support mask and appearance. To combine these features, we train a max-margin linear classifier using groundtruth segmentations. Because the features we use have very clear physical meanings, we constrain the weights to be non-negative.

To obtain the average motion needed for normalizing optical flow in Section 3.1, we use the subset of cues that do not depend on normalization, {F_l, F_C, F_S}, and train a linear classifier on them. We find that these cues usually provide a good estimate of the average background motion needed for normalization, and further iterations do not help much.
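The paper does not specify the solver; one simple way to approximate a max-margin linear classifier with non-negative weights is projected subgradient descent on the regularized hinge loss, sketched below with illustrative hyperparameters:

```python
import numpy as np

def train_nonneg_svm(X, y, C=1.0, lr=1e-3, epochs=50, rng=None):
    """Linear max-margin classifier with non-negative weights.

    X: (N, D) per-pixel feature vectors (F_l, F_m, ...);  y: (N,) labels in {-1, +1}.
    Projected subgradient descent on 0.5*||w||^2 + C * hinge; weights are clipped
    to be >= 0 after every update, a simple way to enforce the constraint.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        for i in rng.permutation(N):
            margin = y[i] * (X[i] @ w + b)
            gw, gb = w.copy(), 0.0            # gradient of the regularizer
            if margin < 1:                    # hinge loss is active
                gw -= C * y[i] * X[i]
                gb -= C * y[i]
            w = np.maximum(w - lr * gw, 0.0)  # project onto w >= 0
            b -= lr * gb
    return w, b
```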

Finally, we apply the standard Graph Cut algorithm [3] to clean up the figure-ground classification output from the linear classifier, which optimizes the following energy:

E(l) = −Σ_p (Σ_i w_i F_i(p)) l(p) + Σ_{p,p′} b(p, p′) |l(p) − l(p′)|   (1)

For the boundary term b(p, p′), we use the probability-of-boundary detector from [11]. We transform the Pb values, which are in [0, 1], to an energy term b by computing b = min(−log(Pb), b_max). This step helps snap the segmentation to the high-quality boundaries produced by Pb.
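A minimal sketch of this cleanup step using the PyMaxflow package (our choice of library; the paper uses the min-cut/max-flow code of [3]). The unary capacities below encode the classifier score, with positive values pulling toward foreground; depending on the terminal convention, the returned segment map may need to be flipped.

```python
import numpy as np
import maxflow  # PyMaxflow

def graphcut_cleanup(unary, pb, b_max=5.0):
    """Binary Graph Cut cleanup of the linear-classifier output.

    unary: (H, W) combined classifier score, positive = foreground evidence.
    pb:    (H, W) probability-of-boundary map in [0, 1].
    The pairwise weight b = min(-log(Pb), b_max) makes label changes cheap
    across strong Pb boundaries and expensive inside smooth regions.
    """
    b_edge = np.minimum(-np.log(np.clip(pb, 1e-6, 1.0)), b_max)

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(unary.shape)
    # Pairwise terms between neighboring pixels, weighted by the boundary map.
    g.add_grid_edges(nodes, weights=b_edge, symmetric=True)
    # Unary terms: foreground evidence to one terminal, background to the other.
    g.add_grid_tedges(nodes, np.maximum(unary, 0), np.maximum(-unary, 0))
    g.maxflow()
    # Boolean (H, W) mask; True = foreground under this convention (may need flipping).
    return ~g.get_grid_segments(nodes)
```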

4. Using Figure-Ground Segmentation for Handled Object Recognition

Once we have a figure-ground segmentation computed from the bottom up, we can readily put it to use for recognizing the objects in the foreground. We combine figure-ground segmentation with two object recognition systems, one using exemplar-based SIFT matching [10, 18] and one using latent HOG models [6, 7].

Figure 4. Example of the figure-ground cues used. From left to right: (1) a video frame; (2) optical flow to the next frame; (3) motion layers extracted from the flow, with unmatched pixels completely dark; (4) motion error computed from multiple flows, dark meaning high "error", i.e. high likelihood of belonging to the foreground; (5) support mask transferred from the segmentation of the previous frame, again dark meaning high foreground likelihood; (6) foreground likelihood computed from the online color models; (7) resulting segmentation.

For SIFT matching, we follow the approach originally discussed in the Intel egocentric object recognition benchmark [18]. First we compute a SIFT matching between a test frame and the set of all exemplar photos of all objects, enforcing a fundamental matrix constraint using RANSAC. We obtain a similarity value for each exemplar photo. Second, we train a multi-class SVM classifier on the vector of similarities. It is straightforward to integrate figure-ground segmentation into SIFT-based matching: in the matching stage, we ignore the SIFT features that fall in the background region.
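A minimal sketch of this masking step, assuming an OpenCV build where cv2.SIFT_create is available; OpenCV's detectAndCompute accepts a binary mask directly, and the exemplar-matching and SVM stages are omitted here.

```python
import cv2
import numpy as np

def masked_sift_features(gray_img, fg_mask):
    """Detect SIFT features only inside the estimated foreground region.

    gray_img: (H, W) uint8 grayscale frame.
    fg_mask:  (H, W) boolean figure-ground segmentation (True = foreground).
    """
    sift = cv2.SIFT_create()
    mask = fg_mask.astype(np.uint8) * 255   # OpenCV expects a uint8 mask
    keypoints, descriptors = sift.detectAndCompute(gray_img, mask)
    return keypoints, descriptors
```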

For the latent HOG model, we use the software made available by the authors of [7]. We keep all the standard settings of the system, including enforcing symmetry and using two components in the mixture model. To test on the egocentric object dataset, we adapt the system to search over 9 rotations, as objects in hand may appear in an arbitrary orientation. The HOG models work by scanning a rectangular template built from local image gradients. It is straightforward to integrate figure-ground segmentation into HOG: when computing image gradients, we zero out the gradients in the background region.

4.1. To Use Segmentation or Not?

We note that it is not always the case that segmentation can help recognition. An example can be found in Figure 8, where the coffee machine is static and does not move with the hands, resulting in a total failure of our segmentation algorithm. For such an object, or an object with distinctive appearance, it may be better to use the full image, not the segmented foreground. To capture this intuition, we use a second-level class-specific model to combine recognition results both with and without segmentation:

For each frame and each object model i, we have a pair of recognition scores y^i_f and y^i_m, without and with the figure-ground segmentation. We want the combined score to be a linear combination of the two,

y^i = w^i_f y^i_f + w^i_m y^i_m   (2)

We train a max-margin classifier to obtain this pair of weights for each class, with the constraints that the weights are non-negative and sum to 1. The positive training data comes from the model with the groundtruth label, and the negative from the scores of all the other models. We find that this linear weighting scheme is effective in choosing which score to use in a class-specific way. For example, for the coffee machine, segmentations are very bad, and the weight on y^i_m is set to 0. For most of the other objects, segmentation works reasonably well and the weight on y^i_m is much higher than that on y^i_f.

4.2. Temporal Integration

As we will show, figure-ground segmentation improves per-frame recognition accuracy, but its power is only fully revealed when we look at the frames as a continuous video and use temporal integration to put together recognition results from a set of adjacent frames.

We make one change to the SIFT matching approach: we replace the 1-vs-1 SVM classifier of [18] with a 1-vs-all SVM. This change may seem minor but is very important: we now have a soft classification score for each frame, which makes it much easier to combine recognition results across multiple frames in a temporal integration scheme.

We convert latent HOG outputs to soft scores between 0 and 1: let y^i = w^i · x be the response from matching the i-th object model w^i to an image patch x. We first normalize y^i using the mean and variance of responses on the i-th object, then push y^i through a sigmoid: p^i = 1/(1 + exp(−αy^i − β)). α and β are chosen using the training data, with α ≈ 1.5 and β ≈ −3.0.

We use a simple but effective scheme for temporal integration: suppose we are allowed to look at a set of frames {t}, and {p^i_t} is the set of recognition scores, between 0 and 1, for the i-th object on frame t. We merge the recognition scores together as follows:

p^i_merged = 1 − Π_t (1 − p^i_t)   (3)
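Both the sigmoid conversion and the merging rule of Eq. 3 are one-liners; a minimal NumPy sketch, assuming the per-object normalization statistics have been precomputed:

```python
import numpy as np

def sigmoid_score(y, mean, std, alpha=1.5, beta=-3.0):
    """Convert a raw latent-HOG response into a soft score in (0, 1)."""
    y_norm = (y - mean) / std
    return 1.0 / (1.0 + np.exp(-alpha * y_norm - beta))

def merge_scores(frame_scores):
    """Noisy-OR temporal integration (Eq. 3): p_merged = 1 - prod_t (1 - p_t).

    frame_scores: (T, K) per-frame soft scores for K objects over a window of T frames.
    """
    return 1.0 - np.prod(1.0 - np.asarray(frame_scores), axis=0)
```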

5. Experiments

We conduct our experiments on the newly established Intel Egocentric Object Recognition dataset [18]. It is a comprehensive dataset of object manipulation as captured from a wearable camera, consisting of 10 video sequences of 42 object instances in 5 environmental settings, about 100K frames total. The videos are at a resolution of 384×512 and 15 fps. It is an instance-level recognition task, and there is no background class. Nonetheless, the problem is challenging in many other respects, such as 3D rotations and illumination changes.

The benchmark supplies 400 groundtruth object segmentations. We use half of the frames in the training sequences to train our segmentation model, and the other half for evaluation. We use precision-recall to evaluate the performance of our segmentation algorithm. For a (soft) figure-ground segmentation M, a threshold t and a groundtruth segmentation S, we compute precision as ‖(M > t) ∩ S‖/‖(M > t)‖ and recall as ‖(M > t) ∩ S‖/‖S‖.
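For clarity, the per-frame precision and recall defined above amount to the following straightforward sketch (aggregation over frames is omitted):

```python
import numpy as np

def precision_recall(soft_seg, gt, thresholds):
    """Precision/recall of a soft segmentation M against a groundtruth mask S."""
    precisions, recalls = [], []
    for t in thresholds:
        pred = soft_seg > t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / max(pred.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)
```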

In Figure 5 we show the precision-recall evaluations for several variations of the segmentation scheme, including a model using the position prior only (no motion prior), a model using the motion prior only, a model using both priors, and a full model using both priors plus the temporal cues. To produce full precision-recall curves, we use the soft segmentation before applying Graph Cut. We observe that the full model works much better than the alternatives, showing that all the features are essential.

For recognition, we test both the SIFT matching system and the latent HOG system, with and without figure-ground segmentation. To train the latent HOG models, we mark bounding boxes of all 42 objects in every 10th frame of 5 training sequences. We interpolate the bounding boxes to all the frames, obtaining about 500 positives per object.

For per-frame classification without segmentation, the SIFT matching system has 13% accuracy (mean of the diagonal of the confusion matrix) on 42 objects, marginally better than that reported in [18]. The latent HOG model takes a lot of effort to train and test, but it also performs much better and achieves 38% accuracy. With figure-ground segmentation, the accuracy of the SIFT system improves from 13% to 20%, and that of latent HOG from 38% to 46%.

The improvements from using figure-ground segmentation become much more significant when we apply temporal integration. We test the temporal integration scheme of Eq. 3 using a fixed-size sliding window, of k frames before and k frames after the center frame. Figure 6 shows how the recognition accuracy grows with the window size, for both systems, with and without segmentation.

Figure 5. Precision-recall evaluation of our segmentations against groundtruth masks. The full model works much better than the variants that use the location prior only, the motion prior only, or both priors without the temporal cues.

We first see that temporal integration is indeed very useful for object recognition in video, improving the accuracy of latent HOG from 38% to 64% with a half window-size of 100. Most of the benefit of temporal integration is seen at the very beginning, with the latent HOG system reaching 58% accuracy at a half window-size of 15 (two seconds). This supports the intuition that foreground objects rapidly change pose, and while some poses are impossible to recognize in a given frame (see Figure 9), there is typically an "easy" pose close by in time that helps nail the object during temporal integration.

We also clearly see that figure-ground segmentation improves the recognition accuracy of both systems, and the effects become much more significant through temporal integration. For example, for the latent HOG system with temporal integration over a window size of 100, the recognition accuracy is 86% vs 64%, a 22% improvement, compared to a 6% improvement in the per-frame case.

There is a natural explanation for this phenomenon. When the recognition system is run on the whole frame, even for a sliding window approach that looks at a local area for recognition, there may be distractions in the background that are consistent over a length of time. One extreme case can be seen in Figure 1, where a water bottle, which is one of the 42 objects, is present in the background. Without motion and figure-ground segmentation, an object recognition system has no way to discard detections on the bottle. With access to a figure-ground segmentation, noise from the background is suppressed. The foreground may still be hard to recognize in many cases, as can be seen from the detection examples in Figure 9. The third column in Figure 9 shows an extreme example, where the object is completely occluded by the hands. In such cases, without distractions from the background, the recognition errors become more random, and thus much easier to suppress through temporal integration. Considering that the egocentric object dataset has many difficulties such as hand occlusion, 3D rotation and motion blur, the accuracy of 86% on the 42-object task is a huge improvement over the 20% accuracy reported in [18], and would be difficult to achieve without segmentation.

Figure 7. Example segmentations from the Intel egocentric object dataset. The first two pairs are consecutive frames, showing examples of the appearance change and large motion in these egocentric videos. Our segmentation approach is capable of handling small objects and textureless objects under a variety of backgrounds and lighting conditions.

Figure 8. Some interesting failures: (a) The empty space between the hands lacks pattern, and the flow estimated there is usually bound to the hands and hence wrong. (b) The hand moves up too quickly and the newly unoccluded region is detached from the rest of the background. (c) The coffee machine is static and does not move with the hand, totally violating our assumptions about the foreground.

6. Conclusion

In this work we have presented a figure-ground segmentation system for egocentric object manipulation videos captured from a wearable camera. Our solution is to utilize domain prior cues combined with layered optical flow. We show that we can robustly segment out foreground objects and hands on a large dataset of 100K frames, robust to many challenges such as an uncontrolled camera, large motion, rapid appearance change and low video quality. Moreover, we show that figure-ground segmentation significantly improves the accuracy of handled object recognition, especially when combined with temporal integration, reaching 86% on the challenging Intel egocentric object recognition benchmark. The system is as yet too slow to deploy on portable devices, but we are confident that, with a combination of fast algorithms (such as for optical flow) and GPU speed-ups (such as for HOG matching), it can be adapted to run in real time. We intend to extend our approach to more realistic activity scenarios and to unsupervised object learning.

Acknowledgments. We thank Matthai Philipose for support and for many helpful discussions.

Figure 9. Examples of object detection using figure-ground segmentation and temporal integration with a latent HOG model. The combined detector works well under many challenging situations such as occlusion, low contrast and motion blur.

Figure 6. Recognition results on the Intel egocentric object dataset [18] showing large improvements due to figure-ground segmentation. (a) Accuracy of the SIFT matching system, with and without segmentation, varying the window size for temporal integration. (b) Accuracy of the latent HOG system, with and without segmentation. Segmentation improves accuracy from 64% to 86%. (c) Latent HOG accuracies on all 42 objects, with and without segmentation.

References

[1] M. Black and P. Anandan. Robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.

[2] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, volume 2, pages 109–124, 2002.

[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. PAMI, 26(9):1124–1137, 2004.

[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004.

[5] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. In ICCV, 2007.

[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I:886–893, 2005.

[7] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.

[8] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In CVPR, 2008.

[9] K. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, 2001.

[10] D. Lowe. Distinctive image features from scale-invariant keypoints. Int'l. J. Comp. Vision, 60(2):91–110, 2004.

[11] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color and texture cues. IEEE Trans. PAMI, 26(5):530–549, 2004.

[12] G. Mori, X. Ren, A. Efros, and J. Malik. Recovering human body configurations: combining segmentation and recognition. In CVPR, volume 2, pages 326–333, 2004.

[13] B. Ommer, T. Mader, and J. Buhmann. Seeing the objects behind the dots: Recognition in videos from a moving camera. Int'l. J. Comp. Vision, 83(1):57–71, 2009.

[14] A. Pentland. Looking at people: sensing for ubiquitous and wearable computing. 22(1):107–119, 2000.

[15] M. Philipose, K. Fishkin, M. Perkowitz, D. Patterson, D. Fox, H. Kautz, and D. Hahnel. Inferring activities from interactions with objects. IEEE Pervasive Computing, 3(4):50–57, 2004.

[16] D. Ramanan. Using segmentation to verify object hypotheses. In CVPR, 2007.

[17] X. Ren and J. Malik. Tracking as repeated figure-ground segmentation. In CVPR, 2007.

[18] X. Ren and M. Philipose. Egocentric recognition of handled objects: benchmark and analysis. In First Workshop on Egocentric Vision, 2009.

[19] B. Schiele, N. Oliver, T. Jebara, and A. Pentland. An interactive computer vision system - DyPERS: Dynamic personal enhanced reality system. In ICVS, pages 51–65, 1999.

[20] Y. Sheikh, O. Javed, and T. Kanade. Background subtraction for freely moving cameras. In ICCV, 2009.

[21] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV, pages 1154–1160, 1998.

[22] E. Spriggs, F. De La Torre, and M. Hebert. Temporal segmentation and activity classification from first-person sensing. In First Workshop on Egocentric Vision, 2009.

[23] C. Stauffer and E. Grimson. Adaptive background mixture models for real-time tracking. In ICCV, 1999.

[24] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. Int'l. J. Comp. Vision, 79(1):85–105, 2008.

[25] J. Wang and E. Adelson. Representing moving images with layers. IEEE Trans. Im. Proc., 3:625–638, 1994.

[26] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. Rehg. A scalable approach to activity recognition based on object use. In ICCV, 2007.

[27] J. Xiao and M. Shah. Motion layer extraction in the presence of occlusion using graph cuts. IEEE Trans. PAMI, 27(10):1644–1659, 2005.

[28] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV, 2006.