Coupling Video Segmentation and Action Recognition

Amir Ghodrati    Marco Pedersoli    Tinne Tuytelaars
KULeuven ESAT-PSI, iMinds, Belgium
{amir.ghodrati, marco.pedersoli, tinne.tuytelaars}@esat.kuleuven.be

Abstract

Recently a lot of progress has been made in the field of video segmentation. The question then arises whether and how these results can be exploited for this other video processing challenge, action recognition. In this paper we show that a good segmentation is actually very important for recognition. We propose and evaluate several ways to integrate and combine the two tasks: i) recognition using a standard, bottom-up segmentation, ii) using a top-down segmentation geared towards actions, iii) using a segmentation based on inter-video similarities (co-segmentation), and iv) tight integration of recognition and segmentation via iterative learning. Our results clearly show that, on the one hand, the two tasks are interdependent and therefore an iterative optimization of the two makes sense and gives better results. On the other hand, comparable results can also be obtained with two separate steps but mapping the feature-space with a non-linear kernel.

1. Introduction

Alongside category-level object recognition and detection, action recognition is, arguably, one of the big computer vision challenges. The first successes in this domain were obtained by realizing that actions can be considered as spatio-temporal objects and, therefore, the wide gamut of methods developed for object recognition and detection can be extended to action recognition, often in a relatively straightforward way. As one particularly successful example, building on spatio-temporal versions of 2D interest point detectors [16], bag-of-visual-words based image classification has been applied to action recognition [17, 33].

However, the analogy between actions and spatio-temporal objects only holds up to some point. There are also important differences that lead to particular challenges and limitations as to what can be achieved simply by following the analogy: i) actions typically exhibit larger variability, e.g. due to different ways of performing the action or due to camera motion; ii) the number of available training examples is usually much smaller, increasing the risk of overfitting; iii) actions often cannot be delineated accurately by bounding boxes; and iv) actions cannot be localized as precisely as objects, since the spatial and temporal extent of an action is often ambiguous.

At the same time, the spatio-temporal nature of actions may also hold opportunities. In particular, video segmentation is often more reliable than image segmentation (i.e., more consistent with object boundaries). This is due to the fact that motion brings an important additional cue to delineate the objects (or actors) from the background. In this paper, we investigate whether video segmentation can be exploited for improved action recognition.

In this context, it is striking that recent work on video segmentation [5, 18] and action recognition [10, 21] builds on the same set of low-level features, known as trajectories [32]. Trajectories are sampled patches that are tracked over several frames, following the underlying motion of the object or scene. These features are extracted densely over multiple spatial scales and described based on their shape, appearance and motion information. Using trajectories, state-of-the-art results have been reported both for action recognition [10, 32] and video segmentation [5, 18].

We therefore start from the representation proposed in [32]. We build on trajectories as low-level image representation, combine them in a simple bag-of-words (BoW) representation and then learn a linear support vector machine (SVM) classifier. Considering this as our baseline, we propose and evaluate several methods that integrate, at different levels, segmentation and recognition. Note that, since action localization is often ambiguous, we deliberately focus on the task of action classification rather than detection, even though the obtained segmentations obviously also provide us with some localization information. We focus on the following schemes:

1. Segmentation. We split the representation of a video into action-related foreground and action-unrelated background, and build separate BoW for each of them (see section 3.2). We experiment with both bottom-up as well as top-down segmentation, where the latter is explicitly geared towards actions.

2. Co-segmentation. In section 3.3 we encourage consistent segmentations over multiple videos via co-segmentation. This results in better segmentation and therefore improved recognition accuracy.

3. Iterative learning. A better segmentation is likely to produce better recognition. At the same time, a better recognition can help to induce better segmentation. In section 3.4 we explain how to iteratively refine both tasks in a coupled learning framework.

4. Kernels. As amply demonstrated in the literature, the bag-of-words (BoW) representation can be improved by mapping the original feature space with a non-linear kernel. In section 3.5 we apply this idea to our problem and describe its advantages and disadvantages.

The remainder of the paper is organized as follows. Section 2 discusses related work. Next, we describe the various schemes for action classification (Sec. 3) and our experimental results (Sec. 4). Section 5 concludes the paper.

2. Related Work

Combined segmentation and recognition. Most work on object recognition simply uses bounding boxes instead of segmenting out the object of interest, see e.g. [2]. An exception to this rule is the work on class-independent object detection, where segmentation plays an important role as one of the few consistent cues over object categories [1, 7]. Also Rosenfeld and Weinshall [23] explicitly link segmentation to object recognition, by learning to extract the foreground mask from training images.

In the context of action recognition, Ullah et al. [30] experiment with segmentations based on motion, action, humans or objects. Action-based segmentation yields the best recognition results. However, this scheme relies on an external dataset of static images of the same set of actions, which may not always be available, and makes a fair comparison with alternative schemes difficult. Hoai et al. [12] also looked into the problem of joint segmentation and classification of human actions. However, their work only considers temporal segmentation.

Co-segmentation. Another line of work related to ours is co-segmentation. Hochbaum and Singh [13] noticed that for co-segmentation, instead of penalizing variations between foreground descriptors, a similar effect can be achieved by rewarding consistency between them.

Vezhnevets et al. [31] propose a weakly supervised approach that segments images jointly by learning a CRF model connecting all superpixels from all training data in a data-driven fashion. Prest et al. [20] extract a fixed number of segments in each video. They assume one of them is the object and the others are background. They minimize an energy defined jointly over all training videos to select this one segment per video. Finally, Rubio et al. [24] extended standard co-segmentation to video.

Iterative methods. Deselaers et al. [6] propose a model that iteratively localizes the objects and learns class-specific appearance and shape. For each image they find the top 100 bounding boxes that are likely to be an object and select the one that best optimizes an energy function defined globally over all training images. Recently, Shapovalova et al. [26] have proposed a similar method to localize consistent foreground 3D boxes across an action class. Our proposed method uses a similar iterative scheme, but in our case the selected foreground is not a 3D box, but a more expressive combination of trajectory-groups.

Lan et al. [15] propose a spatial model that considers action parts as latent variables. They jointly localize and recognize actions using a figure-centric representation. Raptis et al. [21] cluster trajectories into groups, with each group a candidate for a part of an action. Their part-based model incorporates both individual appearance and motion constraints as well as spatio-temporal pairwise constraints, and learns to localize not only the action, but also its constituent parts. We believe that the amount of training data in current action recognition datasets is probably insufficient to localize and learn action subparts. Therefore, we construct a single bag-of-words descriptor for all the foreground groups and one for all the background groups. This is more robust than treating each group separately, as in [21].

Trajectory-based action recognition. In the context of action recognition, trajectories have recently been used in various ways, also going beyond the simple bag-of-words scheme. The most relevant for us are probably [21] (see above), [10] and [25]. Gaidon et al. [10] perform a hierarchical clustering of the trajectories and propose a kernel that computes the structural and visual similarity of two hierarchical decompositions, obtaining strong results. Our method differs from [10] in that our segmentation into foreground and background is not just driven by similarity between groups, but also by top-down cues. Finally, Sapienza et al. [25] also start from trajectories and learn discriminative action subvolumes in a multiple instance learning framework.

3. Coupled Segmentation and Recognition

In this section we define several schemes that can be used to improve action recognition over our baseline model. First we describe the basic framework (section 3.1). In section 3.2 we propose methods based on segmentation, whereas in section 3.3 we add co-segmentation. Then in section 3.4 we present a model which simultaneously localizes actions and learns class-specific foreground and background models. Finally, in section 3.5 we consider a non-linear representation of our model.


Figure 1. Video segmentation. We cluster the video into several trajectory-groups. Then each one is assigned to foreground (red boxes) or background (yellow boxes) based on top-down cues.

3.1. Baseline

Throughout our work we use the dense trajectories proposed by [32]. Using these features, they obtain state-of-the-art results with a simple bag-of-words representation. For each trajectory, a local descriptor is computed around its 3D volume. As in [32], we use histograms of oriented gradients (HOG) to encode static appearance information, histograms of optical flow (HOF) to encode absolute motion information of trajectories, and motion boundary histograms (MBH) to capture relative motion (i.e., discard constant motion information). We build three codebooks, one for each descriptor type, using K-means clustering (W = 4000) on a subset of 100,000 randomly sampled features. The final representation is obtained using locality-constrained linear coding (LLC) and max-pooling [34]. For the BoW baseline, we concatenate for each video the 3 histograms and learn a one-versus-all linear SVM. The trade-off between loss and regularization, C, is set to 100 for this and for all the following SVM training.
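A minimal sketch of this baseline pipeline is given below, assuming the dense-trajectory descriptors are already extracted. The function names, the mini-batch K-means variant and the approximate LLC solver (following the approximation of [34]) are illustrative choices, not the authors' exact implementation.

```python
# Hypothetical sketch of the BoW baseline: K-means codebooks, approximate
# LLC coding with max-pooling, and a one-vs-all linear SVM with C = 100.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

def build_codebook(descriptors, n_words=4000):
    """Cluster a random subset of local descriptors into a visual vocabulary."""
    km = MiniBatchKMeans(n_clusters=n_words, batch_size=10000, n_init=3)
    km.fit(descriptors)
    return km.cluster_centers_                      # (n_words, dim)

def llc_code(x, codebook, knn=5, reg=1e-4):
    """Approximate LLC code of one descriptor (k-NN least squares, as in [34])."""
    d2 = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(d2)[:knn]                      # k nearest codewords
    z = codebook[idx] - x                           # shift codewords to descriptor
    C = z @ z.T
    C += reg * np.trace(C) * np.eye(knn)            # regularize the local covariance
    w = np.linalg.solve(C, np.ones(knn))
    w /= w.sum()                                    # sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w
    return code

def video_histogram(descriptors, codebook):
    """Max-pool the LLC codes of all trajectory descriptors of one video."""
    codes = np.stack([llc_code(x, codebook) for x in descriptors])
    return codes.max(axis=0)

# Per video: concatenate the pooled HOG, HOF and MBH histograms, then train
# a one-vs-all linear SVM (LinearSVC is one-vs-rest by default):
# H = np.hstack([video_histogram(hog, cb_hog), video_histogram(hof, cb_hof),
#                video_histogram(mbh, cb_mbh)])
# clf = LinearSVC(C=100).fit(H_train, y_train)
```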

3.2. Segmentation

There are several algorithms for segmenting a video. Most of them generate an over-segmented video volume, which is further used for grouping into super-regions [35, 11, 5]. Here, we closely follow the work of [5]. First, we build a basic bottom-up video segmentation on top of the dense trajectories. We use this video segmentation in two settings: first, as a fully bottom-up foreground/background segmentation, and second, as an initial oversegmentation of the video into what we call trajectory-groups, which will later serve as intermediate-level representation for our top-down segmentation. A trajectory-group is a set of spatially close trajectories with similar motion patterns. We employ three criteria to define a distance measure between trajectories, ensuring that we cluster trajectories that are spatially close to each other, that move together consistently and that co-exist in a time interval. For trajectories T_i and T_j, we define

\[
d(T_i, T_j) = d_S(T_i, T_j) \cdot d_H(T_i, T_j) \cdot d_T(T_i, T_j)
\]

where d_S is the maximum Euclidean distance between points of the two trajectories in the same frame, d_H is the Euclidean distance between HOF descriptors, and d_T measures the time intersection of the trajectories. If there exists a time interval over which both trajectories co-exist, then d_T(·) = 1; otherwise, it is equal to the temporal distance (number of frames) between the trajectories. Note that this trajectory distance measure is similar to the ones used in [21, 18, 5]. We turn the distances into affinities using a standard exponential with fixed scale γ = 0.1: w_{i,j} = exp(−γ d(T_i, T_j)).

For each video, we randomly select 6000 trajectories and build a fully connected graph with nodes corresponding to trajectories and w_{i,j} the weight of the edge between nodes i and j, forming an n × n edge affinity matrix with n the number of nodes. To assign to each node a label l_i ∈ L from the label set L = {l_1, ..., l_K} in an unsupervised manner, we employ the method proposed in [27], which minimizes the normalized cut of the graph. Each of these K clusters is a trajectory-group, described by h_i, as shown in Fig. 1.
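A minimal sketch of this grouping step is shown below. The three distance functions d_spatial, d_hof and d_temporal are assumed to be given, and sklearn's SpectralClustering with a precomputed affinity is used as a stand-in for the normalized-cut solver of [27].

```python
# Hypothetical sketch: turn pairwise trajectory distances into affinities
# w_ij = exp(-gamma * dS * dH * dT) and partition the graph into K groups.
import numpy as np
from sklearn.cluster import SpectralClustering

def trajectory_affinity(trajs, d_spatial, d_hof, d_temporal, gamma=0.1):
    """Fully connected affinity matrix over the sampled trajectories."""
    n = len(trajs)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = (d_spatial(trajs[i], trajs[j])
                 * d_hof(trajs[i], trajs[j])
                 * d_temporal(trajs[i], trajs[j]))
            W[i, j] = W[j, i] = np.exp(-gamma * d)
    np.fill_diagonal(W, 1.0)
    return W

def cluster_trajectories(trajs, d_spatial, d_hof, d_temporal, K=20):
    """Return one trajectory-group id per trajectory."""
    W = trajectory_affinity(trajs, d_spatial, d_hof, d_temporal)
    return SpectralClustering(n_clusters=K,
                              affinity='precomputed').fit_predict(W)
```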

Bottom-up. By choosing K = 2, our algorithm behaves as in [5], producing a segmentation of the video into foreground and background motion. To distinguish between foreground and background we consider the overlap with the ground truth annotation, assigning the cluster with higher overlap to foreground¹. The final descriptor H of each video is the concatenation of h_f and h_b, with h_f (resp. h_b) the LLC coding of the three previously described features for foreground (resp. background). We refer to this setting as Bottom-Up segmentation.

¹ We realize that using the ground truth annotations for test data is a form of cheating. However, we use this only as a baseline that corresponds to an upper bound on what can be achieved with bottom-up segmentation.

Top-down. In this setting we first oversegment the video into K trajectory-groups (with K > 2). Then we assign each trajectory-group to either foreground or background, based on a learnt model that captures the similarities shared by trajectory-groups across the videos of a certain action. For training we use the bounding box annotations indicating the location of the actions to label trajectory-groups as foreground or background. Trajectory-group i is assigned to foreground, i.e. x_i = 1, if at least one quarter of the trajectories inside the trajectory-group are assigned to the annotated foreground; otherwise we set x_i = 0. A trajectory is assigned to foreground if the majority of its points lie in the foreground bounding box. For each action, given a set of training trajectory-groups h = {h_1, ..., h_r} and their corresponding labels x = {x_1, ..., x_r}, we learn a model Θ for action-related (foreground) vs. action-unrelated (background) trajectory-groups using a linear SVM.

As we use a linear model, Θ^T h_i is the learned scoring function for the trajectory-group h_i, and p(h, Θ) is a function that maps that score to [0, 1]. For each video we want to find the best configuration of binary labels x in terms of the segmentation score. Hence, we define the cost function Σ_{i=1..K} Ω(x_i, h_i, Θ), where

\[
\Omega(x, h, \Theta) =
\begin{cases}
p(h, \Theta) & \text{if } x = 0 \\
1 - p(h, \Theta) & \text{if } x = 1
\end{cases}
\qquad (1)
\]

It is easy to show that, up to a constant,

\[
\min_{x} \sum_{i=1}^{K} \Omega(x_i, h_i, \Theta)
= \min_{x} \sum_{i=1}^{K} \big[ p(h_i, \Theta)(1 - x_i) + (1 - p(h_i, \Theta)) x_i \big]
= \sum_{i=1}^{K} \min_{x_i} \big[ (1 - 2\, p(h_i, \Theta))\, x_i \big],
\]

so the label of each trajectory-group can be obtained with an independent minimization: if p(h, Θ) < 1/2 then h is assigned to background, otherwise it is considered foreground. Our final representation of a video is the concatenation of the foreground and background histograms, where each of these is obtained by summing the histograms of the trajectory-groups assigned to foreground and background respectively: H = [H_f, H_b] with

\[
H_f = \frac{1}{C_f} \sum_{i=1}^{K} h_i x_i, \qquad
H_b = \frac{1}{C_b} \sum_{i=1}^{K} h_i (1 - x_i)
\qquad (2)
\]

where C_f and C_b are the mean values of the Euclidean distance between all training samples.

As all foreground trajectory-groups represent human actions, they should have quite similar appearance. Therefore, instead of learning a different segmentation model independently for each action, we can also learn a shared model for all actions. Such a segmentation is learned by extracting generic knowledge about actions and can be considered an actionness detector, similar to objectness [1] or object proposals [7] in static images or in video [28]. In the experiments we refer to Action-specific segmentation when using a different model for each action and Actionness when using a shared model for all actions. Finally, as a control, we also experiment with a Random segmentation, where each trajectory-group is assigned to foreground or background using a uniform probability distribution.
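The per-group labeling and pooling just described can be sketched as follows. The sigmoid mapping of the SVM score to [0, 1] and the normalizers Cf, Cb passed as parameters are assumptions about details not fully specified in the text.

```python
# Hypothetical sketch of the top-down labeling rule and Eq. (2): each
# trajectory-group is labeled independently (background if p < 0.5,
# foreground otherwise) and the video descriptor is H = [Hf, Hb].
import numpy as np

def segment_and_pool(group_histograms, theta, Cf=1.0, Cb=1.0):
    """group_histograms: (K, D) array, one row per trajectory-group."""
    scores = group_histograms @ theta                  # Theta^T h_i
    p = 1.0 / (1.0 + np.exp(-scores))                  # map score to [0, 1]
    x = (p >= 0.5).astype(float)                       # Eq. (1) minimized per group
    Hf = (group_histograms * x[:, None]).sum(axis=0) / Cf
    Hb = (group_histograms * (1 - x)[:, None]).sum(axis=0) / Cb
    return np.concatenate([Hf, Hb]), x                 # H = [Hf, Hb], labels
```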

3.3. Co-segmentation

We can refine the segmentation by encouraging trajectory-groups from different videos that have similar appearance to be labeled as foreground. To this end, we minimize the energy function E defined globally over all training videos of class c:

\[
E(\{x^j\}_{j \in c}; \Theta) = \sum_{h_i^j \in G_v} \Omega(x_i^j, h_i^j, \Theta)
\;-\; \lambda \sum_{(h_i^j, h_{i'}^{j'}) \in G_e} S(h_i^j, h_{i'}^{j'}) \, x_i^j x_{i'}^{j'}.
\qquad (3)
\]

The graph G is built in a data-driven fashion by selecting all trajectory-groups from all the videos of class c as the node set G_v, and by connecting the k most similar ones to create the edge set G_e. The unary term Ω is defined as in Eq. (1) and considers the cost of the segmentation algorithm (e.g. actionness). The pairwise cost is defined by S, a similarity function between two trajectory-groups, computed as S(h_1, h_2) = ⟨h_1, h_2⟩ and rescaled to [0, 1]. Note that for a fully connected graph as in [13], the pairwise potential turns into

\[
\sum_{j,j'} \sum_{l,m} \langle h_l^j, h_m^{j'} \rangle \, x_l^j x_m^{j'} = \sum_{j,j'} \langle H_f^j, H_f^{j'} \rangle,
\]

the similarity between the foreground descriptors. Our pairwise term differs from [31] in that we encourage similarity only on foreground regions, whereas in [31] the pairwise term also encourages similarity between background regions. λ is a parameter that balances the unary and pairwise terms. The energy function E is submodular since the pairwise potential is always negative when x_i^j = 1 and x_{i'}^{j'} = 1. Thus E can be minimized efficiently employing a graph-cut algorithm [3]. In the experimental results we test the impact of adding co-segmentation to the previously considered segmentations. Note that one can also add pairwise constraints within a single video, e.g. between nearby trajectory-groups. We tried this as well, but did not observe any significant improvement in the results.
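For illustration, the sketch below minimizes the energy of Eq. (3) with a simple iterated-conditional-modes pass instead of the exact graph-cut solver [3] used in the paper; it only approximates the optimum, but makes the structure of the energy explicit. The argument layout (unary costs per label, a k-NN edge list with similarities) is an assumption.

```python
# Hypothetical sketch of minimizing Eq. (3): unary[i] holds
# (Omega(x_i=0), Omega(x_i=1)) for node i, and `edges` lists (i, j, S_ij)
# for the k-NN graph over all trajectory-groups of one action class.
import numpy as np

def cosegment_icm(unary, edges, lam=0.2, n_iter=10):
    n = len(unary)
    x = np.array([np.argmin(u) for u in unary])         # init from unary term only
    nbrs = [[] for _ in range(n)]
    for i, j, s in edges:
        nbrs[i].append((j, s))
        nbrs[j].append((i, s))
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            # Reward for labeling node i foreground: lam * sum of similarities
            # to neighbors currently labeled foreground (the pairwise term).
            reward = lam * sum(s for j, s in nbrs[i] if x[j] == 1)
            cost0 = unary[i][0]                          # label background
            cost1 = unary[i][1] - reward                 # label foreground
            xi = int(cost1 < cost0)
            if xi != x[i]:
                x[i], changed = xi, True
        if not changed:
            break
    return x
```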

3.4. Iterative Learning

We believe that segmentation and recognition can help each other, as a better segmentation leads to better recognition models and vice versa. The methods described so far first solve the segmentation and then use its output during action classification. Here, we propose an iterative learning scheme that alternates between segmentation and recognition. In a first step, the method segments an action by finding the labeling vector x for every video; in a second step, it fits new appearance models for foreground and background given the current segmentation. These new models are then used in the next iteration to obtain a better segmentation, and so on.

Assume that an initial model for foreground (Θ_{Λ_f}) and background (Θ_{Λ_b}) segmentation is given. The goal of the first step is to find the best labeling x for each video of a certain action. Let Λ_f and Λ_b be the scoring functions associated to the foreground and background descriptors H_f, H_b of a video. We want to find a configuration of latent variables x that maximizes Σ_{m ∈ {f,b}} Λ_m(x; H_m, Θ_{Λ_m}), i.e. that discriminates one class from the others as much as possible. The fact that H_m is coupled to x makes the maximization hard, since Λ_m depends on the complete descriptor H_m. However, if we choose a linear model for the scoring function then, up to a constant,

\[
\max_{x} \sum_{m \in \{f,b\}} \Lambda_m(x; H_m, \Theta_{\Lambda_m})
= \max_{x} \Big( \Theta_{\Lambda_f}^T H_f + \Theta_{\Lambda_b}^T H_b \Big)
= \max_{x} \Big( \sum_{i=1}^{K} \Theta_{\Lambda_f}^T h_i x_i + \sum_{i=1}^{K} \Theta_{\Lambda_b}^T h_i (1 - x_i) \Big)
= \max_{x} \Big( \sum_{i=1}^{K} (\Theta_{\Lambda_f} - \Theta_{\Lambda_b})^T h_i x_i \Big).
\qquad (4)
\]

In our case, each trajectory-group is described by its LLC representation; to keep the model linear, the score of a video is the sum of the scores of its trajectory-groups, so that the needed linearity is preserved. In this way, to assign the latent variables x for a given video, it is enough to evaluate (Θ_{Λ_f} − Θ_{Λ_b})^T h_i for each x_i independently.

At the second step of the iterative learning, given the labels x of each trajectory-group, we want to update the global models Θ_{Λ_f} and Θ_{Λ_b}. Our model representation is the concatenation of all models for foreground and background:

\[
\Theta_{\Lambda} = \big[ \Theta_{\Lambda_f^{HOG}}, \Theta_{\Lambda_f^{HOF}}, \Theta_{\Lambda_f^{MBH}}, \Theta_{\Lambda_b^{HOG}}, \Theta_{\Lambda_b^{HOF}}, \Theta_{\Lambda_b^{MBH}} \big].
\]

Similarly to [8], the problem we want to optimize is no longer convex in Θ_Λ because it depends on the segmentation x. Therefore a standard linear SVM would produce poor results. Instead, we use the strategy proposed in [8]: we fix the latent values for the positive examples, while for the negative ones we iteratively search for new configurations of latent variables until they no longer generate any additional loss.
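The alternation can be sketched as below, in simplified form: the pooled descriptors are refit with a plain linear SVM rather than the full latent-SVM strategy of [8] (fixing positives and re-mining negatives), which is omitted here. Function names, the random initialization and the binary-class restriction are illustrative assumptions.

```python
# Hypothetical sketch of the alternation in section 3.4: refit the linear
# model on the current pooled descriptors, then relabel every trajectory-group
# with the sign of (theta_f - theta_b)^T h_i as in Eq. (4).
import numpy as np
from sklearn.svm import LinearSVC

def pool_video(groups, x):
    """H = [Hf, Hb]: sums of group histograms labeled foreground / background."""
    Hf = (groups * x[:, None]).sum(axis=0)
    Hb = (groups * (1 - x)[:, None]).sum(axis=0)
    return np.concatenate([Hf, Hb])

def iterative_learning(videos, y, dim, n_rounds=5, C=100, seed=0):
    """videos: list of (K_v, dim) trajectory-group histogram arrays; y: labels."""
    rng = np.random.default_rng(seed)
    labels = [rng.integers(0, 2, size=len(v)).astype(float) for v in videos]
    clf = LinearSVC(C=C)
    for _ in range(n_rounds):
        # Model update: refit on the descriptors pooled with the current labels.
        H = np.stack([pool_video(v, x) for v, x in zip(videos, labels)])
        clf.fit(H, y)
        w = clf.coef_[0]                         # binary case, for brevity
        theta_f, theta_b = w[:dim], w[dim:]
        # Segmentation update: each group is foreground iff its score is positive.
        labels = [((v @ (theta_f - theta_b)) > 0).astype(float) for v in videos]
    return clf, labels
```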

The model computed using the iterative learning can also be used in conjunction with an independent segmentation, such as those defined in section 3.2, or with the co-segmentation introduced in section 3.3. While all combinations of independent segmentation, co-segmentation and iterative learning are tested in the experimental results, here we show just the case that combines all three terms:

\[
E(\{x^j\}_{j \in c}; \Theta_c) = \alpha \sum_{h_i^j \in G_v} \Omega(x_i^j, h_i^j, \Theta)
+ (1 - \alpha) \sum_{h_i^j \in G_v} \varphi(x_i^j, h_i^j, \Theta_{\Lambda_{fb}})
- \lambda \sum_{(h_i^j, h_{i'}^{j'}) \in G_e} S(h_i^j, h_{i'}^{j'}) \, x_i^j x_{i'}^{j'}.
\qquad (5)
\]

Here φ is the normalized score of the iterative learning model, while Ω and S are defined as before. α is set to 0.3 in all experiments and defines the balance between the independent segmentation and the normalized scores obtained by iterative learning. Note that with this configuration, at test time, the pairwise potential in Eq. (5) depends only on the labels of the test data (since the training trajectory-group labels are known), so the pairwise term reduces to a unary term.
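In practice, Eq. (5) only changes the unary part of the energy used for Eq. (3), so the same minimizer can be reused. A minimal sketch, assuming both cost terms are available as per-group arrays (the exact form of the normalized score φ is not fully specified in the text):

```python
# Hypothetical sketch: blend the segmentation cost Omega and the normalized
# iterative-learning cost phi with alpha, then minimize with the same routine
# used for Eq. (3) (cosegment_icm from the earlier sketch).
import numpy as np

def combined_unary(omega, phi, alpha=0.3):
    """omega, phi: (N, 2) arrays of per-group costs for labels {0, 1}."""
    return alpha * np.asarray(omega) + (1 - alpha) * np.asarray(phi)

# x = cosegment_icm(combined_unary(omega, phi), edges, lam=0.2)
```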

As this iterative procedure is non-convex, the final result depends on the initialization of the segmentation. In the experimental results we evaluate our procedure using three different initializations: a Random initialization, an initialization based on Actionness, and one based on the ground truth Annotations².

² Note that here we only use ground truth annotations of the training data, not of the test data.

3.5. Non-linear Kernel

While the iterative learning is a powerful tool, it is limited to linear models. An alternative way to improve results is by mapping the features into a kernel, such that non-linear classifiers can be used while keeping the learning optimization convex. Excluding the iterative learning, all the other combinations of different segmentation and co-segmentation can be used together with a kernel. Considering H = [H_f, H_b], the global descriptor of a video, it can be decomposed into 6 channels corresponding to HOG, HOF and MBH for foreground and for background. As in [30, 32], for classification, the different descriptors are combined in a multi-channel approach:

\[
K(H_i, H_j) = \sum_{c} \exp\!\Big( -\frac{1}{A_c} \, d_{\chi^2}(H_i^c, H_j^c) \Big)
\]

where d_{χ²} is the χ² distance measure and A_c is the mean value of the distances between all training samples of the c-th channel. We train a model for each category and classify test samples using the one-versus-all strategy. While for linear kernels the best coding is LLC, for non-linear kernels we also test the method with hard quantization and average pooling. The different configurations of the method are evaluated in section 4.
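A minimal sketch of this multi-channel kernel, assuming each video is stored as a dict of per-channel histograms and the channel scalers A_c have been precomputed as the mean training-pair distances:

```python
# Hypothetical sketch of the multi-channel chi-squared kernel defined above.
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(videos_i, videos_j, A):
    """videos_*: lists of dicts {channel: histogram}; A: {channel: mean distance}."""
    K = np.zeros((len(videos_i), len(videos_j)))
    for a, Hi in enumerate(videos_i):
        for b, Hj in enumerate(videos_j):
            K[a, b] = sum(np.exp(-chi2_distance(Hi[c], Hj[c]) / A[c]) for c in A)
    return K

# Usage with a precomputed-kernel SVM (one-vs-all as in the paper):
# from sklearn.svm import SVC
# clf = SVC(kernel='precomputed', C=100).fit(multichannel_kernel(train, train, A), y)
```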

4. Experimental Results

Datasets. The YouTube [19] dataset contains 11 action categories and 1600 low-quality (240×320) videos. We follow the original setup, using leave-one-group-out cross validation over a pre-defined set of 25 groups. This dataset is challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background and illumination conditions. We report the average accuracy over all classes, as also done in [9, 32].

The UCF-Sports [22] dataset contains 10 action categories. It consists of 150 video samples extracted from sport broadcasts. For this dataset we follow the leave-one-video-out (LOO) strategy as in [32], as well as the single split proposed in [15]. Note that for the LOO evaluation we also include in the training data the horizontally flipped version of each video, as in [32], while for the single split we do not, as in [15].

Segmentation           Acc.
GT trajectory          90.2%
GT trajectory-group    88.5%
BoW                    83.5%
Bottom-Up              83.6%
Action-specific        80.3%
Actionness             85.0%
Random                 84.1%

Table 1. Baseline accuracies on the YouTube dataset. The reported results are averaged over all classes.

Since the YouTube dataset contains many more videos and yields more stable results, we use it to evaluate each component of our system individually. Then, for a comparison with the state-of-the-art, we report results on both datasets.

Parameter Evaluation. We evaluate the impact of different trajectory lengths with our baseline BoW configuration on the YouTube dataset using the cross-validation protocol. The best performance is obtained with L = 15; with lengths of 5 and 40, the accuracy is 3.8% and 0.5% lower, respectively. We also evaluate the number of trajectory-groups K to use for the top-down segmentation with the Actionness configuration. We obtain the best accuracy for K = 20. The algorithm is less sensitive to K: for K = 5 and K = 50 the accuracy is 0.7% and 1% lower, respectively. In the rest of the experiments we fix L to 15 and K to 20.

Segmentation. In this section we evaluate the impact of different segmentations on recognition accuracy. As explained in section 3.1, the representation is still a BoW based on trajectories, but this time we learn two different models, one for foreground and one for background.

In table 1 we report results for different segmentations on the YouTube dataset. In the first two rows we evaluate how much a perfect segmentation (using ground truth test data, once at trajectory level and once at trajectory-group level) can help to improve the final recognition. As expected, in both cases the improvement over the BoW baseline is large (resp. 7% and 5%) and brings performance above the current state-of-the-art. This shows that segmentation is crucial for good recognition. The difference between the two GT accuracies is due to the coarse representation of our trajectory-groups, which produces some quantization errors. As all the following experiments use the same trajectory-group representation, we already know that our best performance is probably bounded by 88.5%.

We first evaluate the Bottom-Up segmentation (see section 3.2). However, with this segmentation the improvement over a global bag-of-words is minimal. Next, the impact of the top-down Action-specific and Actionness segmentations is reported. While the action-specific classifier performs poorly, actionness does a better job, improving the baseline by 1.5%. Although the two models are driven by the same top-down principle, in the first case the classifier probably did not have enough data to produce a good and stable segmentation. Finally, we also report the accuracy for Random segmentation, where trajectory-groups are randomly assigned to background and foreground. Surprisingly, this model performs better than BoW, bottom-up and action-specific.

Co-segmentation       Acc.    Impr.
Action-specific       77.0%   -3.3
Actionness            85.1%   +0.1
Random                83.6%   -0.5

Table 2. Mean accuracies for co-segmentation on the YouTube dataset. Improvements (Impr.) are relative to the corresponding segmentation of table 1.

Co-segmentation. An orthogonal way to improve segmentation is co-segmentation (see section 3.3). We use the similarity among the foreground trajectory-groups from different videos of the same action to refine their segmentation. For all experiments, we set the ratio between the unary and pairwise terms to λ = 0.2 (see Eq. 3). In table 2 we report accuracy and the relative improvement produced by co-segmentation. At this point, the improvement brought by co-segmentation seems to be limited or even negative. However, as we will see in the following sections, when the segmentation gets better, the impact of co-segmentation becomes more important.

Iterative Learning. Next we evaluate the effect of the iterative learning (see section 3.4). In this experiment we can use different segmentations for the unary term in Eq. (5) as well as for the initialization of the iterative procedure (see table 3). For the unary term we test: Itr., the iterative method using just the linear model defined in Eq. (4); +Act., where we consider an additional unary term based on actionness; +Co-seg., where we also consider co-segmentation; and +Act.+Co-seg., where in conjunction with the iterative learning we use both actionness and co-segmentation. Also, as the optimization in this case is non-convex, the initialization of the segmentation is important for good results. We test three different initializations of the segmentation: random (Rnd.), based on actionness (Act.) and based on the ground-truth annotations (Ann.).

As action-specific always performs worse than actionness in the previous tables, we do not report results for the former. It is particularly interesting that the iterative learning with random initialization of the segmentation and co-segmentation (denoted in the table as (0)) does not use ground-truth annotations on the training data but is still 2.2% better than the BoW baseline. As expected, the best result is obtained when using the ground-truth annotations of the training data as initialization, although it is not far from the results obtained with actionness. So actionness seems a good approximation of the ground truth segmentation.


Iterative Learning      Init.   Acc.    Impr.
Itr.                    Rnd.    85.0%   +1.5
+Act.                   Rnd.    85.2%   +0.2
+Co-seg. (0)            Rnd.    85.7%   +2.2
+Act.+Co-seg.           Rnd.    85.7%   +0.6
Itr.                    Act.    85.2%   +1.7
+Act.                   Act.    86.1%   +1.1
+Co-seg.                Act.    86.2%   +2.7
+Act.+Co-seg.           Act.    86.7%   +1.6
Itr.                    Ann.    85.5%   +2.0
+Act.                   Ann.    86.4%   +1.4
+Co-seg.                Ann.    86.2%   +2.7
+Act.+Co-seg. (1)       Ann.    86.7%   +1.6

Table 3. Mean accuracies for iterative learning. Improvements (Impr.) are computed with respect to Actionness of table 1 for +Act., with respect to Actionness of table 2 for +Act.+Co-seg., and with respect to BoW for the rest.

Kernel                  Pool.   Acc.    Impr.
Random                  llc     84.1%   +0.0
Random+Co-seg.          llc     83.9%   +0.3
Act.                    llc     85.7%   +0.7
Act.+Co-seg.            llc     85.4%   +0.3
Random                  hard    84.2%   +0.1
Random+Co-seg.          hard    84.2%   +0.6
Act.                    hard    86.2%   +1.2
Act.+Co-seg. (2)        hard    86.8%   +1.7

Table 4. Mean accuracies for kernels. Improvements (Impr.) are relative to the corresponding configurations of tables 1 and 2.

Non-Linear Kernel. An alternative way to improve results is by mapping the features into a kernel, such that non-linear classifiers can be used. For this experiment we use the non-linear kernel and settings defined in section 3.5. Unfortunately, we cannot combine these kernels with the iterative learning, because the latter only works with linear models. We notice that when using kernels, the normalization of the foreground and background descriptors highly affects the accuracy. Thus, in all experiments that employ kernels we use 5-fold cross-validation to select the best normalization among ℓ1, ℓ2 and unnormalized descriptors. We found that for the YouTube dataset ℓ1 normalization works best, whereas for the UCF-Sports dataset it is best not to normalize the descriptors. In table 4 we report the accuracy of different configurations with sparse coding + max-pooling (llc) and with hard quantization + average pooling (hard).

Comparison with the state-of-the-art. We compare the most promising configurations of our method with the state-of-the-art. We select configuration (0), i.e. the iterative learning with co-segmentation; configuration (1), which is also based on iterative learning but initialized with ground truth annotations and combined with both actionness and co-segmentation; and configuration (2), which is based on actionness and co-segmentation mapped with a non-linear kernel. Since each of (0), (1) and (2) produces a different segmentation, we also report a weighted combination of the 3 classifiers in the last row of the table. The weights are selected by cross-validation on the training set.

Figure 2. Per-class classification accuracy for the UCF-Sports dataset.

Method                  Acc.    Impr.
Brendel et al. [4]      77.8%   -
Wang et al. [32]        84.2%   -
Sapienza et al. [25]    80%     -
Gaidon et al. [9]       87.9%   -
(0)                     85.7%   +2.2
(1)                     86.7%   +3.2
(2)                     86.8%   +3.3
(0)+(1)+(2)             87.4%   +3.9

Table 5. Performance comparison on the YouTube dataset with the state-of-the-art. Improvements (Impr.) are relative to the baseline BoW without any segmentation, reported in table 1.

Table 5 compares our results on the YouTube dataset. Improvements with respect to our baseline BoW range from +2.2 for configuration (0), which has the advantage of not using the ground-truth bounding boxes, to +3.3 for configuration (2). This clearly shows that segmentation plays a relevant role in the final recognition accuracy and that there are several methods that can readily provide such a useful segmentation. Comparing our methods with the state-of-the-art, we observe that each of the 3 methods is already comparable to or better than most of the state-of-the-art, excluding [9]. However, when combining our 3 segmentations we obtain a recognition accuracy comparable to [9].

Table 6 compares our results on the UCF-Sports dataset. On this dataset the non-linear kernel approach based on actionness and co-segmentation (2) seems better than the iterative learning. Also for this dataset the best result is obtained with the weighted combination of the 3 methods, which gives a final accuracy of 90.6% and 86.1% for the LOO evaluation and the single split, respectively. Figure 2 shows a more detailed, per-class comparison between Actionness, the method of [21] and the BoW baseline.

Method                      Acc. LOO    Acc. Split
Wang et al. [32]            88.2%       -
Kovashka et al. [14]        87.27%      -
Todorovic et al. [29]       92.1%       86.8%
Lan et al. [15]             -           73.1%
Raptis et al. [21]          -           79.4%
Shapovalova et al. [26]     -           75.3%
(0)                         76.9%       80.1%
(1)                         88.7%       81.5%
(2)                         90.0%       86.1%
(0)+(1)+(2)                 90.6%       86.1%

Table 6. Performance comparison on the UCF-Sports dataset with the state-of-the-art, based on mean per-class accuracies.

5. Conclusion

In this paper we have shown that a good video segmentation is fundamental to obtain accurate action recognition. To this end we have proposed and evaluated several ways to integrate video segmentation and action recognition. In particular, we have observed that coupling segmentation and recognition in an iterative learning scheme consistently improves action recognition accuracy. An alternative way to obtain similar results is to map the foreground and background description of the video into a non-linear kernel. Finally, we have shown that combining different segmentations can further improve our results and reach state-of-the-art performance.

6. Acknowledgments

This work was supported in part by the FP7 project AXES, the FWO project "Monitoring of abnormal activity with camera systems" and FP7 ERC Starting Grant 240530 COGNIMUND.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010.
[2] H. Bilen, V. Namboodiri, and L. Van Gool. Object and action classification with latent variables. In BMVC, 2011.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In PAMI, 2001.
[4] W. Brendel and S. Todorovic. Activities as time series of human postures. In ECCV, 2010.
[5] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
[6] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
[7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. In PAMI, 2010.
[9] A. Gaidon, Z. Harchaoui, and C. Schmid. A time series kernel for action recognition. In BMVC, 2011.
[10] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, 2012.
[11] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
[12] M. Hoai, Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In CVPR, 2011.
[13] D. S. Hochbaum and V. Singh. An efficient algorithm for co-segmentation. In ICCV, 2009.
[14] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, 2010.
[15] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In ICCV, 2011.
[16] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.
[17] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[18] J. Lezama, K. Alahari, J. Sivic, and I. Laptev. Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR, 2011.
[19] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In CVPR, 2009.
[20] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
[21] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012.
[22] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[23] A. Rosenfeld and D. Weinshall. Extracting foreground masks towards object recognition. In ICCV, 2011.
[24] J. C. Rubio, J. Serrat, and A. Lopez. Video co-segmentation. In ACCV, 2012.
[25] M. Sapienza, F. Cuzzolin, and P. Torr. Learning discriminative space-time actions from weakly labelled videos. In BMVC, 2012.
[26] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori. Similarity constrained latent support vector machine: An application to weakly supervised action classification. In ECCV, 2012.
[27] J. Shi and J. Malik. Normalized cuts and image segmentation. In PAMI, 2000.
[28] S. Stalder, H. Grabner, and L. Van Gool. Dynamic objectness for adaptive tracking. In ACCV, 2012.
[29] S. Todorovic. Human activities as stochastic Kronecker graphs. In ECCV, 2012.
[30] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-of-features action recognition with non-local cues. In BMVC, 2010.
[31] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised semantic segmentation with a multi-image model. In ICCV, 2011.
[32] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[33] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[34] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[35] C. Xu and J. J. Corso. Evaluation of super-voxel methods for early video processing. In CVPR, 2012.
