Learning Fisher star models for action recognition in space-time videos

Anonymous CVPR submission

Paper ID ****

Abstract

State-of-the-art human action classification in challenging video data is currently based on the global aggregation of space-time features to form a structureless, compact histogram representation. Attempts to incorporate structure into this bag-of-features approach are based on spatial pyramids, which assume the action parts to appear in similar spatial locations across videos - a reasonable assumption for scene categories, less so for space-time actions. In this work we propose to add temporal action structure to space-time feature representations by using pictorial star models. Local video regions are defined as subvolumes scanned densely over a space-time video, and represented by Fisher vectors, which have been shown to outperform histograms in image classification. Action root and part models are learned discriminatively, without the need for ground truth location data, by using a weakly supervised multiple-instance learning framework. Our results demonstrate that using a Fisher representation for space-time actions significantly improves on previous state-of-the-art results. Moreover, we show that automatically incorporating structure in Fisher discriminative models improves state-of-the-art results on the KTH dataset and shows clear potential for effective action localisation in both space and time.

1. Introduction

Human action recognition in challenging video data is becoming an increasingly important research area, given the huge amounts of user-generated content uploaded to the Internet each day. The detection of human actions will facilitate automatic video description and organisation, as well as online search and retrieval [21]. Furthermore, human actions provide a natural way to communicate with robots, computer games and virtual environments.

Giving machines the capability to recognise human actions from videos poses considerable challenges. The captured videos are often of low quality, and contain unconstrained camera motion, zoom, and shake. In addition, human actions interpreted as space-time sequences trace a very flexible structure, with variations in viewpoint, pose, scale, and illumination. Apart from these nuisance factors, actions inherently possess a high degree of within-class variability: for example, a walking motion may vary in stride, pace and style, yet remain the same action. Creating action models which can cope with this variability, while being able to discriminate between a significant number of action classes, is a serious challenge [23].

In this work we tackle the problem of detecting the presence/absence of an action class in a video clip (action classification). Whereas previous state-of-the-art approaches represent each video clip globally through a single Bag-of-Features (BoF) representation, we adopt an approach recently proposed in [25], in which local discriminative regions are selected and used to train action detectors and video clip classifiers. However, action detectors alone cannot generalise over space-time variations of an action, for example a person waving his hands at different intervals. To this end we make two important contributions. Firstly, we improve on the representation of local space-time regions by using Fisher vectors, which have recently been shown to outperform BoF histograms for any given vector dimension [8]. Secondly, we augment the Fisher representation with spatio-temporal structure by learning a pictorial star model for each action class, as shown in Fig. 1: the resulting models better generalise over the variability of human actions represented as space-time sequences.

2. Previous work

Current state-of-the-art human action classification systems rely on the global (video-wise) aggregation of local space-time features [15, 30], without regard to the human's position in time. For example, in the Bag-of-Features (BoF) approach, a visual vocabulary is initially learned by clustering appearance and motion features into K centroids. A query video clip is then mapped to a single finite-dimensional histogram, where each bin represents the frequency of occurrence of a visual word, after a hard assignment of each feature to its nearest cluster. Video classification is then performed using a state-of-the-art non-linear classifier, such as the χ2-kernel SVM.
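As a concrete illustration of this standard pipeline (not the authors' implementation), the following sketch builds a K-word vocabulary with k-means, maps each clip's local descriptors to an L1-normalised histogram, and classifies with a χ2-kernel SVM; scikit-learn is assumed and random arrays stand in for real space-time descriptors.

```python
# Minimal sketch of a standard BoF action classification pipeline (illustration
# only). Random data stands in for real space-time descriptors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
K = 64                                          # vocabulary size (toy value)

# One (num_features x descriptor_dim) array of local descriptors per clip.
train_clips = [rng.random((200, 32)) for _ in range(20)]
test_clips = [rng.random((200, 32)) for _ in range(5)]
y_train = rng.integers(0, 3, size=20)           # toy action labels

def bof_histogram(descriptors, vocab):
    # Hard-assign each local feature to its nearest visual word, then
    # accumulate an L1-normalised histogram of word counts.
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# 1. Learn the visual vocabulary (K centroids) from all training descriptors.
vocab = KMeans(n_clusters=K, n_init=1, random_state=0).fit(np.vstack(train_clips))

# 2. Map every clip to a single histogram over visual words.
X_train = np.array([bof_histogram(d, vocab) for d in train_clips])
X_test = np.array([bof_histogram(d, vocab) for d in test_clips])

# 3. Non-linear classification with a chi-squared kernel SVM.
clf = SVC(kernel='precomputed', C=100).fit(chi2_kernel(X_train, X_train), y_train)
predictions = clf.predict(chi2_kernel(X_test, X_train))
```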


Figure 1. A handwaving video sequence taken from the KTH dataset [26] plotted in space-time. The action location is described in space and time by a pictorial structure model, despite the latter being trained in a weakly supervised framework in which no action location annotation is available. The detection unit is made up of a root filter, drawn as a red box, and two part filters (shown in green and blue respectively), linked to the root by green and blue segments. The action score at each root location is computed as the sum of root and part responses, minus a cost associated with the deformation of each part with respect to its learned anchor point. The star model for the relevant action is detected multiple times, reflecting the continuous nature of the motion in time (best viewed in colour).

The latest action classification systems [7, 29, 25] currently use a histogram representation, even though there is evidence in image classification and retrieval [22, 21, 8] that the Fisher representation outperforms BoF histograms. To assess whether Fisher vectors benefit space-time action recognition, we modified the standard BoF pipeline to make use of Fisher vectors and tested it as a baseline algorithm on the most challenging video datasets. This Fisher baseline models the global statistics of local features like standard BoF histograms do, making it robust to various part configurations and occlusions. However, a single global representation per action clip may include irrelevant or confounding information, such as feature counts which appear in different action classes and which are not really discriminative of the action of interest [25]. On the other hand, local representations are able to capture characteristic action patches in the video. Fisher vectors are valuable here because, instead of having to resort to a linear approximation of the χ2 kernel [28] to cope with thousands of histograms per video [25], they make it possible to use efficient linear methods directly. However, both the global and local video representations fail to include the geometric structure and temporal relationships inherent in action movement, seriously limiting the ability to model the complex nature of human actions [31, 7].

Attempts to incorporate action structure into the BoF representation for video classification have been based on the spatial pyramid approach [17], where each video is divided into spatio-temporal grids [15]. In order to model human actions at a finer scale, it is desirable to localise the spatial and temporal extent of an action. Initial work by Laptev and Perez [16] learned a boosted cascade of classifiers from spatio-temporal features, and searched over an exhaustive set of action hypotheses at a discrete set of plausible spatio-temporal locations and scales. In another approach, Klaser et al. [12] split the task into two: firstly by detecting and tracking humans to determine the action location in space, and secondly by using a space-time descriptor and sliding window classifier to temporally locate two actions (phoning, standing up). In video event detection, Ke et al. [10] used a search strategy in which oversegmented space-time video regions are matched to manually constructed volumetric action templates. Inspired by the pictorial structures framework [6], which has been successful at modelling object part deformations [4], the authors of [10] split their action templates into deformable parts, making them more robust to spatial and temporal action variability. To introduce temporal structure into the BoF framework, Gaidon et al. introduced Actom Sequence Models, based on a non-parametric generative model [7]. Despite these efforts, the action localisation techniques described in [16, 12, 10] require manual labelling of the spatial and/or temporal [7] extent of the actions/parts in a training set. In contrast, we propose to learn a simple pictorial star model automatically from weakly labelled observations, without human annotation.

Following the work of [4], we enrich local Fisher representations using a simple star-structured part-based model, defined by a 'root' filter, a set of 'parts', and a structure model. In the test video sequence of Fig. 1, the best Fisher star detection responses are plotted in triplets, where the 'root' is plotted as a red cube, and the 'parts' are plotted as blue and green cuboids respectively. In contrast to everyday objects, which have a clear boundary, it is less clear which parts of the video make up an action. Therefore we train discriminative root filters by using a multiple instance learning (MIL) framework [1]. However, instead of constraining the model to detect rigid subvolumes [25], we break the subvolume in time and learn a more flexible model in order to better capture temporal variations in the execution of an action. Moreover, unlike previous works, we use a Fisher representation for each video subvolume, which has been shown to outperform BoF histograms in image classification [8]. Human actions are subsequently detected by jointly searching for the best root matches and configuration of the parts. As a result, our proposed Fisher star models are suitable for both clip classification and localisation in video datasets, without needing accurate labelling of action locations or parts.

This work makes two main contributions:

• we improve the Bag-of-Features baseline in action recognition by bringing Fisher vectors to space-time volumes, and report significantly improved state-of-the-art results on the most challenging datasets;

• we add a flexible structure to the Fisher representation by casting actions as space-time pictorial star models: these Fisher star models are also able to localise actions without annotation data, improving state-of-the-art results on the KTH dataset.

3. Methodology

The proposed action recognition system begins by extracting space-time features from video volumes, and aggregating them in local subvolumes to form a Fisher representation (§ 3.1). Subsequently, action root filters are learned in a weakly labelled framework, and split to learn part filters (§ 3.2). Finally, human actions are detected by searching for the best joint configuration of the Fisher root and part filter responses (§ 3.3).

3.1. Fisher vector representation

Amongst the wide variety of space-time interest point detectors [14, 32] and descriptors [15, 11, 9, 18], the Dense Trajectory features of Wang et al. [29] give state-of-the-art results in experiments with challenging action video data [29, 25]. A Dense Trajectory feature is formed by the concatenation of five feature types: optic flow displacement vectors, the histogram of oriented gradients and histogram of optic flow (HOG/HOF) [15], and the motion boundary histograms (MBHx/MBHy) [3]. This feature is computed over a local neighbourhood along the optic flow trajectory, and is attractive because it brings together local trajectory shape, appearance, and motion information [29].

Throughout this work, we used the Dense Trajectory features to describe space-time video blocks. For our Fisher baseline approach, the features are aggregated globally, whilst for our proposed Fisher star models, features are aggregated within local subvolumes. For both algorithms, the Dense Trajectories are transformed into a Fisher representation, which has been shown to give excellent performance in image classification and retrieval [8].

Instead of creating a visual vocabulary by clustering the feature space by k-means, as done in the BoF approach, for Fisher vectors it is assumed that the features are distributed according to a parametric Gaussian Mixture Model (GMM). Whereas in BoF the feature quantisation step is a lossy process [2], a Fisher vector is formed through the soft assignment of each feature point to each Gaussian in the visual vocabulary. Thus, with respect to the mean of each Gaussian only, each subvolume is represented by a T × K × d-dimensional non-sparse vector, where T is the number of feature types which are concatenated, K is the number of probabilistic visual words, and d is the dimensionality of each feature type [22, 8].
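For reference, the standard mean-gradient component of a Fisher vector under a diagonal-covariance GMM (following [22]) can be written as below; here N denotes the number of local descriptors aggregated in the subvolume, to avoid clashing with the T feature types above, and the exact variant used in this work is assumed rather than quoted:

\[
\mathcal{G}^{X}_{\mu_k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\, \frac{x_n - \mu_k}{\sigma_k},
\qquad
\gamma_n(k) = \frac{w_k\, u_k(x_n)}{\sum_{j=1}^{K} w_j\, u_j(x_n)},
\]

where $w_k$, $\mu_k$ and $\sigma_k$ are the mixing weight, mean and diagonal standard deviation of the $k$-th Gaussian $u_k$, and $\gamma_n(k)$ is the soft assignment of descriptor $x_n$ to that Gaussian. Stacking these $d$-dimensional gradients over the $K$ Gaussians and the $T$ feature types yields the $T \times K \times d$-dimensional vector described above.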

Using Fisher vectors has an important practical implication when using high-dimensional local representations for human action classification. Since local regions scanned densely over space-time video volumes can run into hundreds of thousands of examples [25], it is impractical to learn models efficiently using the standard χ2-kernel support vector machine (SVM) in action classification. Contrary to expectations, it has been shown that Fisher vectors achieve top results with linear SVM classifiers, which scale much more efficiently with increasing numbers of instances in the training set [22].

3.2. Learning Fisher star models

The task here is to learn filters for the root and part models to represent each action class. This problem may be cast in a weakly labelled framework since it is known that there exists at least one positive example of the action in each video clip; however, the exact location of the action is unknown. The set of possible action locations is defined by sliding a subvolume densely over a regular grid. Each subvolume is associated with a latent class variable which identifies which of the subvolumes contains a discriminative action Fisher representation.

We follow the work of [25], in which the MIL framework of Andrews et al. [1] was used to simultaneously learn discriminative subvolume instances and an action model for each class. Consider a training set $D = (\langle X_1, Y_1\rangle, \ldots, \langle X_n, Y_n\rangle)$ made up of bags $X_i = \{x_{i1}, \ldots, x_{im_i}\}$ with bag labels $Y_i \in \{+1, -1\}$. Each bag corresponds to an action video for which there exists a ground truth label. Each instance $x_{ij}$ in the bag represents the $j$-th Fisher vector in bag $i$, whose label $y_{ij}$ exists but is unknown for the positive training examples ($Y_i = +1$). A bag/video is termed positive if there exists at least one positive instance in the bag: $Y_i = \max_j \{y_{ij}\}$. The task is to recover the latent subvolume class variables $y_{ij}$ and learn an SVM instance model $\langle \mathbf{w}, b \rangle$ in the space of Fisher vectors. The maximum pattern margin formulation of MIL can be written as follows:

\[
\min_{y_{ij}} \; \min_{\mathbf{w}, b, \xi} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{ij} \xi_{ij}, \qquad (1)
\]

subject to

\[
\forall i,j: \;\; y_{ij}(\mathbf{w}^{T} \mathbf{x}_{ij} + b) \geq 1 - \xi_{ij}, \quad \xi_{ij} \geq 0, \quad y_{ij} \in \{-1, +1\},
\]
\[
\sum_{j \in i} (1 + y_{ij})/2 \geq 1 \;\; \text{s.t.} \;\; Y_i = +1, \qquad y_{ij} = -1 \;\; \forall j \in i \;\; \text{s.t.} \;\; Y_i = -1,
\]

where $\mathbf{w}$ is the normal to the separating hyperplane, $b$ is the offset, and $\xi_{ij}$ are slack variables for each instance $\mathbf{x}_{ij}$. This mi-SVM formulation results in a semi-convex optimisation problem, for which we use the heuristic algorithm of Andrews et al. [1]. The resulting SVM model $\langle \mathbf{w}, b \rangle$ represents the root filter in our space-time Fisher star model.
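The heuristic of Andrews et al. [1] alternates between training an SVM on the current instance labels and re-imputing the labels of instances in positive bags; a minimal sketch is given below (our reading of the algorithm, with scikit-learn's LinearSVC standing in for the actual solver used by the authors).

```python
# Sketch of the mi-SVM heuristic [1]: alternate between (i) training a linear
# SVM on the current instance labels and (ii) re-labelling the instances in
# positive bags from the SVM scores, forcing >= 1 positive per positive bag.
import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(bags, bag_labels, C=100, max_iter=20):
    # bags: list of (m_i x D) arrays of Fisher vectors; bag_labels: +1 / -1.
    X = np.vstack(bags)
    idx = np.cumsum([0] + [len(b) for b in bags])            # bag boundaries
    y = np.concatenate([np.full(len(b), Y) for b, Y in zip(bags, bag_labels)])

    for _ in range(max_iter):
        clf = LinearSVC(C=C).fit(X, y)
        scores = clf.decision_function(X)
        y_new = y.copy()
        for i, Y in enumerate(bag_labels):
            s, e = idx[i], idx[i + 1]
            if Y == +1:
                # Relabel positive-bag instances by the sign of their score.
                y_new[s:e] = np.where(scores[s:e] > 0, 1, -1)
                # Keep at least one positive instance per positive bag.
                if (y_new[s:e] == 1).sum() == 0:
                    y_new[s + np.argmax(scores[s:e])] = 1
        if np.array_equal(y_new, y):                         # labels converged
            break
        y = y_new
    return clf, y                                            # root filter <w, b>

# Toy usage: two bags of random 16-D "Fisher vectors".
rng = np.random.default_rng(0)
bags = [rng.standard_normal((5, 16)), rng.standard_normal((4, 16))]
root_filter, instance_labels = mi_svm(bags, bag_labels=[+1, -1])
```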

In order to learn space-time part models, we first select the best scoring root subvolumes learned in each training action clip. To perform this selection, we first prune overlapping detections with non-maximum suppression in three dimensions. Subvolumes are considered to be overlapping if their intersection over union is greater than 20%. This has the effect of generating a more diverse sample of high-scoring root subvolumes from which to learn the part models. The best scoring root subvolumes are then split into two equal-sized blocks in time, as shown in Fig. 2(a), and Fisher vectors are again calculated for each part; finally, part models are individually learned with a standard linear SVM. This approach captures local weak geometry inside the root node, and may be seen as a deformable spatial pyramid approach for representing local video regions. During testing, the action parts are allowed to move in space and time, for example to better generalise over actions in the same class warped in time, as illustrated in Fig. 2(b).
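As an illustration of the pruning step, a greedy 3D non-maximum suppression over space-time subvolumes might look as follows (our sketch; the subvolume encoding is a hypothetical convention, while the 20% overlap threshold is taken from the text).

```python
# Greedy non-maximum suppression for space-time subvolumes. Each detection is
# (x, y, t, w, h, l, score): corner position, spatial size, temporal length.

def iou_3d(a, b):
    # Intersection-over-union of two axis-aligned space-time boxes.
    inter = 1.0
    for d in range(3):
        lo = max(a[d], b[d])
        hi = min(a[d] + a[3 + d], b[d] + b[3 + d])
        inter *= max(0.0, hi - lo)
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)

def nms_3d(detections, overlap=0.2):
    # Keep the highest-scoring subvolumes, discarding any candidate whose
    # IoU with an already-kept subvolume exceeds the overlap threshold.
    order = sorted(detections, key=lambda d: d[6], reverse=True)
    kept = []
    for det in order:
        if all(iou_3d(det, k) <= overlap for k in kept):
            kept.append(det)
    return kept

# Toy usage: two heavily overlapping detections and one distant one.
dets = [(0, 0, 0, 60, 60, 60, 0.9), (5, 5, 5, 60, 60, 60, 0.8),
        (200, 0, 0, 60, 60, 60, 0.7)]
print(nms_3d(dets))   # the second detection is suppressed
```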

(a) Training Fisher root and part models. In this illustration, snapshots of a person are captured during a jumping action. The most discriminative subvolumes in the training sequences are selected with multiple instance learning (MIL), and the top scoring candidates (red solid-line cube) are split in time (dotted red line) to learn two part models.

(b) An illustrative test sequence showing the same person in (a) stretched in time. The root filter alone (solid red cube) will not be able to capture the current deformation; however, since the learned parts (solid green and blue cuboids) can move with respect to the root, a part-based model is better suited to capture action variations.

Figure 2. Action localization with a temporal pictorial star model. The root subvolume is drawn as a red cube, whilst the green and blue cuboids denote the action parts.

3.3. Action representation and matching

In the following, an action is defined in terms of a collection of space-time action parts in a pictorial structure model [6, 5, 10], as illustrated in Fig. 2(b). Following the notation of Felzenszwalb and Huttenlocher [5], let an action be represented by an undirected graph $G = (V, E)$, where $V = \{v_1, \ldots, v_p\}$ represents the $p$ parts of the action, and $(v_k, v_l) \in E$ represents a connection between parts $v_k$ and $v_l$. An instance of the action configuration is defined by $L = (l_1, \ldots, l_p)$, where $l_k \in \mathbb{R}^3$ specifies the location of part $v_k$. Consider that the detection space $S(x, y, z)$ is represented by a feature map $H$ of Fisher vectors for each subvolume at position $l_k$, and that there exists for each action part a Fisher filter $F_k$ which, when correlated with a subvolume in $H$, gives a score indicating the presence of the action part $v_k$. Thus, $F_k \cdot \phi(H, l_k)$ measures the correlation between filter $F_k$ and the feature map $H$ at location $l_k$ in the video. Let the distance between action parts, $d_{kl}(l_k, l_l)$, be a cost function measuring the degree of deformation of connected parts from a model. The best action configuration may be found by maximising the following equation [5]:

\[
L^{*} = \arg\max_{L} \left( \sum_{k=0}^{p} F_k \cdot \phi(H, l_k) \;-\; \sum_{k=1}^{p} d_{kl}(l_k, l_l) \right), \qquad (2)
\]

which optimises the appearance and configuration of the action parts simultaneously. Felzenszwalb and Huttenlocher [4] describe an efficient method to solve the maximization of (2) by using a generalized distance transform. Instead, we used an exhaustive search strategy, which is quick because only high-scoring root nodes need to be evaluated with the part responses, and memory efficient because there is no need to generate a dense 3D response map for each part, which can consume several GB of memory per video. Adapting the efficient search strategy implementation of [4] for videos is left for future work.

We model the relative position of each part with respect to the root node centre of mass as a Gaussian with diagonal covariance [10]:

\[
d_{kl}(l_k, l_l) = \beta\, \mathcal{N}(l_k - l_l;\; s_{kl}, \Sigma_{kl}), \qquad (3)
\]

where $l_k - l_l$ is the displacement between parts $v_k$ and $v_l$, $s_{kl}$ is the mean offset and represents the anchor point of each part with respect to the root, and $\Sigma_{kl}$ is the diagonal covariance. The parameter $\beta$, which adjusts the weighting between appearance and configuration scores, is set to 0.01 throughout. The mean offset is taken automatically from the geometrical configuration resulting from the splitting of the root filter, and is set to the difference between the root's and the part's centres of mass. The covariance of each Gaussian is set to half the size of the root filter.

During testing, each action clip sequence is scanned for high-scoring Fisher star configurations. A query action sequence is assigned to the action class for which the maximum response is obtained.
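To make the matching concrete, the sketch below scores one star configuration in the spirit of Eqs. (2) and (3): only high-scoring root locations are expanded, and each part is placed at the location maximising its filter response minus a deformation penalty. We interpret the penalty as the negative log of the Gaussian in Eq. (3), a common convention that may differ from the authors' exact implementation, and filter responses are abstracted as precomputed score maps.

```python
# Exhaustive star-model matching: for each high-scoring root location, place
# every part at the position maximising (part response - deformation cost),
# cf. Eqs. (2)-(3). Responses are dictionaries {location: filter score}.
import numpy as np

def deformation_cost(offset, anchor, cov_diag, beta=0.01):
    # Negative log of an (unnormalised) diagonal Gaussian, scaled by beta,
    # penalising displacement of a part from its learned anchor point.
    d = np.asarray(offset, float) - np.asarray(anchor, float)
    return beta * 0.5 * np.sum(d * d / np.asarray(cov_diag, float))

def score_star(root_loc, root_score, part_responses, anchors, cov_diag):
    # part_responses: one {location: score} map per part; anchors: learned
    # mean offsets of each part with respect to the root centre.
    total, placement = root_score, [root_loc]
    for resp, anchor in zip(part_responses, anchors):
        best_loc, best_score = max(
            resp.items(),
            key=lambda kv: kv[1] - deformation_cost(
                np.subtract(kv[0], root_loc), anchor, cov_diag),
        )
        total += best_score - deformation_cost(
            np.subtract(best_loc, root_loc), anchor, cov_diag)
        placement.append(best_loc)
    return total, placement

# Toy usage: one root at (40, 40, 40) and two parts scanned on a sparse grid.
root_loc, root_score = (40, 40, 40), 1.5
parts = [{(40, 40, 10): 0.8, (40, 40, 70): 0.4},
         {(40, 40, 70): 0.9, (40, 40, 10): 0.2}]
anchors = [(0, 0, -30), (0, 0, 30)]            # parts anchored before/after root
print(score_star(root_loc, root_score, parts, anchors, cov_diag=(30, 30, 30)))
```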

4. Experimental Evaluation & Discussion

In the following experiments, we first evaluate our Fisher baseline approach (§ 4.3) on four of the most challenging classification datasets (§ 4.1), achieving substantially superior state-of-the-art results. The longer training and evaluation time of our proposed Fisher star models meant that we were only able to validate them on the smaller KTH dataset. However, we show state-of-the-art classification and promising qualitative localisation results (§ 4.4), which we plan to extend in future work (§ 4.5). All our experiments were carried out on an 8-core, 2 GHz workstation with 32 GB of memory.

4.1. Datasets and performance measures

The KTH dataset [26] contains 6 action classes, each performed by 25 actors in four scenarios. People perform repetitive actions at different speeds and orientations. Sequences are longer when compared to the YouTube [19] or HMDB51 [13] datasets, and contain clips in which the actors move in and out of the scene during the same sequence. We split the video samples into training and test sets as in [26], and consider each video clip in the dataset to be a single action sequence.

The YouTube dataset [19] contains 11 action categories and presents several challenges due to camera motion, object appearance, scale, viewpoint and cluttered backgrounds. The 1600 video sequences are split into 25 groups, and we follow the authors' evaluation procedure of 25-fold, leave-one-out cross validation.

The Hollywood2 dataset [20] contains 12 action classes collected from 69 different Hollywood movies. There are a total of 1707 action samples containing realistic, unconstrained human and camera motion. The dataset is divided into 823 training and 884 testing sequences, as in [20], each between 5 and 25 seconds long.

The HMDB dataset [13] contains 51 action classes, with a total of 6849 video clips collected from movies, the Prelinger archive, YouTube and Google videos. Each action category contains a minimum of 101 clips. We use the non-stabilised videos with the same three train-test splits as the authors [13].

To quantify the respective algorithm performance, we report the accuracy (Acc), mean average precision (mAP), and mean F1 score (mF1) [25].
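Assuming scikit-learn's metric implementations, these measures can be computed per class and averaged as sketched below; the exact averaging conventions of [25] are not reproduced here and are an assumption.

```python
# Accuracy, mean average precision, and mean F1 over classes, computed from
# per-class decision scores and predicted labels (scikit-learn conventions).
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

def evaluate(y_true, y_pred, scores, n_classes):
    # scores: (n_samples x n_classes) array of classifier decision values.
    acc = accuracy_score(y_true, y_pred)
    ap = [average_precision_score((np.asarray(y_true) == c).astype(int), scores[:, c])
          for c in range(n_classes)]
    f1 = f1_score(y_true, y_pred, average=None, labels=list(range(n_classes)))
    return acc, float(np.mean(ap)), float(np.mean(f1))

# Toy usage with three classes.
y_true = [0, 1, 2, 1, 0]
scores = np.random.default_rng(0).random((5, 3))
y_pred = scores.argmax(axis=1)
print(evaluate(y_true, y_pred, scores, n_classes=3))
```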

4.2. Feature and representation parameters

The Dense Trajectory features are computed in video blocks of size 32 × 32 pixels over 15 frames, with a dense sampling step size of 5 pixels, as set by default [29]. Each feature is split into its 5 types (trajectory 30-D, HOG 96-D, HOF 108-D, MBHx 96-D, MBHy 96-D), and each type is independently reduced to 24 dimensions using Principal Component Analysis (PCA) [24]. For each feature type, a separate visual vocabulary is built with 128 Gaussians, where each dictionary is computed by randomly sampling 10,000 features per cluster. These settings were chosen to balance the higher accuracy obtainable with larger dictionaries against the computational and storage cost associated with thousands of high-dimensional vectors. With these parameters, each Fisher vector is of dimension 5 × 128 × 24 = 15,360. We follow Perronnin et al. and apply power normalization followed by L2 normalization to each Fisher vector component separately [22], before normalising them jointly.
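The sketch below mirrors this setup numerically (PCA to 24 dimensions per feature type, a 128-component diagonal GMM per type, and power followed by L2 normalisation), with scikit-learn used as a stand-in for the authors' tooling and random arrays in place of sampled descriptors.

```python
# Per-feature-type PCA + GMM vocabulary, and power/L2 normalisation of the
# resulting Fisher vector blocks (parameter values as in Sec. 4.2).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

TYPE_DIMS = {'traj': 30, 'hog': 96, 'hof': 108, 'mbhx': 96, 'mbhy': 96}
K, D_PCA = 128, 24                       # Gaussians per type, PCA dimension

rng = np.random.default_rng(0)
vocabularies = {}
for name, dim in TYPE_DIMS.items():
    feats = rng.random((5000, dim))      # stand-in for sampled descriptors
    pca = PCA(n_components=D_PCA).fit(feats)
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          max_iter=20, random_state=0).fit(pca.transform(feats))
    vocabularies[name] = (pca, gmm)

# Mean-gradient Fisher vectors give K x D_PCA dims per type: 5 x 128 x 24.
fv_dim = len(TYPE_DIMS) * K * D_PCA
print(fv_dim)                            # 15360

def normalise(fv):
    # Power normalisation followed by L2 normalisation [22].
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / max(np.linalg.norm(fv), 1e-12)
```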

4.3. Fisher baseline algorithm

For the Fisher baseline algorithm, we modified the action classification pipeline described in [30, 29] by replacing BoF histograms with Fisher vectors. Since each video is described by a single Fisher vector, the storage and computational complexity associated with the datasets is relatively low: thus, we report the performance achieved using an exact linear SVM. Multi-class classification is performed using the one-vs-all approach, and we keep C = 100 throughout [30, 18].

Table 1. Fisher baseline quantitative results.

KTH               Acc         mAP         mF1
State-of-the-art  96.76 [25]  97.02 [25]  96.04 [25]
Fisher baseline   95.83       97.71       95.83

YOUTUBE           Acc         mAP         mF1
State-of-the-art  84.2 [29]   86.10 [25]  77.35 [25]
Fisher baseline   88.44       92.98       88.27

HOHA2             Acc         mAP         mF1
State-of-the-art  39.63 [25]  58.3 [29]   39.42 [25]
Fisher baseline   62.22       55.89       51.22

HMDB              Acc         mAP         mF1
State-of-the-art  31.53 [25]  31.39 [25]  25.41 [25]
Fisher baseline   47.58       47.46       46.40

Clearly, the baseline approach does not take into consideration the spatial configuration of the events in the video, or the action dynamics. However, its computational efficiency and excellent results, as we will see, cannot be ignored. The baseline shows what level of discrimination can be achieved by global approaches on a dataset, and provides a yardstick for more complex algorithms to surpass. The current best results reported in the literature and the results of our Fisher baseline are listed in Table 1. We will make the code for the baseline algorithm available online¹.
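As a reference, the classification stage of this baseline amounts to a one-vs-all linear SVM with C = 100 over per-video Fisher vectors; a minimal sketch, assuming scikit-learn (the authors do not specify their SVM package) and toy data, is given below.

```python
# One-vs-all linear SVM on per-video Fisher vectors, C = 100 as in the text.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 15360)).astype(np.float32)  # toy Fisher vectors
y_train = rng.integers(0, 6, size=50)                          # toy KTH-style labels
X_test = rng.standard_normal((10, 15360)).astype(np.float32)

clf = OneVsRestClassifier(LinearSVC(C=100)).fit(X_train, y_train)
predicted_actions = clf.predict(X_test)
```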

4.4. Fisher star models

The exact same parameter setup as in the baseline was used to generate Fisher vectors for each subvolume in a video, in order to train the best root node via MIL. Root node subvolumes were scanned over a regular grid with a grid spacing of 20 pixels in space and time. The dimensions of the root subvolume were set to [60-60-60], the smallest size used in [25], which strikes a good balance between the finer action localisation possible with smaller subvolumes and the higher computational cost associated with many more subvolumes per video. The part model subvolumes were set to half the size of the resulting learned root subvolumes, and scanned along the video at half the grid spacing of the root. In contrast to the Fisher baseline, which predicts only one class label per action clip, the Fisher star models also predict the location of discriminative parts of the action with a space-time pictorial structure model. This is clearly seen in Fig. 3, where the top and side views of a boxing video sequence from the KTH dataset are plotted in space-time. Since the number of Fisher vectors per video is now in the thousands, each of ≈15k dimensions, the root Fisher filters are learned via linear SVMs with a hinge loss in the primal formulation, using stochastic sub-gradient methods [27]. We perform 5-fold cross validation [13] on the training set to pick the SVM regularization cost.

¹ URL for code

Figure 3. (a) Top and (b) side views of a test boxing video sequence from the KTH dataset. Plotted are the top 5 part configurations found in the video volume. The simple tree-structured star models are drawn with blue and green links to the root node.
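For reference, the primal hinge-loss optimisation used for the root filters above can be sketched with a Pegasos-style [27] stochastic sub-gradient update as below (a simplified illustration without bias term or projection step, not the authors' code).

```python
# Pegasos-style stochastic sub-gradient descent for the primal linear SVM
# objective  lambda/2 ||w||^2 + mean(hinge loss)  [27] (simplified sketch).
import numpy as np

def pegasos(X, y, lam=1e-4, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                  # decreasing step size
            margin = y[i] * (w @ X[i])
            w *= (1.0 - eta * lam)                 # shrink (regulariser step)
            if margin < 1.0:                       # sub-gradient of hinge loss
                w += eta * y[i] * X[i]
    return w

# Toy usage on linearly separable data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (50, 8)), rng.normal(-1, 1, (50, 8))])
y = np.array([+1] * 50 + [-1] * 50)
w = pegasos(X, y)
print(np.mean(np.sign(X @ w) == y))               # training accuracy
```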

4.5. Discussion

Updating the standard BoF baseline to incorporate Fisher vectors instead of histograms has been shown to strongly improve results on the most challenging datasets (Table 1). For instance, on the YouTube dataset we report a 4%, 6%, and 11% improvement in accuracy, mAP and F1 score respectively. Since the release of the HMDB dataset, featuring 51 action classes, the accuracy we report is more than double that reported by the authors of the dataset in 2011 (23.18%) [13]. On the Hollywood2 (HOHA2) dataset, we achieve slightly lower performance than the mAP of [29], but still increase the previously reported accuracy and mF1 scores [25] by a considerable margin. These results stand to show that the right combination of features, video volume representation, and available linear classifiers can go a long way and produce state-of-the-art results. Interestingly, the same jump in improvement is not seen for the simpler KTH dataset; however, there is still a small improvement in terms of mAP. In the second experiment, in which we add structure to the Fisher representation, results on the KTH dataset improve further in terms of Acc and mF1 (Table 2).

The Fisher star models have so far been evaluated only on the KTH dataset, since this dataset is smaller than the rest and demands less computational resources. As of writing, we were not able to produce results for any other dataset, and adapting this method to work on thousands of videos in the wild is left for future work. The space-time plots of Figs. 4(a)-(f) show some of the visual results obtained when detecting actions with Fisher star models; it can be seen that the highest scoring detections do indeed correspond to discriminative parts of the action. For example, in the boxing video of Fig. 4(a) the root and part nodes are centred on the person's arms, whilst in the running video of Fig. 4(f) the detections appear around the legs of the actor. It is important to point out that, in addition to the baseline, which predicts only one action class per video, our method also predicts the location of the action with a pictorial star structure in space and time. The location of the Fisher star detections is plotted qualitatively because no ground truth location data is available for the KTH dataset. Thus, our proposed method is well suited for other action localisation datasets [16, 12, 7], which we leave for future work. In Table 2 we report the Fisher baseline quantitative results (no structure), the results obtained with a single root node only, and the results obtained by including the parts and structure. It can be seen that between the Fisher root model and the Fisher 3-star model there is an appreciable improvement.

Figure 4. The top 5 Fisher star configurations detected for each action class in the KTH dataset. The first three sequences, (a)-(c), show boxing, handwaving and handclapping videos respectively, plotted in space and time. In all three sequences the main focus is on the hand movements, and this is reflected by the location of the root and part subvolumes. In the last three sequences, (d)-(f), walking, jogging and running, the movement and salient parts of the action in time are also captured by the star models.

Table 2. Fisher star model quantitative results on the KTH dataset.

Perf. measure        Acc         mAP         mF1
State-of-the-art     96.76 [25]  97.02 [25]  96.04 [25]
Fisher baseline      95.83       97.71       95.83
Fisher root model    94.91       97.25       94.92
Fisher 3-star model  96.76       97.63       96.75

5. Conclusion

In this work we proposed to improve human action representation using the Fisher representation and pictorial star models. Our baseline and Fisher star model results demonstrated the superiority of the Fisher vector representation over BoF histograms for human action video classification. Furthermore, by using Fisher star models, we were able to better model the variability of human actions in space-time volumes on the KTH dataset, which is reflected in the higher performance achieved by a 3-part model compared to that of the root filter alone. Finally, our qualitative results show that the Fisher star models are able to capture salient parts of the action sequences. In the near future we plan to extend this work to include pictorial structures with an arbitrary, and possibly variable, number of parts. Our encouraging localization results demonstrate the potential for extending this method to larger and more challenging localisation datasets.

References

[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2003.
[2] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[3] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
[4] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010.
[5] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1), 2005.
[6] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Trans. Computers, 22(1):67–92, 1973.
[7] A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models for efficient action detection. In CVPR, 2011.
[8] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. PAMI, 34(9):1704–1716, 2011.
[9] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Volumetric features for video event detection. IJCV, 88(3):339–362, 2010.
[11] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
[12] A. Klaser, M. Marszałek, C. Schmid, and A. Zisserman. Human focused action localization in video. In International Workshop on Sign, Gesture, Activity, 2010.
[13] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[14] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.
[15] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[16] I. Laptev and P. Perez. Retrieving actions in movies. In ICCV, 2007.
[17] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[18] Q. Le, W. Zou, S. Yeung, and A. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[19] J. Liu, J. Luo, and M. Shah. Recognising realistic actions from videos “in the wild”. In BMVC, 2009.
[20] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[21] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, 2010.
[22] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[23] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28:976–990, 2010.
[24] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.
[25] M. Sapienza, F. Cuzzolin, and P. H. Torr. Learning discriminative space-time actions from weakly labelled videos. In BMVC, 2012.
[26] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[27] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, Series B, 127(1):3–30, 2011.
[28] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, 2010.
[29] H. Wang, A. Klaser, C. Schmid, and C. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[30] H. Wang, M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[31] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.
[32] G. Willems, T. Tuytelaars, and L. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008.
