

Budget-Aware Activity Detection with A Recurrent Policy Network

Behrooz Mahasseni†1

https://mahasseb.github.io

Xiaodong Yang2

http://xiaodongyang.org

Pavlo Molchanov2

[email protected]

Jan Kautz2

http://jankautz.com

1 Oregon State University
2 NVIDIA Research

Abstract

In this paper, we address the challenging problem of efficient temporal activity detection in untrimmed long videos. While most recent work has focused on and advanced detection accuracy, inference can take seconds to minutes per video, which is too slow to be useful in real-world settings. This motivates the proposed budget-aware framework, which learns to perform activity detection by intelligently selecting a small subset of frames according to a specified time budget. We formulate this problem as a Markov decision process, and adopt a recurrent network to model the frame selection policy. We derive a recurrent policy gradient based approach to approximate the gradient of the non-decomposable and non-differentiable objective defined in our problem. In extensive experiments, we achieve competitive detection accuracy and, more importantly, substantially reduce computation time, detecting multiple activities in only 0.35s per untrimmed long video.

1 Introduction
Efficient temporal activity detection in untrimmed long videos is fundamental for intelligent video analytics, including automatic categorizing, searching, indexing, segmentation, and retrieval of videos. This is a challenging problem as algorithms must (1) determine whether a specific activity occurs in an untrimmed video; (2) identify the temporal extent of each activity; and (3) maximize detection accuracy within a given time budget. In temporal activity detection, the most time consuming step is the execution of CNNs or hand-crafted feature extractors on every sliding window or proposal segment [4, 8, 30], typically taking seconds to minutes to process one video in THUMOS14 [9]. Unfortunately, this rules out the practical use of these methods for applications that require real-time and large-scale video processing. Although hardware solutions can help meet the constraints in some scenarios, it is equally important to establish a better understanding of how to achieve maximal detection accuracy given the constraints on time and resources.

© 2018. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.
† Work done during an internship at NVIDIA Research.


Figure 1: Given an untrimmed long video, at each step t the policy has access to the local observation o_t of a neighborhood centered around the currently selected frame at ξ_t. At each step, the policy predicts a segment m_t and produces three outputs: the temporal location l_t (i.e., start and end) of the segment, the estimated class c_t associated with the segment, and the next frame to observe at ξ_{t+1}. According to a specified time budget, the policy runs for T steps and then completes the detection process.

Recently, there has been fast growing interest in temporal activity detection. Most existing work [11, 12, 18, 22, 30] hinges on a large set of features and classifiers that exhaustively run over every time step at multiple temporal scales. This sliding window scheme is computationally prohibitive for applications such as those running on mobile and embedded devices. To avoid such exhaustive evaluations, a number of action proposal algorithms [4, 8, 14] have been proposed to produce a set of candidate temporal segments that are likely to contain a certain action. A separate classifier is then applied to these proposal segments for action classification. However, we argue that it is suboptimal to divide temporal activity detection into two disjoint steps: proposal and classification. Moreover, the large number of proposal segments, e.g., thousands of proposals per video, is still unsatisfactory in terms of computational efficiency.

In this paper, we address this problem by introducing a fully end-to-end and budget-aware framework, which learns to optimally select a small number of video frames according to a time budget to perform temporal activity detection. We formalize the frame selection as a Markov decision process (MDP), and adopt a recurrent network to model the policy for selecting frames. We develop a policy gradient approach based on reinforcement learning to approximate the gradient of our non-decomposable and non-differentiable objective function. Figure 1 illustrates the detection process of our approach.

The main contributions of this paper are summarized as follows. First, we achieve competitive accuracy by using only 0.35s for each untrimmed minutes-long video, speeding up detection by orders of magnitude compared to existing methods. Second, we present a fully end-to-end policy based model with a single training phase that handles the activity classes in their entirety. Third, to our knowledge, we provide the first approach that directly optimizes the mean average precision (mAP) criterion, i.e., the final evaluation metric, in the objective function for activity detection.


2 Related Work
A large family of research in video activity understanding addresses activity classification, which provides useful tools for temporal activity detection, such as the two-stream networks with separate CNNs operating on the color and optical flow modalities [16, 27], and C3D and I3D using 3D-CNNs to process short video clips [1, 21]. RNNs can be applied with CNNs to model temporal dynamics and handle variable-length video sequences [6, 28].

Another active research line is spatio-temporal action detection, which focuses on localizing action regions over consecutive video frames. A number of methods have been proposed, from spatio-temporal proposals such as supervoxels [19] and frame-level object detection followed by a linking or tracking algorithm [5], to the recent video SSD approach [10]. However, these methods are mostly applied to short video snippets; in contrast, temporal activity detection targets untrimmed videos that involve complex actions, objects, and scenes evolving over long sequences. Therefore, efficient inference under a time budget is much in demand for the temporal activity detection task.

A majority of the existing approaches [11, 12, 18, 22, 30] for temporal activity detection focus on extracting various features to represent sliding windows and subsequently classifying them with SVMs trained on the multiple features. Alternatively, proposal based methods can be used to generate action segments to replace exhaustive sliding windows. A sparse learning based framework is presented in [8] to produce segment proposals with high recall. Escorcia et al. [4] introduce a proposal algorithm based on C3D and LSTM. Shou et al. [14] propose a 3D-CNN based multi-stage framework with three different networks for proposal, classification, and localization. A convolutional-de-convolutional network is further introduced in [15] to improve the precision of temporal boundaries of proposal segments. These proposal based detection methods are by nature stage-wise, and therefore not end-to-end trainable. R-C3D [26] was recently developed to save computation by sharing convolutional features between the proposal and classification pipelines.

Yeung et al. [29] present an end-to-end learning method, which directly predicts temporal boundaries from raw videos and exhibits fast inference. In that work, a recurrent attention model is learned to select a subset of frames to interact with, while maintaining high detection accuracy. Our work differs from [29] mainly in: (1) unlike their binary model specifically trained for each action class, our approach handles multiple classes; (2) rather than using a separate emission signal to identify foreground segments, we consider all predicted outputs as valid segments since we include the background as an additional class; (3) their reward function is designed to maximize true positives and minimize false positives, while our retrieval loss is directly defined on the mAP; and (4) instead of using two schemes (i.e., back-propagation for candidate detection and REINFORCE for the prediction indicator and next observation) to train their learning agent, we employ a unified recurrent policy gradient to train the entire policy altogether.

3 Problem Formulation
Given a video v and a set of activity labels L, our goal is to predict a single label from L for each frame. We call each temporal extent consisting of consecutive frames with the same label a semantic temporal segment. As stated in Section 1, given a limited time budget, it is infeasible to process every single frame in a video. So we aim to detect and classify the foreground segments by observing only a small subset of video frames x ⊂ v.


Assuming limited access to the frames of v, finding the optimal frame subset x is inherently a sequential decision making task. Accordingly, we draw on ideas from reinforcement learning, an area that focuses on learning for sequential decision making problems. Our aim is to learn a policy π, parameterized by θ, to sequentially select frames from v and form the subset x. Alongside the selection process, π outputs the current belief about the foreground segment and the associated class label. This sequential decision making process intuitively resembles how humans search for activities in a video, i.e., we iteratively refine our estimated temporal boundaries by sequentially choosing a few frames to observe.

Let G denote the ground truth segments in v, and M_x be the set of estimated semantic temporal segments obtained from observing x. We define the deterministic indicator I_{m,g} to identify whether an estimated segment m ∈ M_x is assigned to a ground truth segment g ∈ G:

\[
I_{m,g} =
\begin{cases}
1 & g = \arg\max_{g' \in G} \alpha(m, g') \ \ \text{subject to} \ \ \alpha > 0, \\
0 & \text{otherwise},
\end{cases}
\tag{1}
\]

where α(m, g′) is the temporal intersection over union (IoU) between segments m and g′.
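As a concrete illustration, the assignment rule of Eq. (1) reduces to computing 1D interval overlaps. The sketch below (plain NumPy; the helper names are ours, not from the paper's implementation) returns the ground-truth index assigned to an estimated segment, or None when every IoU is zero:

```python
import numpy as np

def temporal_iou(m, g):
    """IoU between two temporal segments, each given as a (start, end) pair."""
    inter = max(0.0, min(m[1], g[1]) - max(m[0], g[0]))
    union = (m[1] - m[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def assign_segment(m, ground_truth):
    """Eq. (1): assign m to the ground-truth segment with the largest positive IoU."""
    ious = np.array([temporal_iou(m, g) for g in ground_truth])
    best = int(np.argmax(ious))
    return best if ious[best] > 0 else None  # None means I_{m,g} = 0 for all g
```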

Let c_m and c_g denote the predicted probability distribution and the one-hot representation of the class label for segments m and g, respectively. For a subset of selected frames x and a set of predicted segments M_x, our loss is defined as:

\[
L_\theta = \sum_{m \in M_x} \sum_{g \in G} I_{m,g} \big[ \lambda_c \Delta_{cls}(c_m, c_g) + \lambda_l \Delta_{loc}(l_m, l_g) \big] + \lambda_r \Delta_{ret}(M_x, G),
\tag{2}
\]

where Δ_cls is the multi-class classification error, Δ_loc is the localization error with l_m and l_g identifying the locations of segments m and g, and Δ_ret is the segment retrieval error. The most important property of Δ_ret is that while it encourages the model to detect all foreground segments, it also discourages the model from producing many false positives.

We now explain how to formulate each individual error defined in Eq. (2). In contrast to the binary classification loss used in [29], we employ a multi-class cross-entropy loss Δ_cls = −c_g log c_m. Unlike [29], which penalizes localization based on the absolute error, we believe this loss should also depend on the duration of a segment, i.e., the same amount of absolute error should be treated differently for short and long intervals. Intuitively, if the policy makes a small error on a short segment, this error should be considered relatively large; otherwise the algorithm would ignore small segments. With this intention, we define Δ_loc(l_m, l_g) = ζ(g) × ‖(m_s, m_e), (g_s, g_e)‖, where ζ(g) is a scaling factor that depends on the length of segment g, ‖·‖ is the distance between two segments, and m_s and m_e are the start and end of segment m (similarly for segment g). To define the segment retrieval loss Δ_ret(M_x, G), we use the mAP criterion, where the mean is taken over class labels, and the AP for each individual class is defined as AP(M_x, G) = Σ_i Prec(M_x(i), G) × ΔRecall, where M_x(i) is the subset of M_x up to the i-th segment ranked by overlap with the ground truth, Prec(·) is the detection precision, and ΔRecall is the change in recall from the previous subset.
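The sketch below makes the three error terms concrete (reusing temporal_iou from the earlier sketch). The exact forms of the length-dependent scaling ζ and the segment distance are not specified in the text, so the choices here (ζ(g) = 1/duration(g), L1 boundary distance) are assumptions, and the AP computation ignores one-to-one matching of ground-truth segments for brevity:

```python
import numpy as np

def cls_error(c_m, c_g, eps=1e-8):
    """Multi-class cross-entropy: c_m is a predicted distribution, c_g is one-hot."""
    return float(-np.sum(c_g * np.log(c_m + eps)))

def loc_error(l_m, l_g):
    """Length-scaled localization error: the same absolute boundary error
    counts more for short ground-truth segments (assumed zeta and distance)."""
    zeta = 1.0 / max(l_g[1] - l_g[0], 1e-8)
    return zeta * (abs(l_m[0] - l_g[0]) + abs(l_m[1] - l_g[1]))

def average_precision(pred_segments, gt_segments, iou_thr=0.5):
    """AP for one class: sum of precision x delta-recall over predictions
    ranked by overlap with the ground truth, as described in the text."""
    ious = [max((temporal_iou(m, g) for g in gt_segments), default=0.0)
            for m in pred_segments]
    order = np.argsort(ious)[::-1]
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, idx in enumerate(order, start=1):
        tp += ious[idx] >= iou_thr
        precision = tp / rank
        recall = tp / max(len(gt_segments), 1)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Δ_ret(M_x, G) can then be taken as, e.g., one minus the mean of average_precision over the class labels, so that a higher mAP yields a lower retrieval error.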

Given a training set of N videos {v_1, ..., v_N}, our goal is to find θ that minimizes:

\[
\theta^{*} = \arg\min_{\theta} \Big[ \mathbb{E}(L_\theta) \approx \frac{1}{N} \sum_{n=1}^{N} L_\theta(M_{x_n}, G_n) \Big].
\tag{3}
\]

Unfortunately, standard back-propagation is not applicable to learning the parameters in Eq. (3), as the objective function in Eq. (2) contains non-differentiable components. This is mainly due to the non-decomposable AP, as well as the sequential decision making process in selecting video frames. To overcome this difficulty, we reformulate our problem as a reinforcement learning problem, which allows us to define a reward function equivalent to the original objective.


4 Recurrent Policy Network
In this section, we describe the proposed representation for the policy π, and present the approach for learning the parameters θ based on recurrent policy gradient estimation.

4.1 Policy Representation
Our activity detection agent makes a sequence of predictions based on the local information from the most recently observed frame. At each step, the policy produces three outputs: the estimated start and end of the current potential temporal segment, the predicted class label associated with the segment, and the next frame to observe. Unlike the binary model used in [29], our approach enables a multi-class classifier, which means that we only need to train a single policy rather than multiple different policies. Note that this also allows us to avoid the binary prediction indicator signal used in [29], since we can directly discard segments predicted with the background label.

Due to the local observation at each step, the policy has no access to the global state (i.e., the entire video). This resembles the partially observable Markov decision process (POMDP), which assumes that despite the existence of a global state, for practical reasons an agent does not have a full observation of it. We adopt the recurrent policy gradient approach [24] to maintain an approximate belief of the current state s_t with an LSTM.

In particular, suppose at step t the current frame is i; the policy π makes a decision based on (1) the local information of a neighborhood N_i centered around i and (2) the history of previous observations. We capture the local information through an observation feature o_t = [ψ(N_i), φ(N_i), ξ_t], where ψ(N_i) is an indicator vector that identifies whether each frame in N_i has been previously selected, φ(N_i) is the average of the per-class confidences predicted in N_i, and ξ_t ∈ [0, 1] is the normalized location of the current frame at step t. The inclusion of ξ_t helps encourage the policy to cover broader video content; in our experiments, excluding ξ_t results in considerable over-selection of frames. Note that for φ, we compute the averaged confidence of estimated segments that share the frames in N_i. As for the history of decision making, we use the hidden state h_{t−1} of the LSTM to maintain the context of previous observations up to step t.
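A sketch of how the observation o_t = [ψ(N_i), φ(N_i), ξ_t] could be assembled, assuming a fixed neighborhood of 2k+1 frames and a running table of per-class confidences (the array names and the value of k are ours):

```python
import numpy as np

def build_observation(selected_mask, class_conf, i, xi_t, k=5):
    """o_t for a neighborhood N_i of 2k+1 frames centered around frame i.

    selected_mask : (num_frames,) bool array of frames already selected
    class_conf    : (num_frames, num_classes) per-class confidences so far
    xi_t          : normalized location of the current frame, in [0, 1]
    """
    lo, hi = max(0, i - k), min(len(selected_mask), i + k + 1)
    psi = np.zeros(2 * k + 1)                     # psi(N_i): which frames were already selected
    psi[lo - (i - k):hi - (i - k)] = selected_mask[lo:hi]
    phi = class_conf[lo:hi].mean(axis=0)          # phi(N_i): average per-class confidence
    return np.concatenate([psi, phi, [xi_t]])     # o_t = [psi, phi, xi_t]
```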

To summarize, the global state at step t is approximated by the internal state h_t of the LSTM, which depends on the current observation o_t and the previous state h_{t−1}. Given h_t, the outputs of the policy π are ν_t = [l_t, c_t, ξ_{t+1}]: (1) the location l_t of an estimated temporal segment, (2) the probability distribution c_t over activity class labels, and (3) the location ξ_{t+1} of the next observation. Note that our formulation allows the policy to perform both forward and backward selections. To further improve exploration during training, instead of directly using ξ_{t+1}, the next selected location is sampled from a Gaussian distribution with mean ξ_{t+1} and a fixed variance (e.g., 0.18 in our experiments).
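One policy step then maps (o_t, h_{t−1}) through the LSTM to ν_t = [l_t, c_t, ξ_{t+1}] and, during training, perturbs the next location for exploration. The sketch below uses generic linear heads and a callable LSTM cell as stand-ins for the actual architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_step(o_t, h_prev, lstm_cell, heads, rng, var=0.18, explore=True):
    """One step of the recurrent policy: nu_t = [l_t, c_t, xi_{t+1}].

    lstm_cell : callable (o_t, h_prev) -> h_t   (recurrent core)
    heads     : dict of linear weights for the three output heads (assumed)
    """
    h_t = lstm_cell(o_t, h_prev)
    l_t = heads["loc"] @ h_t                    # start/end of the estimated segment
    c_t = softmax(heads["cls"] @ h_t)           # class distribution incl. background
    xi_next = float(heads["next"] @ h_t)
    if explore:                                 # Gaussian exploration around xi_{t+1}
        xi_next = rng.normal(xi_next, np.sqrt(var))
    return h_t, l_t, c_t, float(np.clip(xi_next, 0.0, 1.0))
```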

4.2 Policy Learning
The goal of policy learning is to jointly optimize the parameters of π by minimizing the loss of a sequence of policy actions as defined in Eq. (2). These actions are taken from the initial state s_0, where no frames are selected, until the final state s_T, where T is the number of steps specified according to a time budget.

The main difficulty in policy learning is that the estimated temporal segments M_x for a video are computed through a sequence of policy decisions, resulting in a non-decomposable and non-differentiable objective function. Moreover, a decision the policy makes at any step depends on the history of decisions made in previous steps, and also affects the decisions available to the policy in the future. Among the potential algorithms for addressing similar POMDP problems [2, 3, 24, 25], we adopt the recurrent policy gradient approach [24], which provides better theoretical bounds on the learning objective for approximating the gradients of our non-decomposable and non-differentiable objective function, so that the policy can be efficiently learned with stochastic gradient descent.

To follow the general reinforcement learning formulation, let r be the immediate reward associated with a state s_t. Since s_t ≈ h_t in our policy, we define the reward as r(h_t) = L_θ(M_{x_{t−1}}, G) − L_θ(M_{x_t}, G), where L_θ is the loss associated with a set of estimated temporal segments as defined in Eq. (2). Intuitively, r(h_t) states that the policy earns an immediate reward equal to the decrease in the temporal segmentation error achieved by selecting an observed frame, or pays a penalty if the temporal segmentation error increases. Let R(H_t) be the discounted accumulated reward starting from the state s_t and continuing the policy up to the final state s_T: R(H_t) = Σ_{t′=t}^{T} τ^{t′−t} r(h_{t′}), where H_t = {h_t, ..., h_T} represents the history of hidden states of the LSTM, and τ ∈ (0, 1) is the discount factor. H_0 can be interpreted as the trajectory of observations for a sample run of the policy from the initial state. For notational simplicity, we use H for H_0 in the rest of this paper.
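In code, this reward bookkeeping amounts to differencing the per-step losses and discounting them (a minimal sketch; losses[t] stands for L_θ(M_{x_t}, G), and the discount value shown is an assumption):

```python
def immediate_rewards(losses):
    """r(h_t) = L(M_{x_{t-1}}, G) - L(M_{x_t}, G) for t = 1..T;
    losses[0] is the loss before any frame is selected."""
    return [losses[t - 1] - losses[t] for t in range(1, len(losses))]

def discounted_return(rewards, t, tau=0.9):
    """R(H_t) = sum over t' >= t of tau^(t'-t) * r(h_{t'})  (rewards are 0-indexed)."""
    return sum(tau ** (k - t) * rewards[k] for k in range(t, len(rewards)))
```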

The goal of policy learning is thus transformed into finding the parameters θ* that maximize J(θ), defined as:

\[
J(\theta) = \mathbb{E}[R(H)] = \int p(H \mid \theta)\, R_\theta(H)\, dH,
\tag{4}
\]

where p(H|θ) is the probability of observing a sequence of hidden states H, given a policy π defined by the parameters θ. It can be shown that maximizing J(θ) implicitly minimizes L_θ along the trajectory of policy executions. We now derive how to compute the gradient with respect to the policy parameters, ∇_θ J, which is given by:

\[
\nabla_\theta J = \int \big[ \nabla_\theta p(H \mid \theta)\, R_\theta(H) + p(H \mid \theta)\, \nabla_\theta R_\theta(H) \big]\, dH.
\tag{5}
\]

Note that given the sequence of hidden states H, which determines the history of selected frames, the reward function does not depend on the policy parameters, yielding ∇_θ R_θ(H) = 0. To further simplify Eq. (5), we need ∇_θ p(H|θ). We first factorize p(H|θ) as p(H|θ) = p(h_0) Π_{t=1}^{T} p(h_t | h_{t−1}) π(ν_t | h_{t−1}, o_t), where the same notation π is used to denote the output of the policy. Based on this, log p(H|θ) = const + Σ_{t=1}^{T} log π(ν_t | h_{t−1}, o_t), where the constant term is the sum over the logs of p(h_t | h_{t−1}), which does not depend on θ. This yields the gradient ∇_θ log p(H|θ) = Σ_{t=1}^{T} ∇_θ log π(ν_t | h_{t−1}, o_t).

It is common to use Monte-Carlo integration to approximate the integral over the probability of observing a sequence of hidden states. Specifically, the approximate gradient is computed by running the current policy on N training videos to generate N trajectories. Combining the aforementioned derivations with Eq. (5), we obtain the approximate gradient:

\[
\nabla_\theta J \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \big[ \nabla_\theta \log \pi(\nu^{n}_{t} \mid h^{n}_{t-1}, o^{n}_{t})\, R_\theta(h^{n}_{t}) \big].
\tag{6}
\]

Since policy gradient methods usually suffer from high variance in the gradient estimates, we follow the common practice in [25] and subtract a bias from the expected reward R. However, rather than using a constant bias, we set the bias value to the reward obtained from a random selection policy.
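Combining Eq. (6) with this baseline gives the following Monte-Carlo estimate. In the sketch, grad_log_pi_t is assumed to be the per-step score ∇_θ log π(ν_t | h_{t−1}, o_t) produced by the framework's autodiff, and baselines[n] is the return obtained by a random frame-selection policy on the same video:

```python
def policy_gradient_estimate(trajectories, baselines):
    """Approximate gradient of J(theta), Eq. (6), with a baseline-subtracted return.

    trajectories : list over videos; each entry is a list of (grad_log_pi_t, R_t) per step
    baselines    : list over videos of the random-selection-policy return
    """
    grad = None
    for traj, b in zip(trajectories, baselines):
        for grad_log_pi_t, R_t in traj:
            term = grad_log_pi_t * (R_t - b)   # REINFORCE term, variance reduced by the baseline
            grad = term if grad is None else grad + term
    return grad / len(trajectories)            # average over the N sampled trajectories
```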


Figure 2: (a) Comparison of the detection accuracy of baselines and variations of our models under different IoU thresholds α on THUMOS14. (b) Comparison of the detection accuracy (green dots) and speed (blue bars) with different policy steps on ActivityNet.

5 Experiments
In this section, we extensively evaluate our approach on two benchmark datasets: THUMOS14 [9] and ActivityNet [7]. Experimental results demonstrate that our approach substantially reduces computation time while providing competitive detection accuracy under varying time budgets. In the supplementary material, we explain how to calculate the detection speeds of different methods, and illustrate the learned policy for frame selection.

We use VGG16 [17] pre-trained on ImageNet as our backbone CNN, and fine-tune the network on each dataset. We take the fc7 layer of VGG16 as the per-frame feature. Our policy is based on a two-layer LSTM, where each layer has 1024 hidden units. If not otherwise specified, our policy takes T = 6 steps, which we believe is efficient enough to meet a reasonable time budget constraint. We empirically set the weights in our loss function of Eq. (2) as λ_c = λ_l = 1.0 and λ_r = 0.5. We set the batch size to 128, and train each model for 100 epochs using Adam with the default parameters. We implement our networks in TensorFlow and perform experiments on a single NVIDIA Tesla P100 GPU.
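For convenience, the stated hyperparameters can be collected in one place (the values are copied from the text; the field names and this grouping are ours):

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    backbone: str = "VGG16, ImageNet pre-trained, fine-tuned per dataset"
    feature_layer: str = "fc7"
    lstm_layers: int = 2
    lstm_hidden_units: int = 1024
    policy_steps: int = 6        # T, unless otherwise specified
    lambda_cls: float = 1.0      # weight on the classification error
    lambda_loc: float = 1.0      # weight on the localization error
    lambda_ret: float = 0.5      # weight on the retrieval (mAP) error
    batch_size: int = 128
    epochs: int = 100
    optimizer: str = "Adam, default parameters"
```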

5.1 Baselines and Variations of Our Model
To provide better insight, we first study different configurations of the baseline methods and our model. We define the following two important baselines; each of them outputs a class label for every single frame (including the background class), followed by non-maximum suppression (NMS) post-processing to aggregate the class probabilities of single frames (a sketch of this NMS step is given at the end of this subsection).
• CNN with NMS: VGG16 is fine-tuned on the single frames of each dataset and used to perform per-frame classification.
• LSTM with NMS: Following the setting of our model, we train an LSTM on top of the fc7 layer of VGG16, and a prediction is made for each single frame.
We can improve our base model mainly in two ways, and define the following variations of our model.
• Ours_bas is the base model defined in Sec. 4.1.
• Ours_dif provides the simple pixel-level frame difference between consecutive frames to roughly capture motion cues. Although we could incorporate optical flow [20], given that optical flow is usually computationally expensive and the motivation of this work is fast inference, we believe that pixel-level frame subtraction is a reasonable compromise. We use early fusion to concatenate the RGB channels of the original frame and the frame difference as a composite input to VGG16.
• Ours_reg performs a simple post-processing step to refine the policy output. A simple linear regressor is trained to refine the boundaries of detected temporal segments by using the current temporal extent of a segment, and the κ uniformly sampled frames along with their pixel-level frame differences within this segment.
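The NMS post-processing referenced in the two baselines above can be a standard greedy 1D suppression over candidate segments; a minimal sketch (the IoU threshold is an assumption):

```python
import numpy as np

def temporal_nms(segments, scores, iou_thr=0.5):
    """Greedy 1D NMS: keep the highest-scoring segments, suppress heavy overlaps."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        if all(iou(segments[idx], segments[k]) < iou_thr for k in keep):
            keep.append(idx)
    return keep  # indices of the retained segments
```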

5.2 Results on THUMOS14

We follow the standard experimental setting on THUMOS14 [9]. We first present an ablation study to understand the contribution of each individual component of our model. As expected and shown in Figure 2(a), providing the additional simple frame difference to capture coarse motion cues improves the accuracy of all baselines and our models. The baseline recurrent models with LSTM produce better results than the CNN. All of our ablation models significantly outperform the baselines, quantifying the contributions of our proposed approach. In particular, our results are generated by observing only 6 policy-selected frames, which is far more efficient than the baselines that have to densely go through all video frames.

Method                      | Time (s) | α=0.3 | α=0.4 | α=0.5 | α=0.6 | α=0.7
LEAR14 [12]                 | > 108    | 28.8  | 21.8  | 15.0  | 8.5   | 3.2
CUHK14 [22]                 | > 108    | 14.6  | 12.1  | 8.5   | 4.7   | 1.5
Pyramid of Scores [30]      | > 108    | 33.6  | 26.1  | 18.8  | -     | -
Fast Temporal Proposals [8] | 108      | -     | -     | 13.5  | -     | -
S-CNN [14]                  | 92       | 36.3  | 28.7  | 19.0  | 10.3  | 5.3
CDC [15]                    | 92       | 40.1  | 29.4  | 23.3  | 13.1  | 7.9
DAPs [4]                    | 41       | -     | -     | 13.9  | -     | -
Language Model [13]         | 17       | 30.0  | 23.2  | 15.2  | -     | -
R-C3D [26]                  | 5.3      | 44.8  | 35.6  | 28.9  | -     | -
Glimpses [29]               | 4.9      | 36.0  | 26.4  | 17.1  | -     | -
Ours                        | 0.35     | 38.4  | 28.9  | 22.4  | 11.4  | 7.0

Table 1: Comparison of our approach and the state-of-the-art methods in the approximate computation time to process each video and the detection accuracy on THUMOS14.

It is interesting to observe in Figure 2(a) that the simple linear regression based post-processing with κ uniformly sampled frames from the estimated segments helps refine the temporal boundaries. We conjecture that this is because our policy is allowed to observe frames in a temporally inconsistent way, i.e., selecting frames in a mixed forward and backward fashion; the LSTM thus tends to smooth out the features to some extent during this process. We hypothesize that observing the κ sampled frames in the simple regression provides a temporally consistent description complementary to the averaged latent representation in the policy, which lacks temporal consistency. We also evaluate the impact of the number of sampled frames κ on the regression. As shown in Figure 2(a), we observe only marginal gains when sampling over 10 frames, which also implies that our policy has already learned to select fairly representative frames to perform the detection.

We then compare with the state-of-the-art methods in Table 1. Our approach achieves competitive detection accuracy under various IoU thresholds and, more importantly, performs detection in only 0.35s for each untrimmed long video. This is orders of magnitude faster than most other competing algorithms relying on sliding windows or segment proposals. While R-C3D produces superior accuracy on this dataset, we significantly outperform R-C3D on ActivityNet (see Table 2), indicating the advantage of our approach in handling more complex activities. We specifically compare the per-class breakdown AP of our model against the glimpses method [29], which also exhibits efficient inference, for each binary detection. As shown in Figure 3(a), our approach largely outperforms [29] in 15 out of 20 classes, and by 5.3% in overall mAP. Notably, our method is a unified model that handles all classes, while [29] is a binary model that needs to be trained for each of the 20 action classes; we therefore need to run their models multiple times to detect all classes.

Method     | Time (s) | α=0.5 | α=0.75 | α=0.95 | ave-mAP
UTS16 [23] | > 930    | 45.1  | 4.1    | 0.0    | 16.4
CDC [15]   | > 930    | 45.3  | 26.0   | 0.2    | 23.8
UPC16 [11] | 11       | 22.5  | -      | -      | -
OBU16 [18] | 10       | 22.7  | 10.8   | 0.3    | 11.3
R-C3D [26] | 3.2      | 26.8  | -      | -      | -
Ours       | 0.35     | 41.9  | 22.6   | 0.1    | 21.5

Table 2: Comparison of our approach and the state-of-the-art methods in the approximate computation time to process each video and the detection accuracy on ActivityNet.

We also provide an in-depth analysis of the computational costs of our approach. Figure 3(b) shows the percentage of time for each major algorithm step and computation component. Our policy is quite efficient to run, taking only 9.4% of the time. The feature extraction, which involves applying VGG16 to multiple frames, dominates the computation and consumes 80.1% of the time, which again highlights the importance of effective frame selection to reduce the computational burden. In addition, computing the frame differences, which provide coarse but useful motion cues, uses only 2.9% of the time.

5.3 Results on ActivityNet
Compared to THUMOS14, ActivityNet [7] contains high-level semantics with complex actions, objects, and scenes, and is much larger in the number of activities and amount of videos. On this large-scale dataset, we first evaluate our model with different policy steps, according to the different time budgets that the detection system can afford.

Figure 3: (a) Comparison of the per-class breakdown AP at α = 0.5 on THUMOS14. (b) Analysis of the computational time of our approach: percentage of time spent on each major algorithm step (left) and computation component (right) to perform detection.


As shown in Figure 2(b), when the number of policy steps increases from 6 to 30 by raising the time budget from 0.35s to 1.59s, the detection accuracy improves by about 2.0% under α = 0.5.

Finally, we compare our approach with the state-of-the-art methods in Table 2. Similar to the results on THUMOS14, our approach substantially reduces the detection time by orders of magnitude compared to other methods. While CDC provides very competitive accuracy, it relies on the detection results of UTS16, i.e., CDC is primarily used to refine the predicted temporal boundaries of UTS16. If CDC is directly applied to raw videos, the accuracy drops to around 15.0% at α = 0.5. UTS16 is sliding window based and requires multiple feature extractions including iDT, C3D, ResNet152, and InceptionV3, which are considerably expensive at inference time. We achieve significant improvement over the most recent state-of-the-art method R-C3D, demonstrating the superiority of our approach in tackling more complex activities. Figure 4 shows example predictions of our model on ActivityNet, including challenging classes that involve large scale changes, large viewpoint variations, crowded backgrounds, etc. Since the glimpses method [29] is a binary model and its detection result on the entire 200 classes of this dataset is not provided, we train our policy on the same two subsets (i.e., sports and work) as [29] for fair comparison. More comparison details are provided in the supplementary material.

6 Conclusion
We have presented a fully end-to-end approach for the challenging problem of efficient temporal activity detection. We formulate budget-aware inference for this problem to optimally select a small subset of frames within the MDP framework. We propose an LSTM based policy model that handles all activity classes with a single training phase. A policy gradient is further developed to approximate the gradient of our non-decomposable and non-differentiable objective. Experiments demonstrate that our approach brings substantial time savings while maintaining competitive detection accuracy. This provides a practical solution for many applications that require tight runtime constraints and limited on-device computation.

Figure 4: Examples of predicted results on ActivityNet. Each row shows four sampled frames within the temporal extent of a detected activity. Faded frames indicate frames outside the detected temporal boundary.


References[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the

Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2017.

[2] S. Cho, S. Cho, and K.-C. Yow. A robust time series prediction model using POMDPand data analysis. Journal of Advances in Information Technology (JAIT), 2017.

[3] M. Egorov, Z. Sunberg, E. Balaban, T. Wheeler, J. Gupta, and M. Kochenderfer.POMDPs.jl: A framework for sequential decision making under uncertainty. Journalof Machine Learning Research (JMLR), 2017.

[4] V. Escorcia, F. Heilbron, J. Niebles, and B. Ghanem. DAPs: Deep action proposals foraction understanding. In European Conference on Computer Vision (ECCV), 2016.

[5] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Conference on ComputerVision and Pattern Recognition (CVPR), 2015.

[6] J. Gu, X. Yang, S. De Mello, and J. Kautz. Dynamic facial analysis: From Bayesianfiltering to recurrent neural network. In IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2017.

[7] F. Heilbron, V. Escorcia, B. Ghanem, and J. Niebles. ActivityNet: A large-scale videobenchmark for human activity understanding. In IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2015.

[8] F. Heilbron, J. Niebles, and B. Ghanem. Fast temporal activity proposals for efficientdetection of human actions in untrimmed videos. In IEEE Conference on ComputerVision and Pattern Recognition (CVPR), 2016.

[9] H. Idrees, A. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. TheTHUMOS challenge on action recognition for videos “in the wild". Computer Visionand Image Understanding (CVIU), 2017.

[10] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector forspatio-temporal action localization. In International Conference on Computer Vision(ICCV), 2017.

[11] A. Montes, A. Salvador, S. Pascual, and X. Giro. Temporal activity detection inuntrimmed videos with recurrent neural networks. In NIPS Workshop, 2016.

[12] D. Oneata, J. Verbeek, and C. Schmid. The LEAR submission at THUMOS 2014. InECCV THUMOS Workshop, 2014.

[13] A. Richard and J. Gall. Temporal action detection using a statistical language model.In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[14] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videosvia multi-stage CNNs. In IEEE Conference on Computer Vision and Pattern Recogni-tion (CVPR), 2016.

Page 12: Budget-Aware Activity Detection with A Recurrent Policy ... · proach [10]. However, these methods are mostly applied on short video snippets, in contrast, temporal activity detection

12 MAHASSENI, YANG, MOLCHANOV, KAUTZ: BUDGET-AWARE ACTIVITY DETECTION

[15] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-C. Chang. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos.In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[16] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recog-nition in videos. In Neural Information Processing Systems (NIPS), 2014.

[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scaleimage recognition. In International Conference on Learning Representations (ICLR),2015.

[18] G. Singh and F. Cuzzolin. Untrimmed video classification for activity detection: Sub-mission to ActivityNet challenge. In CVPR ActivityNet Workshop, 2016.

[19] K. Soomro, H. Idrees, and M. Shah. Predicting the where and what of actors andactions through online action localization. In IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2016.

[20] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyra-mid, warping and cost volume. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2018.

[21] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: Generic features forvideo analysis. In International Conference on Computer Vision (ICCV), 2015.

[22] L. Wang, Y. Qiao, and X. Tang. Action recognition and detection by combining motionand appearance features. In ECCV THUMOS Workshop, 2014.

[23] R. Wang and D. Tao. UTS at ActivityNet 2016. In CVPR ActivityNet Workshop, 2016.

[24] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber. Recurrent policy gradients.Logic Journal of IGPL, 2010.

[25] R. Williams. Simple statistical gradient-following algorithms for connectionist rein-forcement learning. Machine Learning, 1992.

[26] H. Xu, A. Das, and K. Saenko. R-C3D: Region convolutional 3D network for temporalactivity detection. In International Conference on Computer Vision (ICCV), 2017.

[27] X. Yang, P. Molchanov, and J. Kautz. Multilayer and multimodal fusion of deep neuralnetworks for video classification. In ACM Multimedia, 2016.

[28] X. Yang, P. Molchanov, and J. Kautz. Making convolutional networks recurrent forvisual sequence learning. In IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR), 2018.

[29] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of actiondetection from frame glimpses in videos. In IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2016.

[30] J. Yuan, B. Ni, X. Yang, and A. Kassim. Temporal action localization with pyramidof score distribution features. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2016.