DOCTORAL THESIS
Deep Learning for Spatial and Temporal Video Localization
Author: Rui SU
Supervisor: Prof. Dong XU
Co-Supervisor: Dr. Wanli OUYANG
A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy
in the
School of Electrical and Information Engineering
Faculty of Engineering
June 17, 2021
Abstract of thesis entitled
Deep Learning for Spatial and Temporal Video
Localization
Submitted by
Rui SU
for the degree of Doctor of Philosophy
at The University of Sydney
in June 2021
In this thesis, we propose to develop novel deep learning algorithms for video localization tasks, including spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding, which require localizing the spatio-temporal or temporal locations of targets in videos.
First, we propose a new Progressive Cross-stream Cooperation (PCSC) framework for the spatio-temporal action localization task. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). By first using our newly proposed PCSC framework for spatial localization and then applying our temporal PCSC framework for temporal localization, the action localization results are progressively improved.
Second, we propose a Progressive Cross-Granularity Cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion), for the temporal action localization task. The whole framework can be learned in an end-to-end fashion, while the temporal action localization performance can be gradually boosted in a progressive manner.
Finally, we propose a two-step visual-linguistic transformer based framework called STVGBert for the spatio-temporal visual grounding task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. For all our proposed approaches, we conduct extensive experiments on publicly available datasets to demonstrate their effectiveness.
Deep Learning for Spatial and Temporal Video Localization
by
Rui SU
B.E., South China University of Technology; M.Phil., University of Queensland
A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of
Doctor of Philosophy
at
University of Sydney
June 2021
COPYRIGHT © 2021 BY RUI SU
ALL RIGHTS RESERVED
Declaration
I, Rui SU, declare that this thesis titled "Deep Learning for Spatial and Temporal Video Localization", which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work, except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.
Signed
Date: June 17, 2021
For Mama and Papa
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which have helped me become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through useful discussions with them, I worked out many difficult problems and was also encouraged to explore unknowns in the future.
Last but not least, I sincerely thank my parents and my wife from the bottom of my heart for everything.
Rui SU
University of Sydney
June 17, 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou and Wanli Ouyang, "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng and Wanli Ouyang, "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization," IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou and Dong Xu, "Improving action localization by progressive cross-stream cooperation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.
Contents
Abstract
Declaration
Acknowledgements
List of Publications
List of Figures
List of Tables
1 Introduction
  1.1 Spatial-temporal Action Localization
  1.2 Temporal Action Localization
  1.3 Spatio-temporal Visual Grounding
  1.4 Thesis Outline
2 Literature Review
  2.1 Background
    2.1.1 Image Classification
    2.1.2 Object Detection
    2.1.3 Action Recognition
  2.2 Spatio-temporal Action Localization
  2.3 Temporal Action Localization
  2.4 Spatio-temporal Visual Grounding
  2.5 Summary
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
  3.1 Background
    3.1.1 Spatial temporal localization methods
    3.1.2 Two-stream R-CNN
  3.2 Action Detection Model in Spatial Domain
    3.2.1 PCSC Overview
    3.2.2 Cross-stream Region Proposal Cooperation
    3.2.3 Cross-stream Feature Cooperation
    3.2.4 Training Details
  3.3 Temporal Boundary Refinement for Action Tubes
    3.3.1 Actionness Detector (AD)
    3.3.2 Action Segment Detector (ASD)
    3.3.3 Temporal PCSC (T-PCSC)
  3.4 Experimental Results
    3.4.1 Experimental Setup
    3.4.2 Comparison with the State-of-the-art Methods
      Results on the UCF-101-24 Dataset
      Results on the J-HMDB Dataset
    3.4.3 Ablation Study
    3.4.4 Analysis of Our PCSC at Different Stages
    3.4.5 Runtime Analysis
    3.4.6 Qualitative Results
  3.5 Summary
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
  4.1 Background
    4.1.1 Action Recognition
    4.1.2 Spatio-temporal Action Localization
    4.1.3 Temporal Action Localization
  4.2 Our Approach
    4.2.1 Overall Framework
    4.2.2 Progressive Cross-Granularity Cooperation
    4.2.3 Anchor-frame Cooperation Module
    4.2.4 Training Details
    4.2.5 Discussions
  4.3 Experiment
    4.3.1 Datasets and Setup
    4.3.2 Comparison with the State-of-the-art Methods
    4.3.3 Ablation Study
    4.3.4 Qualitative Results
  4.4 Summary
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
  5.1 Background
    5.1.1 Vision-language Modelling
    5.1.2 Visual Grounding in Images/Videos
  5.2 Methodology
    5.2.1 Overview
    5.2.2 Spatial Visual Grounding
    5.2.3 Temporal Boundary Refinement
  5.3 Experiment
    5.3.1 Experiment Setup
    5.3.2 Comparison with the State-of-the-art Methods
    5.3.3 Ablation Study
  5.4 Summary
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Bibliography
List of Figures
1.1 Illustration of the spatio-temporal action localization task.
1.2 Illustration of the temporal action localization task.
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S_{n-2}^RGB and features P_{n-2}^RGB, Q_{n-2}^Flow are denoted as S_0^RGB, P_0^RGB and Q_0^Flow. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset.
4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At Stage 1, the input object tube is generated by our SVG-net.
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset.
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
4.1 Comparison (mAP, %) on the THUMOS14 dataset using different tIoU thresholds.
4.2 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.
4.3 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.
4.4 Comparison (mAPs, %) on the UCF-101-24 dataset using different st-IoU thresholds.
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.
5.1 Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from the original work.
5.2 Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset.
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems to help users analyze the contents of videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects in the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms using deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatial-temporal Action Localization
Recently, deep neural networks have significantly improved performance in various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is untrimmed videos, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. Cluttered backgrounds, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.
For example, region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this chapter to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
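To make the data association idea above concrete, here is a minimal greedy linking sketch in which each candidate box in a frame is scored by its class score plus its spatial IoU with the previously linked box. The function names and the simplified scoring are our own illustration, not the exact procedure of the cited works [18, 54, 60, 67].

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_tube(detections):
    """Greedily link per-frame detections into one action tube.

    detections: list over frames; each entry is a list of (box, class_score)
    pairs. The link score is the class score plus the spatial overlap with
    the previously selected box, as in overlap-based data association.
    """
    tube = []
    for frame_dets in detections:
        if not frame_dets:
            continue  # frames without detections are skipped in this sketch
        if not tube:
            tube.append(max(frame_dets, key=lambda d: d[1]))
        else:
            prev_box = tube[-1][0]
            tube.append(max(frame_dets,
                            key=lambda d: d[1] + iou(prev_box, d[0])))
    return tube
```

Note how a spatially consistent but lower-scoring box can win the link, which is exactly why such linking struggles at temporal boundaries where every frame still yields some detection.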
Based on the first motivation, in this chapter we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another in order to learn better representations by capturing both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
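As a toy illustration of the two cooperation levels in one PCSC stage, the sketch below unions the latest region proposals of the two streams (proposal-level cooperation) and adds ReLU-transformed flow features to the RGB features as a residual (feature-level cooperation). The linear-plus-ReLU message passing form and all names are simplifying assumptions, not the actual network of this chapter.

```python
import numpy as np

def message_passing(target_feat, source_feat, w):
    """Pass information from the source stream into the target stream.

    A lightweight stand-in for the message passing module: the source
    features are linearly transformed, passed through a ReLU, and added
    to the target features as a residual.
    """
    return target_feat + np.maximum(0.0, source_feat @ w)

def pcsc_stage(rgb_props, flow_props, rgb_feat, flow_feat, w, refine_fn):
    """One PCSC stage helping the RGB stream with flow information.

    - proposal-level cooperation: the union of the latest proposals of the
      two streams forms the candidate set for the RGB detection head;
    - feature-level cooperation: flow features are passed into the RGB
      features via message passing before refinement.
    refine_fn stands in for the detection head.
    """
    combined_props = rgb_props + [p for p in flow_props if p not in rgb_props]
    enhanced_feat = message_passing(rgb_feat, flow_feat, w)
    return refine_fn(combined_props, enhanced_feat)
```

Alternating the roles of the two streams over several such stages gives the progressive scheme described above.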
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across action boundaries. Our approach works better than an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
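The role of a class-specific actionness detector can be illustrated with a small sketch: given per-frame actionness scores for a tube's class, the tube is trimmed to the longest run of frames whose score exceeds a threshold. This is only a stand-in for the full refinement method, which also employs an action segment detector and T-PCSC; the names and the longest-run heuristic are our assumptions.

```python
def refine_tube_boundaries(actionness, threshold=0.5):
    """Trim an action tube temporally using per-frame actionness scores.

    actionness: per-frame scores from a class-specific actionness detector.
    Returns the (start, end) frame indices (inclusive) of the longest run
    of frames scoring above the threshold, or None if no frame passes.
    """
    best, run_start, best_len = None, None, 0
    for i, score in enumerate(actionness + [0.0]):  # sentinel closes the last run
        if score > threshold and run_start is None:
            run_start = i
        elif score <= threshold and run_start is not None:
            if i - run_start > best_len:
                best, best_len = (run_start, i - 1), i - run_start
            run_start = None
    return best
```

A general (class-agnostic) detector would use one score sequence for all classes; the class-specific variant described above learns separate classifiers, which better separates the subtle frame changes near each class's boundaries.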
Our contributions are briefly summarized as follows:
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to object detection methods like [58], prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
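The two-stage pipeline above can be sketched as follows: a temporal IoU helper for matching 1D segment proposals against ground truth, and a second stage that classifies each proposal and regresses its boundaries. The function names and the plug-in classifier/regressor are illustrative assumptions, not the interface of the cited methods.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU (tIoU) between two 1D segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def two_stage_localize(proposals, classify_fn, regress_fn, score_thresh=0.5):
    """Second stage of the two-stage pipeline: classify each 1D segment
    proposal and regress its temporal boundaries, keeping confident ones."""
    results = []
    for seg in proposals:
        label, score = classify_fn(seg)
        if score >= score_thresh:
            results.append((regress_fn(seg), label, score))
    return results
```

The tIoU thresholds reported in the experimental chapters (e.g., 0.5) are exactly this quantity computed between a detected segment and the ground-truth segment.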
For the first issue, as shown in Figure 1.2, we observe that proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.
Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. Prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
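The two proposal granularities can be sketched side by side: the anchor-based scheme regresses boundary offsets for pre-defined anchors, while the frame-based scheme pairs high-scoring starting and ending frames predicted from per-frame scores. Both functions are simplified illustrations of the two paradigms, with names and score combination of our own choosing.

```python
def anchor_based_proposals(anchors, offsets):
    """Anchor-based scheme: regress (start, end) offsets for pre-defined
    anchors, yielding coarse but high-recall segment proposals."""
    return [(s + ds, e + de) for (s, e), (ds, de) in zip(anchors, offsets)]

def frame_based_proposals(start_scores, end_scores, thresh=0.5, min_len=1):
    """Frame-based scheme: pair every high-scoring starting frame with
    every later high-scoring ending frame; proposals have accurate
    boundaries but recall depends on the per-frame predictions."""
    starts = [i for i, s in enumerate(start_scores) if s > thresh]
    ends = [i for i, s in enumerate(end_scores) if s > thresh]
    return [(s, e, start_scores[s] * end_scores[e])
            for s in starts for e in ends if e - s >= min_len]
```

If the boundary predictor misses a starting or ending frame, the frame-based list is simply empty, which is the low-recall failure mode the anchor-based proposals compensate for.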
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy will improve the frame-based features and filter out false starting/ending point pairs with invalid durations, and thus enhance the frame-based methods to generate better proposals having higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy will collect complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, it is suboptimal to trivially repeat the aforementioned cross-granularity cooperation process, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with the cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, namely THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
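The feature-level message passing performed inside the AFC module can be illustrated with a minimal NumPy sketch. This is a schematic illustration only, not the actual implementation: the projection `W` stands in for learned parameters, and the fusion is reduced to a simple residual addition.

```python
import numpy as np

def pass_message(frame_feat, anchor_feat, W):
    """Improve the frame-based features with a message computed from the
    anchor-based features, fused in a residual fashion."""
    message = np.maximum(anchor_feat @ W, 0.0)  # ReLU-activated message
    return frame_feat + message

rng = np.random.default_rng(0)
T, D = 100, 256                            # number of snippets, feature dimension
frame_feat = rng.standard_normal((T, D))   # frame-based (fine-granularity) features
anchor_feat = rng.standard_normal((T, D))  # anchor-based (coarse-granularity) features
W = 0.01 * rng.standard_normal((D, D))     # stand-in for a learned projection
improved_frame_feat = pass_message(frame_feat, anchor_feat, W)
```

In the full framework the analogous operation also runs in the opposite direction across streams, so that RGB and flow features iteratively refine each other.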
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of
bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made in this area in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because these methods regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the
STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].
Furthermore, both SVG-net and TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame-based (resp. anchor-based) branch as the input for the anchor-based (resp. frame-based) branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover,
we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters
(weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work sparked broad research interest in using deep learning methods to solve computer vision problems.
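The training procedure described above — forward pass, loss computation, back-propagation, and gradient-descent weight updates with ReLU activations — can be sketched with a toy convolutional classifier in PyTorch. This is an illustrative miniature, not AlexNet itself; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A miniature convolutional classifier with ReLU activations.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),      # 10 output categories
)
criterion = nn.CrossEntropyLoss()    # loss between prediction and ground truth
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

images = torch.randn(4, 3, 32, 32)   # a toy batch of training samples
labels = torch.randint(0, 10, (4,))  # their ground-truth categories

loss_before = criterion(model(images), labels).item()
for _ in range(30):                  # gradient-descent iterations
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                  # back-propagation
    optimizer.step()                 # weight update
loss_after = criterion(model(images), labels).item()
```

After a few iterations the training loss on this toy batch decreases, which is exactly the minimization behaviour described above.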
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers can achieve the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases without reducing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
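The receptive-field arithmetic behind this decomposition can be checked directly: for stride-1 layers, each additional k×k layer grows the effective receptive field by k − 1, while the weight count grows only linearly in the number of layers.

```python
def receptive_field(kernel_sizes):
    """Effective receptive field of a stack of stride-1 convolutional layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def num_weights(kernel_sizes, channels):
    """Weights in the stack, assuming `channels` input and output channels."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two 3x3 layers see as far as one 5x5 layer, with fewer weights (18C^2 vs 25C^2);
# three 3x3 layers match one 7x7 layer (27C^2 vs 49C^2).
rf_two = receptive_field([3, 3])       # same field of view as a single 5x5 layer
rf_three = receptive_field([3, 3, 3])  # same field of view as a single 7x7 layer
```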
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional
neural network is built by stacking such inception modules to achieve high image classification performance.
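A module of this form can be sketched in PyTorch as follows. The branch widths here are arbitrary, and the 1×1 dimension-reduction convolutions of the actual GoogLeNet design are omitted; this is a simplified illustration of the parallel-branches-plus-concatenation idea, not the exact published configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel convolutions with different kernel sizes, plus a pooling branch;
    the branch outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch, branch_ch=16):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

module = InceptionModule(in_ch=32)
out = module(torch.randn(1, 32, 28, 28))  # 4 branches of 16 channels each
```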
VGG-Net and Inception have shown that performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely high depth to improve the image classification performance.
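A residual block with an identity shortcut can be sketched as below (simplified: batch normalization and the projection shortcut for changing channel counts are omitted).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut connection, so gradients
    can flow directly from deep layers back to shallow ones."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the shortcut adds the input back unchanged

block = ResidualBlock(channels=16)
y = block(torch.randn(1, 16, 8, 8))
```

Because the block computes x + F(x) rather than F(x) alone, the gradient of the loss with respect to the input always contains a direct identity term, which is what keeps very deep stacks trainable.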
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, with each dense block having shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
2.1.2 Object Detection
The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates which potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to a high computational cost and long training and inference times.
To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost for detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (RoI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference times are largely reduced.
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image is fed into a convolutional neural network to generate a feature map. A set of pre-defined
anchors are used at each location of the generated feature map to generate region proposals by a separate network. The final detection results are obtained from a detection head by using the features pooled by the RoI pooling operation on the generated feature map.
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the RoI pooling operation and detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S×S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. YOLO is faster than Faster RCNN, as it does not require the time-consuming region feature extraction performed by the RoI pooling operation.
Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios are predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which instead predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
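Decoding object centers from such a heat map amounts to keeping the confident local maxima. The sketch below uses a plain 3×3 local-maximum test on a toy NumPy heat map; it is a schematic simplification, not the authors' exact (max-pooling based) implementation.

```python
import numpy as np

def decode_centers(heatmap, threshold=0.5):
    """Keep points that are confident AND local maxima in their 3x3 window."""
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    centers = []
    for y in range(H):
        for x in range(W):
            window = padded[y:y + 3, x:x + 3]  # 3x3 neighborhood of (y, x)
            if heatmap[y, x] >= threshold and heatmap[y, x] == window.max():
                centers.append((y, x))
    return centers

heatmap = np.zeros((8, 8))
heatmap[2, 3] = 0.9   # one confident object center
heatmap[2, 4] = 0.6   # suppressed: confident, but not a local maximum
heatmap[6, 6] = 0.8   # another object center
centers = decode_centers(heatmap)
```

In the full detector, each surviving center is then paired with the height and width predicted at that same heat-map location to form a box.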
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which takes the RGB frames from the given videos as input and outputs the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
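The simplest way to combine the two streams is late fusion: average the per-stream class probabilities. The sketch below uses hypothetical scores for a three-class toy example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical class scores from the two streams for one video.
rgb_scores = np.array([2.0, 0.5, 0.1])    # spatial (appearance) network
flow_scores = np.array([0.3, 2.5, 0.2])   # temporal (motion) network

# Late fusion: average the per-stream class probabilities.
fused = (softmax(rgb_scores) + softmax(flow_scores)) / 2
predicted_class = int(np.argmax(fused))
```

Here the motion stream is confident enough about class 1 to override the appearance stream's preference for class 0, illustrating how the two modalities can correct each other.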
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They argued that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. Therefore, they divided a video into segments and
randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus was then applied to the output of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from the warped flow and the RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] investigated the effect of 3D convolution on feature extraction for videos. They argued that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
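A single 3×3×3 convolutional layer of this kind can be written in PyTorch as follows; with stride 1 and padding 1 it preserves the temporal and spatial resolutions. The clip dimensions below are illustrative (C3D itself operates on 16-frame clips).

```python
import torch
import torch.nn as nn

# One C3D-style layer: 3x3x3 kernels convolve jointly over time and space.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
features = conv3d(clip)                  # spatio-temporal features, same T, H, W
```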
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame, and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments to perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be incorporated. The proposed framework generates frame-level detections, and the Viterbi algorithm is used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these
two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and the linked action tubes were then temporally trimmed by ensuring label consistency.
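The linking step can be illustrated with a simple greedy sketch that scores each candidate detection by its action class score plus its spatial overlap (IoU) with the previously selected box. This is a schematic simplification of the dynamic-programming linking strategies actually used in these works.

```python
def iou(a, b):
    """Spatial IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_detections(per_frame_dets):
    """Greedily link per-frame detections into one action tube.
    Each detection is a (box, class_score) pair; the next box is chosen to
    maximize its class score plus its overlap with the previous box."""
    tube = [max(per_frame_dets[0], key=lambda d: d[1])]
    for dets in per_frame_dets[1:]:
        prev_box = tube[-1][0]
        tube.append(max(dets, key=lambda d: d[1] + iou(d[0], prev_box)))
    return tube

# Two frames, two detections each: the actor at the top-left moves slightly,
# while a spurious high-scoring detection appears elsewhere in frame 2.
frame1 = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.3)]
frame2 = [((1, 1, 11, 11), 0.5), ((50, 50, 60, 60), 0.6)]
tube = link_detections([frame1, frame2])
```

The overlap term keeps the tube spatially coherent: in frame 2 the lower-scoring but overlapping box is preferred over the distant higher-scoring one.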
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for the two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes that form a tubelet. Specifically, a convolutional neural network is used to extract features for each of these RGB frames, which are stacked to form the representation for the whole segment. The representation is then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than the action tubes linked from frame-level detections, because they are produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP is able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a C3D-based classification network to predict labels. Then they used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as the inputs of this localization model, and proposed a new loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting and ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.
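The thresholding-and-grouping idea behind this first category of methods can be sketched as follows (a simplified illustration, not any specific method's implementation): frames whose actionness score exceeds a threshold are grouped into maximal consecutive runs, and each run becomes a segment proposal.

```python
def group_proposals(actionness, threshold=0.5):
    """Threshold frame-wise actionness scores and group consecutive
    above-threshold frames into temporal segment proposals [start, end)."""
    proposals, start = [], None
    for t, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = t                       # a segment begins here
        elif score < threshold and start is not None:
            proposals.append((start, t))    # the segment just ended
            start = None
    if start is not None:                   # segment runs to the last frame
        proposals.append((start, len(actionness)))
    return proposals

# Toy per-frame actionness scores: two action segments separated by background.
scores = [0.1, 0.7, 0.8, 0.9, 0.2, 0.1, 0.6, 0.7, 0.3]
proposals = group_proposals(scores)
```

This also makes the weakness noted below concrete: a single wrongly low score in the middle of an action would split one ground-truth segment into two proposals, hurting the recall.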
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consists of two parts. In the first part, 3D convolutional layers are used to reduce the spatial and temporal resolutions simultaneously. In the second part, 2D convolutional layers are applied in the spatial domain to reduce
the resolution, while deconvolutional layers are employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework is able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos are first divided into snippets, and an actionness classifier is used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores are grouped to form proposals, which are used to build a feature pyramid to evaluate their action completeness. Their method improves the performance on the temporal action localization task compared to other state-of-the-art methods.
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations and predict the action classes for the anchors.
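The anchor generation and boundary regression used by this second category can be sketched as follows, using a common center/width parameterization borrowed from object detection; the exact parameterization varies across methods, so this is an illustrative form rather than any specific paper's regressor.

```python
import math

def make_anchors(num_units, scales):
    """Multi-scale temporal anchors as (center, width) pairs, one set per unit."""
    return [(t + 0.5, s) for t in range(num_units) for s in scales]

def apply_offsets(anchor, d_center, d_width):
    """Boundary regression: shift the center by a fraction of the anchor width
    and rescale the width exponentially, as in object detection regressors."""
    c, w = anchor
    return (c + d_center * w, w * math.exp(d_width))

anchors = make_anchors(num_units=4, scales=[2, 4])   # 8 anchors in total
# A predicted offset shifts the first anchor's center without changing its width.
refined = apply_offsets(anchors[0], d_center=0.25, d_width=0.0)
```

Because the anchor set densely covers many positions and durations, most ground-truth segments are near some anchor, which explains the high coverage (and the coarser boundaries) discussed below.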
Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the
prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict the offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variations in action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated from pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision-and-language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationship between the target objects and the related objects, while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focused on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector, in their case to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for the detectors to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision-and-language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream, two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from frames of the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message-passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatio-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, the state-of-the-art human detection methods are utilized to obtain precise object proposals, including the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot MultiBox Detector (SSD) in [67, 26].
Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames around one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.
Third, temporal localization aims to form action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, using cues such as the overlap, continuity and smoothness of objects as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score of each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
An overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
[Figure 3.1 appears here.]

Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.
In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset comes from the detection boxes of its own stream (e.g., RGB) at the previous stage.
The second subset comes from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).
Mathematically, let $P_t^{(i)}$ and $B_t^{(i)}$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P_t^{(i)}$ is updated as $P_t^{(i)} = B_{t-2}^{(i)} \cup B_{t-1}^{(j)}$, and the detection box set $B_t^{(j)}$ is updated as $B_t^{(j)} = G(P_t^{(j)})$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P_0^{(i)}$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
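As a rough illustration of this proposal-level cooperation rule, the following sketch (our own illustration, not the thesis implementation; box format, scores and the NMS threshold value are assumptions) merges a stream's previous-stage detection boxes with the assisting stream's latest detection boxes, then removes redundant cross-stream boxes with a greedy NMS at a deliberately low IoU threshold:

```python
# Illustrative sketch of cross-stream region proposal cooperation.
# Boxes are (x1, y1, x2, y2) tuples; each entry pairs a box with a score.

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def cooperate_proposals(own_prev, other_curr, nms_thresh=0.3):
    """Merge own previous-stage detections with the other stream's current
    detections, then greedily suppress near-duplicate boxes (NMS)."""
    merged = sorted(own_prev + other_curr, key=lambda d: -d[1])
    kept = []
    for box, score in merged:
        if all(box_iou(box, k) < nms_thresh for k, _ in kept):
            kept.append((box, score))
    return kept
```

A duplicate box contributed by the other stream is suppressed, while a genuinely new region survives and enriches the proposal set.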
With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from another stream, which helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data at the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples at the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C_2^i$, $C_3^i$, $C_4^i$, $C_5^i$, where $i \in \{\text{RGB}, \text{Flow}\}$ indicates the RGB and flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U_2^i$, $U_3^i$, $U_4^i$, $U_5^i$.
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is
insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they can help each other for feature refinement.
We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote $l$ as the index of a feature map in $\{C_2^i, C_3^i, C_4^i, C_5^i\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C_l^{\text{RGB}}$ with the help of the features $C_l^{\text{Flow}}$ as follows:

$$\hat{C}_l^{\text{RGB}} = f_\theta(C_l^{\text{Flow}}) \oplus C_l^{\text{RGB}}, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C_l^{\text{Flow}})$ nonlinearly extracts the message from the feature $C_l^{\text{Flow}}$, which is then used to improve the features $C_l^{\text{RGB}}$.

The output of $f_\theta(\cdot)$ must have the same number of channels and the same resolution as $C_l^{\text{RGB}}$ and $C_l^{\text{Flow}}$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C_l^{\text{RGB}}$ in the bottom-up pathway are refined, the corresponding feature maps $U_l^{\text{RGB}}$ in the top-down pathway are generated accordingly.
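A minimal NumPy sketch may make this two-layer $1 \times 1$ convolution design concrete. This is our own illustration under stated assumptions (random weights, small channel counts, and the ReLU placed between the two layers), not the thesis's actual parameterization:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def message_passing(c_flow, c_rgb, w1, w2):
    """f_theta(C_flow) (+) C_rgb: the message is extracted from the flow
    features by two stacked 1x1 conv layers (channel reduction, ReLU,
    channel restoration) and added element-wise to the RGB feature map."""
    msg = np.maximum(conv1x1(c_flow, w1), 0.0)  # reduce channels + ReLU
    msg = conv1x1(msg, w2)                      # restore channel dimension
    return msg + c_rgb                          # element-wise addition
```

Note that when the flow features carry no signal (all zeros), the module degenerates to the identity on the RGB features, which matches the residual-style formulation of Eqn. (3.1).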
The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the subsequent message-passing stages.
Denote the image-level feature map sets for the RGB and flow streams by $I^{\text{RGB}}$ and $I^{\text{Flow}}$, respectively. They are used to extract the features $F_t^{\text{RGB}}$ and $F_t^{\text{Flow}}$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F_t^{\text{Flow}}$ of the flow stream is used to help improve the ROI feature $F_t^{\text{RGB}}$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}_t^{\text{RGB}}$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F_t^{\text{RGB}}$ of the RGB stream is used to help improve the ROI feature $F_t^{\text{Flow}}$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of $k$ neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assisting stream during the region-proposal-level cooperation process.
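The label assignment rule above can be sketched in a few lines; this is a hypothetical illustration (helper names and box format are our own, only the 0.5 IoU threshold comes from the text):

```python
# Sketch of IoU-based label assignment for RPN proposals and for the
# extra boxes contributed by the assisting stream.

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def assign_label(proposal, gt_boxes, thresh=0.5):
    """Positive (1) if the proposal's IoU with any ground-truth box
    exceeds the threshold, negative (0) otherwise."""
    return 1 if any(box_iou(proposal, g) > thresh for g in gt_boxes) else 0
```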
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
[Figure 3.2 appears here.]

Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P_t^i$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F_t^i$, where the superscript $i \in \{\text{RGB}, \text{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the $7 \times 7$ feature maps for the detected region on each frame by ROI pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate the actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. It uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module, and encourages the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from all stages are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3 Temporal Boundary Refinement for Action Tubes
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These "confusing" samples are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of actionness belonging to class i.
Chapter 3. Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from the action tube; in this way we refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across the temporal boundary to obtain better action tubes at the testing stage.
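The smoothing-and-trimming step can be sketched as below; a minimal NumPy version with illustrative names, using the window size of 17 and the threshold of 0.01 reported in the implementation details as defaults.

```python
import numpy as np

def smooth_actionness(scores, window=17):
    """Median-filter the per-frame actionness scores of one tube (edge padding)."""
    half = window // 2
    padded = np.pad(np.asarray(scores, dtype=float), half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(scores))])

def trim_action_tube(scores, window=17, threshold=0.01):
    """Return a boolean mask over the tube's frames: frames whose smoothed
    actionness score falls below the threshold are trimmed from the tube."""
    return smooth_actionness(scores, window) >= threshold
```

Applying the mask to the tube's per-frame boxes yields the temporally refined tube.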
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both RGB and optical flow features from the action tubes as the input and follows the Faster-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, the feature extraction and the detection head are introduced below.
(a) SPN. In Faster-RCNN, the anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can dramatically change, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with the scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with the kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
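Assuming two stacked temporal convolutions with kernel size 3 per branch (the exact branch design in [5] may differ), the dilation rate needed for an anchor of scale s to obtain a receptive field covering the anchor plus s/2 context on each side (about 2s in total) can be computed as in this sketch:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of 1-D dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

def pick_dilation(anchor_scale, kernel_size=3, num_layers=2):
    """Smallest common dilation rate for num_layers stacked temporal convs whose
    receptive field covers the anchor plus s/2 context before and after (2s)."""
    target = 2 * anchor_scale
    d = 1
    while receptive_field(kernel_size, [d] * num_layers) < target:
        d += 1
    return d
```

For example, the smallest anchor scale of 6 would need a dilation rate of 3 under these assumptions; larger anchor scales lead to proportionally larger dilation rates, which is why each scale gets its own branch.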
(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with the length ls, the start point tstart and the end point tend, we extract three
types of features: Fcenter, Fstart and Fend. We use temporal ROI-Pooling on top of the segment proposal regions to pool action tube features and obtain the features Fcenter ∈ R^(D×lt), where lt is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features Fstart (resp. Fend) from the segments centered at tstart (resp. tend) with the length of 2ls/5 to represent the starting (resp. ending) information. We then concatenate Fstart, Fcenter and Fend to construct the segment features Fseg.
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with the kernel size of 3 are applied to non-linearly combine temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting the actionness for each frame, which eventually helps refine the boundaries of the segments.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2 (b), for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
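The two-stage procedure can be summarized by the following sketch, where refine stands in for the feature cooperation plus detection-head step on a target stream and nms for the final redundancy removal; both are placeholders for the actual modules, and the names are illustrative.

```python
def t_pcsc(initial_rgb, initial_flow, refine, nms):
    """Two-stage temporal PCSC control flow (a high-level sketch).

    refine(proposals, target) refines the pooled segment proposal set using the
    target stream's features (helped by the other stream's messages); nms
    removes redundant segments from the union of all stage outputs."""
    # Stage 1: proposal cooperation pools both streams' initial segments,
    # then the flow features help refine the RGB stream.
    stage1 = refine(initial_rgb + initial_flow, target="rgb")
    # Stage 2: the stage-1 outputs and the flow stream's segments form the new
    # proposal set, and the RGB features help refine the flow stream.
    stage2 = refine(stage1 + initial_flow, target="flow")
    # Combine the output segments of both stages; NMS gives the final results.
    return nms(stage1 + stage2)
```

Note how each stage's proposal set pools the previous outputs with the other stream's segments, which is what makes the cooperation progressive.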
3.4 Experimental results
We introduce our experimental setup and datasets in Section 3.4.1, compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered as a correct one when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with the interval of 0.05.
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then average, for each class, the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these per-class averaged IoU scores over all classes.
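A minimal sketch of the MABO computation described above; the function names are illustrative, and per-class detections and ground truths are assumed to be given as dictionaries of (x1, y1, x2, y2) boxes.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def mabo(detections_per_class, gts_per_class):
    """For each class, average over ground-truth boxes the best IoU achieved by
    any detection of that class; then average these ABO values over classes."""
    abos = []
    for cls, gts in gts_per_class.items():
        dets = detections_per_class.get(cls, [])
        best = [max((iou(d, g) for d in dets), default=0.0) for g in gts]
        abos.append(np.mean(best) if best else 0.0)
    return float(np.mean(abos))
```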
To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180 and 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320] and [320, ∞), i.e., we have M = 4, and the corresponding feature lengths lt for these length ranges are set to 15, 30, 60 and 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, in which we report the results by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the two streams' bounding boxes. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules, while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.
Method                                    Streams    mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow   35.8
Peng and Schmid [54]                      RGB+Flow   65.7
Ye, Yang and Tian [94]                    RGB+Flow   67.0
Kalogeiton et al. [26]                    RGB+Flow   69.5
Gu et al. [19]                            RGB+Flow   76.3
Faster R-CNN + FPN (RGB) [38]             RGB        73.2
Faster R-CNN + FPN (Flow) [38]            Flow       64.7
Faster R-CNN + FPN [38]                   RGB+Flow   75.5
PCSC (Ours)                               RGB+Flow   79.2
Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is entirely due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.
Method                           Streams         IoU threshold δ: 0.2   0.5    0.75   0.5:0.95
Weinzaepfel et al. [83]          RGB+Flow        46.8                   -      -      -
Zolfaghari et al. [105]          RGB+Flow+Pose   47.6                   -      -      -
Peng and Schmid [54]             RGB+Flow        73.5                   32.1   2.7    7.3
Saha et al. [60]                 RGB+Flow        66.6                   36.4   0.8    14.4
Yang, Gao and Nevatia [93]       RGB+Flow        73.5                   37.8   -      -
Singh et al. [67]                RGB+Flow        73.5                   46.3   15.0   20.4
Ye, Yang and Tian [94]           RGB+Flow        76.2                   -      -      -
Kalogeiton et al. [26]           RGB+Flow        76.5                   49.2   19.7   23.4
Li et al. [31]                   RGB+Flow        77.9                   -      -      -
Gu et al. [19]                   RGB+Flow        -                      59.9   -      -
Faster R-CNN + FPN (RGB) [38]    RGB             77.5                   51.3   14.2   21.9
Faster R-CNN + FPN (Flow) [38]   Flow            72.7                   44.2   7.4    17.2
Faster R-CNN + FPN [38]          RGB+Flow        80.1                   53.2   15.9   23.7
PCSC + AD [69]                   RGB+Flow        84.3                   61.0   23.0   27.8
PCSC + ASD + AD + T-PCSC (Ours)  RGB+Flow        84.7                   63.1   23.6   28.9
per frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary
work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve the mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with the other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.
Method                           Streams    mAP (%)
Peng and Schmid [54]             RGB+Flow   58.5
Ye, Yang and Tian [94]           RGB+Flow   63.2
Kalogeiton et al. [26]           RGB+Flow   65.7
Hou, Chen and Shah [21]          RGB        61.3
Gu et al. [19]                   RGB+Flow   73.3
Sun et al. [72]                  RGB+Flow   77.9
Faster R-CNN + FPN (RGB) [38]    RGB        64.7
Faster R-CNN + FPN (Flow) [38]   Flow       68.4
Faster R-CNN + FPN [38]          RGB+Flow   70.2
PCSC (Ours)                      RGB+Flow   74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.
Method                           Streams         IoU threshold δ: 0.2   0.5    0.75   0.5:0.95
Gkioxari and Malik [18]          RGB+Flow        -                      53.3   -      -
Wang et al. [79]                 RGB+Flow        -                      56.4   -      -
Weinzaepfel et al. [83]          RGB+Flow        63.1                   60.7   -      -
Saha et al. [60]                 RGB+Flow        72.6                   71.5   43.3   40.0
Peng and Schmid [54]             RGB+Flow        74.1                   73.1   -      -
Singh et al. [67]                RGB+Flow        73.8                   72.0   44.5   41.6
Kalogeiton et al. [26]           RGB+Flow        74.2                   73.7   52.1   44.8
Ye, Yang and Tian [94]           RGB+Flow        75.8                   73.8   -      -
Zolfaghari et al. [105]          RGB+Flow+Pose   78.2                   73.4   -      -
Hou, Chen and Shah [21]          RGB             78.4                   76.9   -      -
Gu et al. [19]                   RGB+Flow        -                      78.6   -      -
Sun et al. [72]                  RGB+Flow        -                      80.1   -      -
Faster R-CNN + FPN (RGB) [38]    RGB             74.5                   73.9   54.1   44.2
Faster R-CNN + FPN (Flow) [38]   Flow            77.9                   76.1   56.1   46.3
Faster R-CNN + FPN [38]          RGB+Flow        79.1                   78.5   57.2   47.6
PCSC (Ours)                      RGB+Flow        82.6                   82.2   63.1   52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
Stage   PCSC w/o feature cooperation   PCSC
0       75.5                           78.1
1       76.1                           78.6
2       76.4                           78.8
3       76.7                           79.1
4       76.7                           79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.
Method                       Video mAP (%)
Faster R-CNN + FPN [38]      53.2
PCSC                         56.4
PCSC + AD                    61.0
PCSC + ASD (single branch)   58.4
PCSC + ASD                   60.7
PCSC + ASD + AD              61.9
PCSC + ASD + AD + T-PCSC     63.1
Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
Stage   Video mAP (%)
0       61.9
1       62.8
2       63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and the Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, this improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
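The stage-combination rule can be sketched as follows; a standard greedy NMS over the union of all stages' outputs, as a simplified stand-in for the actual post-processing (names are illustrative).

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over (N, 4) boxes with (N,) scores; returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        ix1 = np.maximum(boxes[i, 0], rest[:, 0])
        iy1 = np.maximum(boxes[i, 1], rest[:, 1])
        ix2 = np.minimum(boxes[i, 2], rest[:, 2])
        iy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.maximum(ix2 - ix1, 0) * np.maximum(iy2 - iy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        ious = inter / (area_i + area_r - inter)
        order = order[1:][ious <= iou_thresh]  # drop boxes overlapping the kept one
    return keep

def combine_stages(stage_outputs, iou_thresh=0.5):
    """Pool the (boxes, scores) of all stages up to the current one, then NMS."""
    boxes = np.concatenate([b for b, _ in stage_outputs])
    scores = np.concatenate([s for _, s in stage_outputs])
    kept = nms(boxes, scores, iou_thresh)
    return boxes[kept], scores[kept]
```

In this sketch a box detected by several stages survives only once, with its highest score, which is the behavior described above.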
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components of our temporal boundary refinement module, in which the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with a single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset.
Stage   LoR (%)
1       61.1
2       71.9
3       95.2
4       95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages, to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.
Stage   MABO (%)
1       84.1
2       84.3
3       84.5
4       84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
θ     Stage 1   Stage 2   Stage 3   Stage 4
0.5   55.3      55.7      56.3      57.0
0.6   62.1      63.2      64.3      65.5
0.7   68.0      68.5      69.4      69.5
0.8   73.9      74.5      74.8      75.1
0.9   78.4      78.4      79.5      80.1
Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
Stage                   0    1    2    3    4
Detection speed (FPS)   31   29   28   26   24
Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
Stage                   0    1    2
Detection speed (VPS)   21   17   15
ASD + AD" method; Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated by the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
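The LoR statistic can be computed as in the following sketch, where proposals are (x1, y1, x2, y2) tuples and the names are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def large_overlap_ratio(rgb_proposals, flow_proposals, thresh=0.7):
    """Fraction of RGB-stream proposals whose maximum IoU (mIoU) with any
    flow-stream proposal exceeds the threshold."""
    count = 0
    for r in rgb_proposals:
        m = max((iou(r, f) for f in flow_proposals), default=0.0)
        if m > thresh:
            count += 1
    return count / float(len(rgb_proposals))
```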
To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method at every stage. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be observed clearly, which demonstrates the effectiveness of our PCSC method for both localization and classification.
Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages, and the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,
and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).
Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module significantly decreases the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases in which our method does not work (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance could be reduced after the temporal refinement process by our TBR module, the instance could not be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections from background frames is still an open issue,
which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4
PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of work, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of work is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance is gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the video analysis community. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] applied convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams to exchange complementary information between the two modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, temporal action localization methods often employ the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on accurately detecting the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as the boundaries are determined at a finer level with a larger aperture of temporal details. However, these segments often suffer from a low recall rate. In the second category, as the proposals are generated from predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of work into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level
interactions between the action semantics extracted from these two complementary lines of work, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach
In this section, we first briefly introduce our overall temporal action localization framework (in Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (in Sec. 4.2.2), as well as the newly proposed Anchor-Frame Cooperation (AFC) module (in Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as V = {I_t ∈ R^{H×W×3}}_{t=1}^{T}, where I_t is an RGB frame at time t with height H and width W. We seek to detect the action categories {y_i}_{i=1}^{N} and their temporal durations {(t_{i,s}, t_{i,e})}_{i=1}^{N} in this untrimmed video, where N is the total number of detected action instances, and t_{i,s} and t_{i,e} indicate the starting and ending time of each detected action y_i.
The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.
Base Feature Extraction. For each video V, we extract its RGB and flow features B^RGB and B^Flow from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is C × T, where C indicates the number of channels of the feature.
Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage, and adopts the same structure as that in TAL-Net [5]. This module is based on the base features B^RGB and B^Flow, and each stream generates a set of temporal segment proposals of interest (called proposals in the later text), i.e., P^RGB = {p_i^RGB | p_i^RGB = (t_{i,s}^RGB, t_{i,e}^RGB)}_{i=1}^{N^RGB} and P^Flow = {p_i^Flow | p_i^Flow = (t_{i,s}^Flow, t_{i,e}^Flow)}_{i=1}^{N^Flow}, where N^RGB and N^Flow are the numbers of proposals for the two streams. We then extract the anchor-based features P_0^RGB and P_0^Flow for each stream by using two temporal convolutional layers on top of B^RGB/B^Flow, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments S_0^RGB and S_0^Flow and the predicted action categories.
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features B^RGB and B^Flow, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed Q_0^RGB and Q_0^Flow, respectively. The frame-based feature in each stream is then processed by two parallel 1×1 convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., Q_0^RGB and Q_0^Flow, respectively, by pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
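The pairing-and-NMS step above can be sketched as follows; the probability threshold, the maximum duration and the NMS IoU threshold here are illustrative assumptions, not the values used in the thesis.

```python
def temporal_iou(a, b):
    # a, b are (start, end) temporal segments.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def pair_proposals(start_prob, end_prob, prob_thresh=0.5, max_dur=64):
    # Pair confident starting/ending indices into (s, e, score) triples
    # with valid durations (s < e, bounded length).
    props = []
    for s, ps in enumerate(start_prob):
        if ps < prob_thresh:
            continue
        for e, pe in enumerate(end_prob):
            if pe >= prob_thresh and s < e <= s + max_dur:
                props.append((s, e, ps * pe))
    return props

def nms(props, iou_thresh=0.7):
    # Greedy NMS: keep the highest-scored segment, drop overlapping ones.
    props = sorted(props, key=lambda p: p[2], reverse=True)
    kept = []
    for p in props:
        if all(temporal_iou(p[:2], k[:2]) < iou_thresh for k in kept):
            kept.append(p)
    return kept
```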
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals from the preceding two proposal generation
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated at the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at stage n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better presentation, when n = 1, the initial proposals S^RGB_{n−2} and features P^RGB_{n−2}, Q^Flow_{n−2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-Granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims to encourage rich feature-level and proposal-level collaboration, simultaneously from the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and from the two-stream appearance and motion clues (i.e., RGB and flow).
To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage n (n is an odd number) employs the preceding two-stream detected action segments S^Flow_{n−1} and S^RGB_{n−2} as the input two-stream proposals, and uses the preceding two-stream anchor-based features P^Flow_{n−1} and P^RGB_{n−2} and the flow-stream frame-based feature Q^Flow_{n−2} as the input features. This AFC stage outputs the RGB-stream action segments S^RGB_n and anchor-based feature P^RGB_n, and also updates the flow-stream frame-based features as Q^Flow_n. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively pass the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages; thus the proposed progressive localization strategy can gradually improve the localization performance.
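The stage-wise wiring described in this subsection can be sketched at a high level as follows; `run_pcg` and the AFC call signature are simplified placeholders of my own (the real modules are neural networks), and the final NMS over the union of segments is omitted.

```python
def run_pcg(afc_rgb, afc_flow, init, num_stages=4):
    # init holds the stage-0 segments S, anchor-based features P and
    # frame-based features Q for both streams; odd stages refine the RGB
    # stream, even stages the flow stream (see Sec. 4.2.2).
    state = {"S": dict(init["S"]), "P": dict(init["P"]), "Q": dict(init["Q"])}
    all_segments = list(state["S"]["RGB"]) + list(state["S"]["Flow"])
    for n in range(1, num_stages + 1):
        main, other = ("RGB", "Flow") if n % 2 == 1 else ("Flow", "RGB")
        afc = afc_rgb if main == "RGB" else afc_flow
        # Inputs: other-stream segments/features from stage n-1 and
        # main-stream segments/features from stage n-2.
        s_new, p_new, q_new = afc(state["S"][other], state["S"][main],
                                  state["P"][other], state["P"][main],
                                  state["Q"][other])
        state["S"][main], state["P"][main] = s_new, p_new
        state["Q"][other] = q_new   # cross-granularity update of the other stream
        all_segments += list(s_new)
    return all_segments  # final results: NMS over this union across stages
```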
4.2.3 Anchor-frame Cooperation Module
The Anchor-Frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods can not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features
P_0^RGB, P_0^Flow and Q_0^RGB, Q_0^Flow, as well as the proposals S_0^RGB, S_0^Flow, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; its network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple structure to propagate the feature-level message from the anchor-based features P^Flow_{n−1} to the frame-based features Q^Flow_{n−2}, both from the flow stream. The message passing operation is defined as

Q^Flow_n = M_{anchor→frame}(P^Flow_{n−1}) ⊕ Q^Flow_{n−2},    (4.1)

where ⊕ is the element-wise addition operation and M_{anchor→frame}(•) is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked 1×1 convolutional layers with ReLU non-linear activation functions. The improved feature Q^Flow_n can generate more accurate action starting and ending probability distributions (a.k.a. more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can therefore generate better frame-based proposals Q^Flow_n, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as Q^RGB_n = M_{anchor→frame}(P^RGB_{n−1}) ⊕ Q^RGB_{n−2}, which flips the superscripts from Flow to RGB. Q^RGB_n likewise leads to better frame-based RGB-stream proposals Q^RGB_n.
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module as an example, as shown in Fig. 4.1(b). To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals Q^Flow_n by applying the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 to the improved frame-based features Q^Flow_n. We then combine Q^Flow_n together with the two-stream anchor-based proposals S^Flow_{n−1} and S^RGB_{n−2} to form an updated set of anchor-based proposals at stage n, i.e., P^RGB_n = S^RGB_{n−2} ∪ S^Flow_{n−1} ∪
Q^Flow_n. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as P^Flow_n = S^Flow_{n−2} ∪ S^RGB_{n−1} ∪ Q^RGB_n.
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to equation (4.1) by a message passing operation, i.e., P^RGB_n = M_{flow→rgb}(P^Flow_{n−1}) ⊕ P^RGB_{n−2} for the RGB stream and P^Flow_n = M_{rgb→flow}(P^RGB_{n−1}) ⊕ P^Flow_{n−2} for the flow stream.
Detection Head. The detection head aims to refine the segment proposals and predict the action category of each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals P^RGB_n, P^Flow_n and the updated anchor-based features P^RGB_n, P^Flow_n. To increase the perceptual aperture at the action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (fixed to 10 in this work), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
4.2.4 Training Details
At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the
boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all the losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
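The summed multi-task objective can be sketched as follows, with numpy stand-ins for the cross-entropy and smooth-L1 terms; any per-term weighting is not specified above and is therefore omitted here.

```python
import numpy as np

def cross_entropy(probs, label):
    # probs: (K,) predicted class distribution; label: ground-truth index.
    return -np.log(probs[label] + 1e-12)

def smooth_l1(pred, target):
    # Standard smooth-L1 (Huber) loss, summed over the boundary offsets.
    d = np.abs(pred - target)
    return np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

def afc_stage_loss(cls_probs, cls_label, reg_pred, reg_target,
                   start_probs, start_label, end_probs, end_label):
    # Anchor-based branch: classification + boundary regression losses;
    # frame-based branch: cross-entropy on the starting/ending positions.
    anchor = cross_entropy(cls_probs, cls_label) + smooth_l1(reg_pred, reg_target)
    frame = cross_entropy(start_probs, start_label) + cross_entropy(end_probs, end_label)
    return anchor + frame
```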
4.2.5 Discussions
Variants of the AFC Module. Besides using the aforementioned AFC module as illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging the anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.
Similarity of Results Between Adjacent Stages. Note that even though
Chapter 4. PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features from gradually absorbed cross-stream information and cross-granularity knowledge, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. However, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.
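The effect of such NMS operations can be illustrated with a minimal greedy 1-D NMS over temporal segments. This is a generic sketch, not the exact implementation used in our detection heads:

```python
import numpy as np

def temporal_nms(segments, scores, tiou_thresh=0.5):
    """Greedy 1-D non-maximum suppression over temporal segments.

    segments: (N, 2) array of (start, end); scores: (N,) confidences.
    Returns the indices of the kept segments, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # temporal intersection and union against the kept segment
        inter = np.maximum(
            0.0,
            np.minimum(segments[rest, 1], segments[i, 1])
            - np.maximum(segments[rest, 0], segments[i, 0]),
        )
        union = (segments[rest, 1] - segments[rest, 0]) \
            + (segments[i, 1] - segments[i, 0]) - inter
        # keep only segments whose tIoU with the kept one is low enough
        order = rest[inter / union <= tiou_thresh]
    return keep
```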
4.3 Experiment
4.3.1 Datasets and Setup
Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with the ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.
- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an
existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract base features. For training, we set the mini-batch size as 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets, and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with the ground-truth instances is larger than a threshold δ and its action label is correctly predicted. To measure the overlapping area, we use temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
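For reference, the tIoU between two temporal segments can be computed as follows (a minimal sketch of the temporal case only):

```python
def tiou(seg_a, seg_b):
    """Temporal IoU between two segments (start, end), in seconds or frames."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```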
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework and late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5,
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

    tIoU thresh δ    0.1    0.2    0.3    0.4    0.5
    SCNN [63]        47.7   43.5   36.3   28.7   19.0
    REINFORCE [95]   48.9   44.0   36.0   26.4   17.1
    CDC [62]         -      -      40.1   29.4   23.3
    SMS [98]         51.0   45.2   36.5   27.8   17.8
    TURN [15]        54.0   50.9   44.1   34.9   25.6
    R-C3D [85]       54.5   51.5   44.8   35.6   28.9
    SSN [103]        66.0   59.4   51.9   41.0   29.8
    BSN [37]         -      -      53.5   45.0   36.9
    MGG [46]         -      -      53.9   46.8   37.4
    BMN [35]         -      -      56.0   47.4   38.8
    TAL-Net [5]      59.8   57.1   53.2   48.5   42.8
    P-GCN [99]       69.5   67.8   63.6   57.8   49.1
    Ours             71.2   68.9   65.1   59.5   51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method can still achieve an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.
ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = 0.5, 0.75, 0.95 on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP as well, which is calculated by averaging all mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves an mAP of 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 6.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and
Table 4.2: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

    tIoU threshold δ   0.5     0.75    0.95   Average
    CDC [62]           45.30   26.00   0.20   23.80
    TCN [9]            36.44   21.15   3.90   -
    R-C3D [85]         26.80   -       -      -
    SSN [103]          39.12   23.48   5.49   23.98
    TAL-Net [5]        38.23   18.30   1.30   20.22
    P-GCN [99]         42.90   28.14   2.47   26.99
    Ours               44.31   29.85   5.47   28.85
Table 4.3: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

    tIoU threshold δ   0.5     0.75    0.95   Average
    BSN [37]           46.45   29.96   8.02   30.03
    P-GCN [99]         48.26   33.16   3.27   31.11
    BMN [35]           50.07   34.78   8.29   33.85
    Ours               52.04   35.92   7.97   34.91
BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using extra external labels as in [35] and proposals from BMN, and its average mAP (34.91%) is better than those of BSN and BMN.
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5, 0.75, as well as the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAP %) on the UCF-101-24 dataset using different st-IoU thresholds.

    st-IoU thresh δ   0.2    0.5    0.75   0.5:0.95
    LTT [83]          46.8   -      -      -
    MRTSR [54]        73.5   32.1   2.7    7.3
    MSAT [60]         66.6   36.4   7.9    14.4
    ROAD [67]         73.5   46.3   15.0   20.4
    ACT [26]          76.5   49.2   19.7   23.4
    TAD [19]          -      59.9   -      -
    PCSC [69]         84.3   61.0   23.0   27.8
    Ours              84.9   63.5   23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performances among five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which Q^Flow_n and Q^RGB_n are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate two-stream features as lengthy feature vectors for both frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P^Flow_{n-1} and P^RGB_{n-2} (for the RGB-stream AFC module) and
Table 4.5: Ablation study (mAP %) among different AFC-related variants at different training stages on the THUMOS14 dataset.

    Stage       0       1       2       3       4
    AFC         44.75   46.92   48.14   48.37   48.22
    w/o CS      44.21   44.96   45.47   45.48   45.37
    w/o CG      43.55   45.95   46.71   46.65   46.67
    w/o CS-MP   44.75   45.34   45.51   45.53   45.49
    w/o CG-MP   44.75   46.32   46.97   46.85   46.79
    w/o PC      42.14   43.21   45.14   45.74   45.77
P^RGB_{n-1} and P^Flow_{n-2} (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". Also, the discussions for the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for fair comparison.
In Table 4.5, the results for stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P^RGB_0 ∪ P^Flow_0 for "w/o CG" and Q^RGB_0 ∪ Q^Flow_0 for "w/o PC". We have
the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

    Stage   Comparison                 LoR (%)
    1       P^RGB_1 vs. P^Flow_0       67.1
    2       P^Flow_2 vs. P^RGB_1       79.2
    3       P^RGB_3 vs. P^Flow_2       89.5
    4       P^Flow_4 vs. P^RGB_3       97.1
when the number of stages increases, and the results become stable after 2 or 3 stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages for our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets P^RGB_1 at Stage 1 and P^Flow_2 at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in P^Flow_2 among all proposals in P^Flow_2. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P^RGB_1 is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P^Flow_2) is calculated by comparing it with all proposals in the other proposal set (say P^RGB_1) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
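The LoR computation described above can be sketched as follows (a simplified reference implementation operating on (start, end) pairs; the function name is illustrative):

```python
def _tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(proposals, preceding, thresh=0.7):
    """LoR: the fraction of `proposals` whose max-tIoU against the preceding
    stage's proposal set `preceding` is higher than `thresh` (0.7 above)."""
    similar = sum(
        1 for p in proposals
        if max(_tiou(p, q) for q in preceding) > thresh
    )
    return similar / len(proposals)
```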
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. This large LoR score at Stage 4 indicates that the proposals in P^Flow_4 and those in P^RGB_3 are very similar even though they are gathered from
different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal
action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated from our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
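The AR@AN metric can be sketched as follows (a simplified implementation assuming the proposals are pre-sorted by confidence; the function name is illustrative):

```python
def average_recall_at_an(gt_segments, proposals, an):
    """AR@AN following the protocol above: recall of the ground-truth segments
    by the top-`an` proposals, averaged over the eleven tIoU thresholds
    0.5, 0.55, ..., 1.0."""
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    props = proposals[:an]  # proposals assumed sorted by confidence
    thresholds = [(50 + 5 * i) / 100 for i in range(11)]  # 0.5 .. 1.0
    recalls = [
        sum(1 for g in gt_segments if any(tiou(g, p) >= t for p in props))
        / len(gt_segments)
        for t in thresholds
    ]
    return sum(recalls) / len(recalls)
```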
In Fig. 4.2, we compare the AR@AN results of the action segments generated from the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated from the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments as used in the "SC" method can lead to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates from our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.
Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. The ground-truth segment spans 97.7s-104.2s, the anchor-based method predicts 95.4s-104.6s, and the frame-based method misses the instance; our PCG-TAL method predicts 96.4s-104.6s, 97.0s-104.5s and 97.3s-104.3s at Stages 1, 2 and 3, respectively. More frames during the period from the 95.4-th second to the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and our
PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment, and eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets demonstrate the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets: VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et
al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology
5.2.1 Overview
We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely, we have V = {V^clip_k}_{k=1}^K, where V^clip_k indicates the k-th video clip consisting of T frames, and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^N, where s_n indicates the n-th word in the description S, and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (i.e., the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.
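As a shape-level illustration, the tube B and its consistency constraints (one 4-d box per frame in [t_s, t_e]) can be represented as follows; `make_tube` is a hypothetical helper, not part of the actual model:

```python
def make_tube(ts, te, boxes):
    """Spatio-temporal tube B = {b_t} for t = ts..te; each b_t is a 4-d box
    (x1, y1, x2, y2) giving the top-left and bottom-right corners."""
    assert te >= ts and len(boxes) == te - ts + 1, "one box per frame in [ts, te]"
    assert all(len(b) == 4 for b in boxes)
    return {"ts": ts, "te": te, "boxes": list(boxes)}
```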
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes b_t for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature F^clip_k ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
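The construction of the clip feature F^clip_k can be sketched at the shape level as follows; `build_clip_feature` is a hypothetical helper for illustration:

```python
import numpy as np

def build_clip_feature(frame_maps):
    """frame_maps: list of T per-frame feature maps, each of shape (H, W, C)
    (e.g. the 4th-residual-block output of the image encoder). Each map is
    reshaped to (HW, C) and the T maps are stacked into the (T, HW, C)
    clip feature."""
    return np.stack([m.reshape(-1, m.shape[-1]) for m in frame_maps], axis=0)
```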
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens, [CLS] and [SEP], before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
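The construction of the complete textual input token sequence can be sketched as follows (word embedding omitted; `build_textual_tokens` is a hypothetical helper):

```python
def build_textual_tokens(words):
    """Wrap the N word tokens with the special [CLS]/[SEP] tokens,
    yielding the N + 2 textual input tokens E = {e_n}."""
    return ["[CLS]"] + list(words) + ["[SEP]"]
```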
Multi-modal Feature Learning. Given the visual input feature F^clip_k and the textual input tokens E, we develop an improved ViLBERT module called ST-ViLBERT to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules of the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as the key and value in
Multi-head AttentionQuery Key Value
Add amp Norm
Multi-head AttentionQueryKeyValue
Add amp Norm
Feed Forward
Add amp Norm
Feed Forward
Add amp Norm
Visual Input (Output from the Last Layer)
Visual Output for the Current Layer
Textual Input (Output from the Last Layer)
Textual Output for the Current Layer
Visual Branch Textual Branch
Spatial Pooling Temporal Pooling
Combination
Multi-head Attention
Query Key ValueTextual Input
Combined Visual Feature
Decomposition
Replication Replication
Intermediate Visual Feature
Visual Input
Norm
(a) One Co-attention Layer in ViLBERT
(b) Our Spatio-temporal Combination and Decomposition Module
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch together with the textual input (i.e., the textual output from the last layer) to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature F_tv ∈ R^(T×HW×C) and the visual-guided textual feature, respectively.
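At the shape level, the combination and decomposition steps of the STCD module can be sketched as follows (a simplified NumPy sketch; the multi-head attention between the combined feature and the text is stubbed out, and all function names are illustrative):

```python
import numpy as np

def stcd_combine(visual):
    """Build the combined visual feature of size (T + HW) x C from the
    input visual feature of size T x HW x C."""
    temporal = visual.mean(axis=1)   # spatial average pooling  -> (T, C)
    spatial = visual.mean(axis=0)    # temporal average pooling -> (HW, C)
    return np.concatenate([temporal, spatial], axis=0)  # (T + HW, C)

def stcd_decompose(text_guided, visual):
    """Decompose the text-guided feature of size (T + HW) x C, replicate
    the two parts, add them to the input, and apply layer normalization."""
    T, HW, C = visual.shape
    temporal, spatial = text_guided[:T], text_guided[T:]  # (T, C), (HW, C)
    rep_t = np.repeat(temporal[:, None, :], HW, axis=1)   # replicated HW times
    rep_s = np.repeat(spatial[None, :, :], T, axis=0)     # replicated T times
    out = visual + rep_t + rep_s                          # added up
    mean = out.mean(axis=-1, keepdims=True)               # normalize channels
    std = out.std(axis=-1, keepdims=True)
    return (out - mean) / (std + 1e-6)                    # (T, HW, C)
```

Here the `text_guided` argument stands in for the output of the multi-head attention block that consumes the combined visual feature.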
Spatial Localization. Given the text-guided visual feature F_tv, our SVG-net predicts the bounding box b_t at each frame. We first reshape F_tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap Â ∈ R^(8H×8W), where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
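The center-based box decoding described above can be sketched as follows (a NumPy sketch under the stated assumptions: `heatmap` holds the per-position center probabilities and `wh` the per-position width/height regressions; both names are illustrative):

```python
import numpy as np

def decode_box(heatmap, wh):
    """Select the most confident center and decode the box corners.

    heatmap: (8H, 8W) array of center probabilities.
    wh:      (8H, 8W, 2) array holding the regressed (width, height).
    Returns the (x1, y1, x2, y2) corners of the predicted box.
    """
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = wh[cy, cx]
    # top-left and bottom-right corners around the selected center
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```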
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending
frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F_tv to produce the global text-guided visual feature F_gtv ∈ R^(T×C) for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from F_gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score ô_obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t_s^0, t_e^0). The initial temporal boundaries (t_s^0, t_e^0) and the predicted bounding boxes b_t form the initial tube prediction result B_init = {b_t}_{t=t_s^0}^{t_e^0}.
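The temporal-boundary selection described above can be sketched as follows (a NumPy sketch; the window size, threshold, and score values are illustrative):

```python
import numpy as np

def temporal_boundaries(scores, win=9, thresh=0.3):
    """Smooth per-frame relevance scores with a median filter, threshold
    them, and return the (start, end) pair of the longest surviving run."""
    pad = win // 2
    padded = np.pad(scores, pad, mode='edge')
    smoothed = np.array([np.median(padded[i:i + win])
                         for i in range(len(scores))])
    keep = smoothed >= thresh
    best_len, cur_start, best_seg = 0, None, None
    # sentinel False closes a run that reaches the last frame
    for i, k in enumerate(np.append(keep, False)):
        if k and cur_start is None:
            cur_start = i                      # a run of kept frames begins
        elif not k and cur_start is not None:
            if i - cur_start > best_len:       # keep the longest run
                best_len, best_seg = i - cur_start, (cur_start, i - 1)
            cur_start = None
    return best_seg
```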
Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as (x̄, ȳ), w, h and o, respectively. When the frame contains a ground-truth bounding box, we have o = 1; otherwise, we have o = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap A for each frame by using the Gaussian kernel
a_xy = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where a_xy is the value of A ∈ R^(8H×8W) at the spatial location (x, y), and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:
L_SVG = λ1 L_c(Â, A) + λ2 L_rel(ô_obj, o) + λ3 L_size((ŵ_xy, ĥ_xy), (w, h))    (5.1)
where Â ∈ R^(8H×8W) is the predicted heatmap, ô_obj is the predicted relevance score, and ŵ_xy and ĥ_xy are the predicted width and height of the bounding box centered at the position (x̄, ȳ). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ1 = λ2 = 1 and λ3 = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
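The per-frame objective L_SVG can be sketched as follows (a NumPy sketch; the penalty-reduced focal-loss form follows CenterNet-style detectors, and all variable names are illustrative assumptions):

```python
import numpy as np

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over a center heatmap (a sketch)."""
    pos = gt == 1.0
    loss_pos = -((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    loss_neg = -((1 - gt[~pos]) ** beta * pred[~pos] ** alpha
                 * np.log(1 - pred[~pos] + eps)).sum()
    return (loss_pos + loss_neg) / max(pos.sum(), 1)

def l_svg(heat_pred, heat_gt, rel_pred, rel_gt, size_pred, size_gt,
          lam1=1.0, lam2=1.0, lam3=0.1):
    """Weighted sum of the focal, cross-entropy and L1 losses for one frame."""
    l_c = focal_loss(heat_pred, heat_gt)                       # center loss
    l_rel = -(rel_gt * np.log(rel_pred + 1e-6)                 # relevance loss
              + (1 - rel_gt) * np.log(1 - rel_pred + 1e-6))
    l_size = np.abs(np.asarray(size_pred)                      # L1 size loss
                    - np.asarray(size_gt)).sum()
    return lam1 * l_c + lam2 * l_rel + lam3 * l_size
```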
5.2.3 Temporal Boundary Refinement
The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. In contrast, while the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.
As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., B_init) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
t_s and t_e. The frame-based branch then continues to refine the boundaries t_s and t_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.
Given an anchor with the temporal boundaries (t_s^0, t_e^0) and the length l = t_e^0 − t_s^0, we temporally extend the anchor before and after the starting and the ending frames by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t_s^0 − l/4, t_e^0 + l/4] and generate the anchor feature F by stacking the corresponding features from the pooled video feature F_pool at the 20 sampled frames along the temporal dimension, where the pooled video feature F_pool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature F, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t_s^0 and t_e^0 and generate the new starting and ending frame positions t_s^1 and t_e^1.
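The anchor extension and sampling step can be sketched as follows (a NumPy sketch; frame indices are treated as continuous values and rounded, which is an assumption of this illustration):

```python
import numpy as np

def sample_anchor_frames(t_s, t_e, n=20):
    """Extend the anchor (t_s, t_e) by a quarter of its length on both
    sides and uniformly sample n frame indices within the extension."""
    l = t_e - t_s
    lo, hi = t_s - l / 4.0, t_e + l / 4.0
    return np.linspace(lo, hi, n).round().astype(int)
```

The returned indices are then used to gather rows of the pooled video feature F_pool and stack them into the anchor feature F.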
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t_s^1 generated from the previous anchor-based branch, we define the starting duration as D_s = [t_s^1 − D/2 + 1, t_s^1 + D/2], where D is the length of the starting duration. We then construct the frame-level feature F_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F_pool. We then take the frame-level feature F_s and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t_s^1. Accordingly, we can produce the ending frame position t_e^1. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t_s^1 and t_e^1 are generated, they can be used as the input for the anchor-based branch at the next stage.
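The frame-level selection step can be sketched as follows (a NumPy sketch; `probs` stands for the D per-frame starting probabilities produced by the 1D convolution, and the names are illustrative):

```python
import numpy as np

def select_boundary(probs, t_center, D):
    """Pick the most probable frame inside the duration
    [t_center - D/2 + 1, t_center + D/2]; probs has D entries."""
    frames = np.arange(t_center - D // 2 + 1, t_center + D // 2 + 1)
    return int(frames[np.argmax(probs)])
```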
Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

L_TBR = λ4 L_anc(Δt̂, Δt) + λ5 L_s(p̂_s, p_s) + λ6 L_e(p̂_e, p_e)    (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and the ending frames, respectively. In L_anc, Δt̂ and Δt are the predicted offsets and the ground-truth offsets, respectively. In L_s, p̂_s is the predicted probability of being the starting frame for the frames within the starting duration, while p_s is the ground-truth label for starting frame position prediction. Similarly, we use p̂_e and p_e for the loss L_e. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.
5.3 Experiment
5.3.1 Experiment Setup
Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.
- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p and the temporal length of each clip K are set to 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and S_U is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
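The vIoU criterion can be sketched as follows (a sketch assuming each tube is represented as a dict mapping frame index to a box; these names are illustrative):

```python
def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def v_iou(pred, gt):
    """pred/gt: dicts mapping frame index -> box.
    Computes vIoU = (1 / |S_U|) * sum over t in S_I of r_t."""
    s_i = set(pred) & set(gt)   # intersected frames
    s_u = set(pred) | set(gt)   # union of frames
    return sum(box_iou(pred[t], gt[t]) for t in s_i) / max(len(s_u), 1)
```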
Baseline Methods
- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods
We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed Temporal Boundary Refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study
In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.
Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.
It is also interesting to observe that the performance improvement of our SVG-net over our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from the original work.

                           Declarative Sentence Grounding          Interrogative Sentence Grounding
Method                     m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)     m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*               19.75      25.77        14.60           18.32      21.10        12.83
Vanilla (ours)             21.49      27.69        15.77           20.22      23.17        13.82
SVG-net (ours)             22.87      28.31        16.31           21.57      24.09        14.33
SVG-net + TBR-net (ours)   25.21      30.99        18.21           23.67      26.26        16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                     m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGVT [76]*                18.15      26.81        9.48
Vanilla (ours)             17.94      25.59        8.75
SVG-net (ours)             19.45      27.78        10.12
SVG-net + TBR-net (ours)   22.41      30.31        12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.
From the results, we have the following observations: (1) for the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages; (2) the performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes; (3) the results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3; (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                                Declarative Sentence Grounding    Interrogative Sentence Grounding
Methods             Stages     m_vIoU  vIoU@0.3  vIoU@0.5         m_vIoU  vIoU@0.3  vIoU@0.5
SVG-net             -          22.87   28.31     16.31            21.57   24.09     14.33
SVG-net + IFB-net   1          23.41   28.84     16.69            22.12   24.48     14.57
                    2          23.82   28.97     16.85            22.55   24.69     14.71
                    3          23.77   28.91     16.87            22.61   24.72     14.79
SVG-net + IAB-net   1          23.27   29.57     16.39            21.78   24.93     14.31
                    2          23.74   29.98     16.52            22.21   25.42     14.59
                    3          23.85   30.15     16.57            22.43   25.71     14.47
SVG-net + TBR-net   1          24.32   29.92     17.22            22.69   25.37     15.29
                    2          24.95   30.75     17.02            23.34   26.03     15.83
                    3          25.21   30.99     18.21            23.67   26.26     16.01
our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary
In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and the learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization, namely spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our works and discuss the potential directions of video localization in the future.
6.1 Conclusions
The contributions of this thesis are summarized as follows:
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work
A potential direction of video localization in the future is unifying the spatial video localization and the temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results with two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.
Bibliography
[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.

[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT' 98. Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.

[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.

[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.

[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.

[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.

[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.
[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.

[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.

[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.

[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.

[13] J. Gao, K. Chen, and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.

[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.

[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.

[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[18] G. Gkioxari and J. Malik. "Finding action tubes". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In: The IEEE International Conference on Computer Vision (ICCV). 2017.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.
[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.
[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.
[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.
[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.
[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).
[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.
[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In: AAAI. 2020, pp. 11336–11344.
[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In: arXiv preprint (2019).
[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.
[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.
[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.
[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.
[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In: International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In: The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "FishNet: A versatile backbone for image, region and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. van den Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
Abstract of thesis entitled
Deep Learning for Spatial and Temporal Video
Localization
Submitted by
Rui SU
for the degree of Doctor of Philosophy
at The University of Sydney
in June 2021
In this thesis, we propose to develop novel deep learning algorithms for the video localization tasks, including spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding, which require localizing the spatio-temporal or temporal locations of targets in videos.
First, we propose a new Progressive Cross-stream Cooperation (PCSC) framework for the spatio-temporal action localization task. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). By first using our newly proposed PCSC framework for spatial localization and then applying our temporal PCSC framework for temporal localization, the action localization results are progressively improved.
Second, we propose a Progressive Cross-granularity Cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion), for the temporal action localization task. The whole framework can be learned in an end-to-end fashion, while the temporal action localization performance can be gradually boosted in a progressive manner.
Finally, we propose a two-step visual-linguistic transformer based framework called STVGBert for the spatio-temporal visual grounding task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. For all our proposed approaches, we conduct extensive experiments on publicly available datasets to demonstrate their effectiveness.
Deep Learning for Spatial and Temporal Video Localization
by
Rui SU
B.E., South China University of Technology; M.Phil., University of Queensland
A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of
Doctor of Philosophy
at
University of Sydney
June 2021
COPYRIGHT © 2021 BY RUI SU
ALL RIGHTS RESERVED
Declaration
I, Rui Su, declare that this thesis titled "Deep Learning for Spatial and Temporal Video Localization", which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work, except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.
Signed
Date: June 17, 2021
For Mama and Papa
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high quality research papers, which help me to become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspirations in our collaborated works. Through the useful discussions with them, I worked out many difficult problems and am also encouraged to explore unknowns in the future.
Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.
Rui SU
University of Sydney
June 17, 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv
1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11
2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102
6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104
Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect the bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect the action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S_{n-2}^{RGB} and features P_{n-2}^{RGB}, Q_{n-2}^{Flow} are denoted as S_0^{RGB}, P_0^{RGB}, and Q_0^{Flow}. The initial proposals and features from anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals together with the improved anchor-based features are used as the input to "Detection Head" to generate the updated segments. 66
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset 80
4.3 Qualitative comparisons of the results from two baseline methods (i.e. the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period from the 95.4-th second to the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment 81
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes 86
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)) 89
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net 93
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5 48
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector 49
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5 50
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds 51
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset 52
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector while "ASD" stands for the action segment detector 52
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset 52
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset 54
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset 55
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset 55
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages 55
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages 55
4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds 74
4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used 75
4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied 75
4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds 76
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset 77
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset 78
5.1 Results of different methods on the VidSTG dataset (the marked results are quoted from the original works) 99
5.2 Results of different methods on the HC-STVG dataset (the marked results are quoted from the original works) 100
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset 101
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the content of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e. every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims only to capture the starting and ending time points of action instances within untrimmed videos, without considering the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects in the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
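The outputs of the three tasks defined above can be summarized with simple data structures (an illustrative sketch in Python; the class and field names are ours, not notation used in this thesis):

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in one frame

@dataclass
class ActionTube:
    """Spatio-temporal action localization output: per-frame bounding
    boxes linked over time, plus the action label."""
    label: str
    boxes: List[Tuple[int, BBox]]  # (frame_index, bounding box)

@dataclass
class TemporalSegment:
    """Temporal action localization output: no spatial boxes, only the
    start/end times of an action instance."""
    label: str
    start_sec: float
    end_sec: float

@dataclass
class GroundingTube:
    """Spatio-temporal visual grounding output: a tube for the object
    described by a textual query."""
    query: str
    boxes: List[Tuple[int, BBox]]

seg = TemporalSegment("Golf Swing", 21.0, 31.5)
print(seg.end_sec - seg.start_sec)  # duration in seconds → 10.5
```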
Figure 1.1 Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatio-temporal Action Localization
Recently, deep neural networks have significantly improved performance in various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video while the output is the corresponding action tubes (i.e. sets of bounding boxes). Due to its wide spectrum of applications, this task has attracted increasing research interest. Cluttered backgrounds, occlusion, and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either the appearance or the motion clue alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each
other
For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g. toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this chapter to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is considerable room for improvement over the existing temporal refinement methods, which is our second motivation.
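Such spatial-overlap-and-score data association can be sketched as a greedy frame-by-frame linking procedure (our own simplified illustration, not the exact linking algorithm used in the works cited above; the scoring rule `class_score + lam * IoU` is hypothetical):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tube(frame_dets, lam=1.0):
    """Greedily link per-frame detections into one action tube.
    frame_dets: list over frames; each entry is a list of
    (box, class_score) detections. At each frame we keep the box
    maximizing class_score + lam * IoU with the previous box."""
    tube = [max(frame_dets[0], key=lambda d: d[1])]
    for dets in frame_dets[1:]:
        prev_box = tube[-1][0]
        tube.append(max(dets, key=lambda d: d[1] + lam * iou(prev_box, d[0])))
    return [d[0] for d in tube]

# toy example: linking keeps the spatially consistent track even though
# a distant box has a slightly higher class score in the second frame
frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.3)],
    [((1, 0, 11, 10), 0.8), ((50, 50, 60, 60), 0.85)],
]
print(link_tube(frames))  # → [(0, 0, 10, 10), (1, 0, 11, 10)]
```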
Based on the first motivation, in this chapter we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region
proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another in order to learn better representations by capturing both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
Our contributions are briefly summarized as follows
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method to learn
class-specific actionness detectors and an action segment detector, which applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e. the anchor-based and frame-based methods) are often complementary to each other
Figure 1.2 Illustration of the temporal action localization task. The starting and ending frames of the actions within the input videos can be localized by anchor-based or frame-based temporal action localization methods.
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. However, we argue that the features from the anchor-based methods and the frame-based methods represent different information about the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes can be considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
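As a concrete sketch of the frame-based paradigm, the following toy routine (our own illustration; the threshold, minimum length and scoring are hypothetical, not the method of [37, 13, 46]) groups per-frame actionness scores into candidate segment proposals:

```python
def actionness_to_proposals(scores, threshold=0.5, min_len=3):
    """Group contiguous frames whose actionness score exceeds a
    threshold into candidate temporal segment proposals.
    Returns a list of (start_frame, end_frame_inclusive, mean_score)."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                              # a segment opens
        elif s < threshold and start is not None:
            if t - start >= min_len:               # drop very short runs
                seg = scores[start:t]
                proposals.append((start, t - 1, round(sum(seg) / len(seg), 3)))
            start = None
    if start is not None and len(scores) - start >= min_len:
        seg = scores[start:]
        proposals.append((start, len(scores) - 1, round(sum(seg) / len(seg), 3)))
    return proposals

# toy example: one clear action burst in the middle of the video
scores = [0.1, 0.2, 0.7, 0.8, 0.9, 0.8, 0.2, 0.1]
print(actionness_to_proposals(scores))  # → [(2, 5, 0.8)]
```

The boundaries follow the per-frame scores exactly, which is why such methods tend to be precise at the boundaries but miss instances whose scores never clear the threshold, i.e. the low-recall behavior described above.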
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and
12 Temporal Action Localization 7
segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e. RGB/flow) to improve the features from another stream (i.e. flow/RGB) for the anchor-based method.
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework, named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module, which improves the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also updates the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based
segment proposal generation pipeline, in which the RGB-stream and flow-stream AFC modules are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3, and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g. image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e. a sequence of
bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance relies heavily on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the proposal pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate, because these methods regress the ground-truth boundaries by using the features extracted from the anchors (i.e. the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework, called STVGBert, for the
STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].
Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes of reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
Our contributions can be summarized as follows
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover,
we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL, Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e. anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks, THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert, A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modal representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network, named AlexNet, for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function, named the Rectified Linear Unit (ReLU), was introduced and used as the activation function to address this issue. However, the large number of parameters
(weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work sparked research interest in using deep learning methods to solve computer vision problems.
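The training loop described above (compute a loss, then update the weights against the gradient) can be illustrated on a toy one-parameter model (purely illustrative; the learning rate and data are made up, and a real network back-propagates through many layers):

```python
def train_step(w, x, y, lr=0.1):
    """One gradient-descent step for a 1-D linear model y_hat = w * x
    with squared-error loss L = (w * x - y) ** 2."""
    y_hat = w * x
    grad = 2.0 * (y_hat - y) * x   # dL/dw via the chain rule
    return w - lr * grad           # move against the gradient

# fit w so that w * 2.0 ≈ 6.0 (true w = 3.0)
w = 0.0
for _ in range(50):
    w = train_step(w, x=2.0, y=6.0)
print(round(w, 4))  # converges toward 3.0
```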
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network, named VGG-Net, to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
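The receptive-field and parameter trade-off described above can be checked with a few lines of arithmetic (a simple sketch; `c` denotes the number of input and output channels, stride 1 is assumed, and biases are ignored):

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of `num_layers` stacked convolutions
    with stride 1: each extra layer adds (k - 1) to the field."""
    return 1 + num_layers * (kernel_size - 1)

def conv_params(kernel_size, channels):
    """Weights of one conv layer with `channels` input and output
    channels (biases ignored): k * k * c * c."""
    return kernel_size * kernel_size * channels * channels

# two stacked 3x3 layers cover the same field as one 5x5 layer ...
assert stacked_receptive_field(3, 2) == stacked_receptive_field(5, 1) == 5
# ... and three stacked 3x3 layers match one 7x7 layer
assert stacked_receptive_field(3, 3) == stacked_receptive_field(7, 1) == 7

# ... with fewer parameters: 2 * 9 * c^2 < 25 * c^2 and 3 * 9 * c^2 < 49 * c^2
c = 64
print(2 * conv_params(3, c), conv_params(5, c))  # → 73728 102400
print(3 * conv_params(3, c), conv_params(7, c))  # → 110592 200704
```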
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network, named Inception, for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which feeds the output of the previous inception module into a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional
neural network is built by stacking such inception modules to achieve high image classification performance.
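The key mechanism of the inception module is the channel-wise concatenation of its parallel branches. The following shape-level NumPy sketch (our own illustration, with placeholder branches instead of real convolutions; the branch widths follow one of the published GoogLeNet modules) shows how the branch outputs combine:

```python
import numpy as np

# Shape-level sketch of an inception module: each parallel branch keeps
# the spatial size (via padding) and produces its own number of output
# channels; the branch outputs are concatenated along the channel axis.
def branch_output(x, out_channels):
    n, _, h, w = x.shape
    return np.zeros((n, out_channels, h, w))  # placeholder for a real conv branch

x = np.random.rand(1, 192, 28, 28)  # input feature map
branch_widths = (64, 128, 32, 32)   # 1x1, 3x3, 5x5, pool-projection branches
y = np.concatenate([branch_output(x, c) for c in branch_widths], axis=1)
print(y.shape)  # (1, 256, 28, 28)
```

The spatial resolution is preserved, so the only effect of the module on tensor shape is the summed channel count of its branches.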
VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply stacking more convolutional layers to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address the gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
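The shortcut connection can be sketched in a few lines. The toy block below is a fully-connected stand-in for the convolutional residual blocks of [20]; it shows how the input is added back to the transformed output, so the identity path lets gradients bypass the stacked layers:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy fully-connected residual block: the shortcut adds the input
    back to the transformed output, so gradients can flow through the
    addition without passing through the stacked layers."""
    out = relu(x @ w1)
    out = out @ w2
    return relu(out + x)  # shortcut (identity) connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
w1 = rng.standard_normal((64, 64)) * 0.01
w2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, w1, w2)
print(y.shape)  # (4, 64)
```

Because the block learns only the residual F(x) = y − x, pushing F toward zero recovers the identity mapping, which is what makes very deep stacks trainable.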
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections with the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
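Under this dense connectivity pattern, the input width of each layer grows linearly with depth, which is why the network becomes wider as it goes deeper. A small illustrative calculation (the growth rate of 32 is a typical setting, chosen here purely for illustration):

```python
def dense_block_widths(in_channels, growth_rate, num_layers):
    """Input width seen by each layer of a dense block: every layer
    appends `growth_rate` new channels to the running concatenation."""
    widths, c = [], in_channels
    for _ in range(num_layers):
        widths.append(c)
        c += growth_rate
    return widths, c

widths, out_channels = dense_block_widths(64, 32, 4)
print(widths)        # [64, 96, 128, 160]
print(out_channels)  # 192
```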
2.1.2 Object Detection
The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head, to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network separately, which leads to a high computational cost and long training and inference times.
To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed the entire image to the network only once to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference times are largely reduced.
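The ROI pooling operation at the heart of Fast RCNN can be sketched as follows: the region of the shared feature map covered by a proposal is divided into a fixed grid and max-pooled per cell, so proposals of any size yield fixed-size features. The NumPy sketch below is our own simplified single-image version (integer box coordinates, ignoring quantization subtleties):

```python
import numpy as np

def roi_pool(feature_map, box, output_size=2):
    """Max-pool the region of a (C, H, W) feature map given by `box`
    (x1, y1, x2, y2, in feature-map coordinates) into a fixed
    output_size x output_size grid, one channel at a time."""
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    out = np.zeros((c, output_size, output_size))
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)
pooled = roi_pool(fmap, box=(1, 1, 7, 7), output_size=2)
print(pooled.shape)  # (1, 2, 2)
```

Every proposal, regardless of its size, is mapped to the same (C, 2, 2) tensor, which is what allows a single fully connected detection head to process all proposals.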
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of the object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined
anchors were used at each location of the generated feature map to generate region proposals via a separate network. The final detection results were obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
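The pre-defined anchors can be sketched as follows: at each feature-map location, boxes of several scales and aspect ratios are placed, and the RPN scores and regresses each of them. The helper below is an illustrative version (the stride, scales, and ratios are example values, not the exact settings of [58]):

```python
# Illustrative anchor generation: at every cell of a feature map with
# stride s relative to the input image, place boxes of several scales
# and aspect ratios (here ratio = width / height), centred on the cell.
def generate_anchors(fm_h, fm_w, stride, scales, ratios):
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(4, 4, stride=16,
                           scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 4 * 4 * 3 * 3 = 144
```

Anchors may extend beyond the image border (as the square 64-pixel anchor at the first cell does here); in practice such anchors are clipped or ignored during training.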
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and the ROI pooling operation and a detection head are then applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S×S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probabilities and the offsets to the anchor. YOLO is faster than Faster RCNN as it does not require the time-consuming region feature extraction via the ROI pooling operation.
Similarly, Liu et al. [44] proposed the single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios are predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations of the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors; it instead predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
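Decoding detections from such a heat map amounts to finding local peaks and reading the size predictions at those positions. The following toy NumPy sketch is our own simplification (the actual method extracts peaks with max pooling and additionally predicts sub-pixel center offsets):

```python
import numpy as np

def decode_centers(heatmap, sizes, threshold=0.5):
    """Toy center-point decoding: treat every cell whose score exceeds
    `threshold` and is the maximum of its 3x3 neighbourhood as an
    object centre, and read its predicted (width, height)."""
    h, w = heatmap.shape
    boxes = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y, x]
            if score < threshold:
                continue
            patch = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if score == patch.max():
                bw, bh = sizes[y, x]
                boxes.append((x, y, bw, bh, score))
    return boxes

heat = np.zeros((5, 5))
heat[2, 3] = 0.9                 # one confident centre at (x=3, y=2)
sizes = np.zeros((5, 5, 2))
sizes[2, 3] = (4.0, 6.0)         # predicted width and height there
boxes = decode_centers(heat, sizes)
print(len(boxes), boxes[0][:2])  # 1 (3, 2)
```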
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task on videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames of the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict the action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the action recognition performance, which demonstrates that the appearance and motion information are complementary to each other.
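The simplest way to combine the two streams is late fusion of their class scores. A minimal sketch (the logits below are made-up example values for three action classes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Late fusion of the two streams: average the per-class softmax scores
# of the spatial (RGB) and temporal (optical flow) networks.
rgb_logits = np.array([2.0, 0.5, 0.1])
flow_logits = np.array([1.2, 1.1, 0.2])
fused = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
print(int(fused.argmax()))  # 0 (predicted action class)
```

Because each softmax sums to one, the fused scores remain a valid class distribution.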
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through extensive experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the action recognition performance. They argued that temporal information is important but consecutive frames from a video are redundant because of their high similarity. Therefore, they divided a video into segments and
randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as the input to the two-stream network, and a consensus function was then applied to the outputs within each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped optical flow and RGB differences, which can further benefit the action recognition performance. Besides, using pre-trained models for initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
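The segment-based sampling scheme can be sketched in a few lines of Python (an illustrative simplification that assumes the number of frames divides evenly into the segments):

```python
import random

def sample_segment_frames(num_frames, num_segments, rng=random):
    """Divide the frame indices of a video into equal segments and
    randomly pick one representative frame per segment."""
    seg_len = num_frames // num_segments
    return [rng.randrange(i * seg_len, (i + 1) * seg_len)
            for i in range(num_segments)]

rng = random.Random(0)
frames = sample_segment_frames(90, 3, rng)
print(len(frames))  # 3
# Each sampled index falls inside its own segment of 30 frames.
```

This keeps the temporal coverage of the whole video while processing only a handful of frames per training iteration.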
2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolutions on feature extraction for videos. They argued that while 2D convolutions only consider spatial information, 3D convolutions are able to take advantage of the temporal information of the input, which is also important in a video. They showed that a kernel size of 3×3×3 is a reasonable setting among several candidate settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared with the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that this pre-training can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
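The inflation step can be sketched as follows: each 2D kernel is repeated along a new temporal axis and rescaled, so that the inflated network initially reproduces the 2D network's activations on a temporally constant ("boring") video. A NumPy sketch (the kernel shapes are example values):

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, time_dim):
    """Inflate a 2D kernel of shape (out_c, in_c, kh, kw) to 3D by
    repeating it along a new temporal axis and dividing by the
    temporal size, preserving the response on constant videos."""
    k3d = np.repeat(kernel_2d[:, :, None, :, :], time_dim, axis=2)
    return k3d / time_dim

k2d = np.random.rand(64, 3, 7, 7)
k3d = inflate_2d_kernel(k2d, time_dim=7)
print(k3d.shape)                          # (64, 3, 7, 7, 7)
print(np.allclose(k3d.sum(axis=2), k2d))  # True
```

The check in the last line confirms that summing the inflated kernel over time recovers the original 2D kernel, which is exactly the property that makes ImageNet-pretrained weights a sensible initialization.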
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
As in action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be exploited. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these
two modalities. Finally, action tubes were constructed by linking the detections generated in the previous stages. The detections were first linked based on their action class scores and spatial overlap, and the linked action tubes were then temporally trimmed by ensuring label consistency.
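A simplified version of such linking can be sketched as a greedy pass over frames, scoring each candidate detection by its class score plus its spatial overlap with the previously linked box. This is our own illustration (the weight w is an example parameter; the actual methods solve the linking with dynamic programming over whole videos):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link_tube(per_frame_dets, w=1.0):
    """Greedy linking: per_frame_dets[t] is a list of (box, score);
    at each frame pick the detection maximising score + w * IoU
    with the previously linked box."""
    tube = [max(per_frame_dets[0], key=lambda d: d[1])]
    for dets in per_frame_dets[1:]:
        prev_box = tube[-1][0]
        tube.append(max(dets, key=lambda d: d[1] + w * iou(d[0], prev_box)))
    return tube

dets = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 0, 11, 10), 0.6), ((50, 50, 60, 60), 0.7)],
]
tube = link_tube(dets)
print([d[0] for d in tube])  # [(0, 0, 10, 10), (1, 0, 11, 10)]
```

Note how the second frame's lower-scoring but spatially consistent detection wins over the higher-scoring distant one, which is exactly the behaviour that keeps tubes spatially coherent.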
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for the two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to avoid the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to exploit the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and outputted a set of bounding boxes forming a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form the representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than the action tubes linked from frame-level detections because they were produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest and then progressively refined the coordinates of the anchors over iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which also requires the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of the untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to categorize the actions in each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a C3D-based classification network to predict labels. They then used the trained classification model to initialize another C3D model as a localization model, which took the proposals and their overlap scores as inputs and was trained with a newly proposed loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features from the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which were then used to determine the temporal boundaries of actions for proposal generation.
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce
the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the temporal action localization performance compared with other state-of-the-art methods.
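The grouping step can be sketched as a simple run-length scan over thresholded snippet scores (an illustrative simplification of the idea; the actual method uses a watershed-style scheme with multiple thresholds):

```python
def group_actionness(scores, threshold=0.5):
    """Group consecutive snippets whose actionness score exceeds
    `threshold` into temporal proposals, returned as
    (start_index, end_index) pairs with the end exclusive."""
    proposals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                     # open a new proposal
        elif s < threshold and start is not None:
            proposals.append((start, i))  # close the current proposal
            start = None
    if start is not None:
        proposals.append((start, len(scores)))
    return proposals

scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.9, 0.1]
print(group_actionness(scores))  # [(1, 4), (5, 7)]
```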
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as the boundaries are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations of the anchors and predict their action classes.
Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the
prediction anchor instances. Their method achieved results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that extreme variations in action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated from pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals, as in MGG [46], but without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision-and-language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, from which spatio-temporal tubes were generated by a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatio-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector, in this case to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames and generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduced the existing methods for the image classification problem. Then we reviewed the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarized the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also reviewed the existing spatio-temporal action localization methods. Additionally, we reviewed the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods, and discussed the recent works on combining these two types of methods to leverage their complementary information. Finally, we gave a review of the new vision-and-language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train the other stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help the other stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to the other stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatial-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, including the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].
Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames around one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.
Third, temporal localization aims to form action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects as well as the action class scores. To improve the localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve the action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach, in which the features extracted from the two streams are fused to improve the action detection performance. For example, in [60], the softmax score of each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late fusion strategy: they are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for the other stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
An overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1(a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the set of region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
[Figure 3.1: three-panel diagram]

Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve the action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules, which pass messages from one stream to the other. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
by first extracting RGB/flow features from these ROIs and then refining these features via a message-passing module (shown in Fig. 3.1(b) and Fig. 3.1(c)), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method we use the region proposals from one stream to help the other stream.
In this work, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets: the first subset comes from the detection boxes of the same stream (e.g., RGB) at the previous stage, and the second subset comes from the detection boxes of the other stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from the other stream (e.g., flow) are used for the current stream (e.g., RGB).
Mathematically, let P_t^(i) and B_t^(i) denote the set of region proposals and the set of detection boxes, respectively, where the superscript i indicates the i-th stream and the subscript t denotes the t-th stage. The region proposal set P_t^(i) is updated as P_t^(i) = B_{t-2}^(i) ∪ B_{t-1}^(j), and the detection box set B_t^(i) is updated as B_t^(i) = G(P_t^(i)), where G(·) denotes the mapping function of the detection head module. Initially, when t = 0, P_0^(i) is the set of region proposals from the RPN of the i-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
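The alternating update rule above can be sketched as follows. This is an illustrative sketch only: the `detection_head` callable stands in for the mapping G(·), and the stage indexing (RGB refined at odd stages, flow at even stages) follows Fig. 3.1.

```python
def cross_stream_proposal_update(initial_proposals, detection_head, num_stages=4):
    """Sketch of the region-proposal cooperation rule
    P_t^(i) = B_{t-2}^(i) ∪ B_{t-1}^(j),  B_t^(i) = G(P_t^(i)).
    Stream 0 is RGB and stream 1 is flow; `detection_head(i, proposals)`
    is a stand-in for the detection head mapping G(·) of stream i.
    """
    # B[i] keeps the most recent detection boxes of stream i;
    # initialise both streams from their RPN proposals (stage t = 0).
    P = {0: list(initial_proposals[0]), 1: list(initial_proposals[1])}
    B = {0: detection_head(0, P[0]), 1: detection_head(1, P[1])}
    for t in range(1, num_stages + 1):
        i = (t + 1) % 2      # stream refined at this stage (RGB at odd stages)
        j = 1 - i            # assisting stream
        P[i] = B[i] + B[j]   # own previous boxes ∪ the other stream's latest boxes
        B[i] = detection_head(i, P[i])
    return B
```

In practice the detection head also scores and regresses the boxes; here it is reduced to an arbitrary callable to show only the flow of proposals between the two streams.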
With our approach, the diversity of the region proposals from one stream is enhanced by using the complementary boxes from the other stream, which helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. First, as a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Second, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data at the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work the same pipeline used in the training stage is also adopted for the testing samples at the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled in the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as C_2^i, C_3^i, C_4^i, C_5^i, where i ∈ {RGB, Flow} indicates the RGB and flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as U_2^i, U_3^i, U_4^i, U_5^i.
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the two streams to exchange information with each other and benefit from such an exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other through feature refinement.
We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote by l ∈ {2, 3, 4, 5} the index of a feature map in {C_2^i, C_3^i, C_4^i, C_5^i}. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features C_l^RGB with the help of the features C_l^Flow as follows:

    C̃_l^RGB = f_θ(C_l^Flow) ⊕ C_l^RGB,    (3.1)

where ⊕ denotes the element-wise addition of feature maps and f_θ(·) is the mapping function (parameterized by θ) of our message-passing module. The function f_θ(C_l^Flow) nonlinearly extracts the message from the feature C_l^Flow, and the extracted message is then used to improve the feature C_l^RGB.

The output of f_θ(·) must have the same number of channels and the same resolution as C_l^RGB and C_l^Flow. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function: the first 1×1 convolutional layer reduces the channel dimension, and the second restores the channel dimension to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps C̃_l^RGB in the bottom-up pathway are refined, the corresponding feature maps U_l^RGB in the top-down pathway are generated accordingly.
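As a rough illustration of Eqn. (3.1), the following NumPy sketch implements the bottleneck message-passing step; a 1×1 convolution is just a per-location linear map over channels. The weight matrices `w_reduce` and `w_restore` are hypothetical stand-ins for the learnt parameters, not the actual trained weights.

```python
import numpy as np

def conv1x1(x, w):
    """1×1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(w.shape[0], h, wd)

def message_passing(c_rgb, c_flow, w_reduce, w_restore):
    """Sketch of Eqn. (3.1): the message f_theta(C^Flow) is computed by a
    bottleneck of two 1×1 convolutions with a ReLU in between, then added
    element-wise (⊕) to C^RGB."""
    msg = conv1x1(c_flow, w_reduce)   # first 1×1 conv: reduce channel dimension
    msg = np.maximum(msg, 0.0)        # ReLU activation
    msg = conv1x1(msg, w_restore)     # second 1×1 conv: restore channel count
    return c_rgb + msg                # element-wise addition with the RGB feature
```

Swapping the roles of `c_rgb` and `c_flow` (with separate weights) gives the flow-stream refinement used at the even stages.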
The above process performs image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream; it provides good features to initialize the subsequent message-passing stages.
Denote the image-level feature map sets for the RGB and flow streams by I^RGB and I^Flow, respectively. They are used to extract the ROI features F_t^RGB and F_t^Flow by ROI pooling at each stage t, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature F_t^Flow of the flow stream is used to help improve the ROI feature F_t^RGB of the RGB stream, as illustrated in Fig. 3.1(b); specifically, the improved RGB feature F̃_t^RGB is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature F_t^RGB of the RGB stream is used to help improve the ROI feature F_t^Flow of the flow stream, as illustrated in Fig. 3.1(c). The message passing between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] to stack neighbouring frames in order to exploit temporal context and improve action detection performance for key frames, where a key frame is a frame containing the ground-truth actions. Each training sample used to train the RPN in our PCSC method is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, and negative labels when their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assisting stream during the region-proposal-level cooperation process.
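A minimal sketch of this IoU-based label assignment follows; the box format `(x1, y1, x2, y2)` and the helper names are our own conventions, not taken from any released code.

```python
def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # horizontal intersection
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # vertical intersection
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter > 0 else 0.0

def assign_labels(proposals, gt_boxes, thresh=0.5):
    """Positive (1) if IoU > thresh with any ground-truth box, else negative (0);
    the same rule is applied to boxes borrowed from the assisting stream."""
    return [1 if any(iou(p, g) > thresh for g in gt_boxes) else 0
            for p in proposals]
```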
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
[Figure 3.2: two-panel diagram]

Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector (AD) and the Action Segment Detector (ASD) to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information from the two streams at both the feature level and the segment proposal level (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2. Each stage comprises the cooperation modules and the detection head. The segment-proposal-level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module we use each bounding box in the action tubes to extract the 7×7 feature map for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector that evaluates actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework that detects action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization; it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from all stages are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
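The NMS applied to temporal segments in the Post-processing module can be sketched as a greedy 1-D variant; this is a simplified illustration, and the function names are ours.

```python
def temporal_iou(s1, s2):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(s1[1], s2[1]) - max(s1[0], s2[0]))
    union = (s1[1] - s1[0]) + (s2[1] - s2[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_nms(segments, scores, thresh=0.5):
    """Greedy NMS over 1-D segments: keep the highest-scoring segment,
    suppress any remaining segment whose temporal IoU with a kept one
    exceeds the threshold; returns indices of the kept segments."""
    order = sorted(range(len(segments)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(temporal_iou(segments[k], segments[j]) <= thresh for j in keep):
            keep.append(k)
    return keep
```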
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, and as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These "confusing samples" are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier for class i is the probability that an action of class i is happening.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). A median filter over multiple frames is then employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from the action tube; in this way we refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
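A minimal sketch of this smoothing-and-thresholding step is given below. The default window size of 17 and threshold of 0.01 follow the implementation details in Section 3.4.1; the truncated edge handling of the median filter is our own simplification.

```python
def refine_tube(actionness, window=17, thresh=0.01):
    """Median-filter the per-frame actionness scores of a tube and keep
    only the frame indices whose smoothed score reaches the threshold."""
    n, half = len(actionness), window // 2
    kept = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)   # truncated window at edges
        smoothed = sorted(actionness[lo:hi])[(hi - lo) // 2]  # (upper) median
        if smoothed >= thresh:
            kept.append(t)
    return kept
```

The kept frame indices define the refined temporal extent of the tube; in the full method the surviving boxes are then regrouped into tubes with tighter start and end points.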
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster R-CNN framework [5], we also build an action segment detector that directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the framework in [5], which includes a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial segment detections. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.
(a) SPN. In Faster R-CNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (temporal lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary within a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] to predict each anchor based on anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
[Figure 3.3: diagram]
Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but a different dilation rate is employed to generate features with different receptive fields and to determine the temporal boundaries. In this way, our SPN consists of K sub-networks targeting different temporal lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context with a temporal span of s/2 before the start and after the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with kernel size 1. The output segment proposals are ranked according to their actionness probabilities, and NMS is applied to remove redundant proposals. Please refer to [5] for more details.
(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length ls, start point tstart and end point tend, we extract three types of features: Fcenter, Fstart and Fend. We use temporal ROI-Pooling on top of the segment proposal region to pool the action tube features and obtain the feature Fcenter ∈ R^{D×lt}, where lt is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented in our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features Fstart (resp. Fend) from the segments centered at tstart (resp. tend) with a length of 2ls/5 to represent the starting (resp. ending) information. We then concatenate Fstart, Fcenter and Fend to construct the segment feature Fseg.
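The segment feature construction can be sketched as follows, using max-pooling over temporal bins as a simplified stand-in for temporal ROI-Pooling; the binning scheme and edge handling are our own assumptions.

```python
import numpy as np

def temporal_roi_pool(feat, start, end, out_len):
    """Max-pool tube features feat of shape (D, T) over the span [start, end)
    into out_len temporal bins (a simple stand-in for temporal ROI-Pooling)."""
    edges = np.linspace(start, end, out_len + 1)
    bins = []
    for k in range(out_len):
        lo = int(np.floor(edges[k]))
        hi = max(int(np.ceil(edges[k + 1])), lo + 1)  # ensure a non-empty bin
        bins.append(feat[:, lo:hi].max(axis=1))
    return np.stack(bins, axis=1)

def segment_features(feat, t_start, t_end, out_len=4):
    """Concatenate F_start, F_center and F_end along the temporal axis; the
    boundary features are pooled from spans of length 2*ls/5 centred at the
    segment endpoints, clipped to the tube length."""
    ls = t_end - t_start
    ctx = ls / 5.0          # half-width of the 2*ls/5 boundary span
    T = feat.shape[1]
    f_center = temporal_roi_pool(feat, t_start, t_end, out_len)
    f_start = temporal_roi_pool(feat, max(0, t_start - ctx), min(T, t_start + ctx), out_len)
    f_end = temporal_roi_pool(feat, max(0, t_end - ctx), min(T, t_end + ctx), out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)
```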
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with kernel size 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that predicting actionness for each frame helps extract discriminative features, which eventually helps refine the segment boundaries.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of a feature cooperation module and a proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with kernel size 1. We only apply T-PCSC for two stages in our action segment detection method, due to memory constraints and the observation that we already achieve good results after two stages. Similar to our PCSC framework shown in Fig. 3.1, in the first stage of our T-PCSC framework the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2(b), for both the RGB and flow streams, and the feature cooperation module then passes messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2(c). A similar process takes place in the second stage, but this time the output segments from the flow stream and from the first stage are used to form the segment proposal set, and message passing is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of both stages and applying NMS to remove redundancy.
3.4 Experimental Results
We introduce our experimental setup and datasets in Section 3.4.1, compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes, and all the videos are trimmed to contain the actions only. We perform the experiments on the three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
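As a concrete illustration of these metrics, the box IoU and the COCO-style averaging over the 10 thresholds can be sketched as below. This is our own sketch; `mean_ap` stands for a hypothetical routine that computes the mAP at a single IoU threshold.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def coco_style_map(mean_ap, detections, ground_truth):
    """COCO-style metric: average the mAPs over the 10 IoU thresholds
    0.5, 0.55, ..., 0.95. `mean_ap(dets, gts, t)` is assumed to compute
    the mAP at a single threshold t."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(mean_ap(detections, ground_truth, t)
               for t in thresholds) / len(thresholds)
```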
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these per-class averaged IoU scores over all classes.
To evaluate the classification performance of our method, we report the average confidence score at different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
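The two statistics can be sketched as follows, under assumed data layouts (detections and ground-truth boxes grouped by class for MABO; (box, score) pairs for the average confidence score); this is an illustration rather than the evaluation code used in the thesis.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mabo(dets_by_class, gts_by_class):
    """Mean Average Best Overlap: for each class, average the best IoU
    achieved for every ground-truth box, then average over classes."""
    per_class = []
    for cls, gts in gts_by_class.items():
        dets = dets_by_class.get(cls, [])
        best = [max((iou(g, d) for d in dets), default=0.0) for g in gts]
        per_class.append(sum(best) / len(best))
    return sum(per_class) / len(per_class)

def avg_confidence(dets, gts, theta):
    """Average score of the detections whose best IoU with any
    ground-truth box is greater than theta; dets are (box, score) pairs."""
    scores = [s for box, s in dets
              if max((iou(box, g) for g in gts), default=0.0) > theta]
    return sum(scores) / len(scores) if scores else 0.0
```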
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, and it is decreased by a factor of 10 at the 5th epoch and again at the 6th epoch. For our actionness detector, we set the window size as 17 when using a median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞], i.e., we have M = 4, and the corresponding feature lengths l_t for the length ranges are set to 15, 30, 60, 120. The numbers of scales and lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and again at the 5th epoch.
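The NMS step mentioned above (overlap threshold 0.5) can be illustrated with a minimal greedy implementation over temporal segments; this is our own sketch, not the thesis code.

```python
def temporal_iou(a, b):
    """IoU of two 1-D segments whose first two entries are (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(segments, thresh=0.5):
    """Greedy NMS over (start, end, score) segments: repeatedly keep the
    highest-scoring segment and drop any remaining segment whose overlap
    with a kept one reaches `thresh`."""
    keep = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, kept) < thresh for kept in keep):
            keep.append(seg)
    return keep
```

The same greedy rule applies to spatial boxes by swapping in a 2-D IoU.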
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not have them. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                   Streams    mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]   RGB+Flow   35.8
Peng and Schmid [54]                     RGB+Flow   65.7
Ye, Yang and Tian [94]                   RGB+Flow   67.0
Kalogeiton et al. [26]                   RGB+Flow   69.5
Gu et al. [19]                           RGB+Flow   76.3
Faster R-CNN + FPN (RGB) [38]            RGB        73.2
Faster R-CNN + FPN (Flow) [38]           Flow       64.7
Faster R-CNN + FPN [38]                  RGB+Flow   75.5
PCSC (Ours)                              RGB+Flow   79.2
Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38], with an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.

Method (IoU threshold δ)          Streams         0.2    0.5    0.75   0.5:0.95
Weinzaepfel et al. [83]           RGB+Flow        46.8   -      -      -
Zolfaghari et al. [105]           RGB+Flow+Pose   47.6   -      -      -
Peng and Schmid [54]              RGB+Flow        73.5   32.1   2.7    7.3
Saha et al. [60]                  RGB+Flow        66.6   36.4   0.8    14.4
Yang, Gao and Nevatia [93]        RGB+Flow        73.5   37.8   -      -
Singh et al. [67]                 RGB+Flow        73.5   46.3   15.0   20.4
Ye, Yang and Tian [94]            RGB+Flow        76.2   -      -      -
Kalogeiton et al. [26]            RGB+Flow        76.5   49.2   19.7   23.4
Li et al. [31]                    RGB+Flow        77.9   -      -      -
Gu et al. [19]                    RGB+Flow        -      59.9   -      -
Faster R-CNN + FPN (RGB) [38]     RGB             77.5   51.3   14.2   21.9
Faster R-CNN + FPN (Flow) [38]    Flow            72.7   44.2   7.4    17.2
Faster R-CNN + FPN [38]           RGB+Flow        80.1   53.2   15.9   23.7
PCSC + AD [69]                    RGB+Flow        84.3   61.0   23.0   27.8
PCSC + ASD + AD + T-PCSC (Ours)   RGB+Flow        84.7   63.1   23.6   28.9
per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they perform worse than our method. Moreover, our method outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the temporal PCSC) is applied to refine the action tubes generated by our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than the other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method                           Streams    mAP (%)
Peng and Schmid [54]             RGB+Flow   58.5
Ye, Yang and Tian [94]           RGB+Flow   63.2
Kalogeiton et al. [26]           RGB+Flow   65.7
Hou, Chen and Shah [21]          RGB        61.3
Gu et al. [19]                   RGB+Flow   73.3
Sun et al. [72]                  RGB+Flow   77.9
Faster R-CNN + FPN (RGB) [38]    RGB        64.7
Faster R-CNN + FPN (Flow) [38]   Flow       68.4
Faster R-CNN + FPN [38]          RGB+Flow   70.2
PCSC (Ours)                      RGB+Flow   74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method (IoU threshold δ)          Streams         0.2    0.5    0.75   0.5:0.95
Gkioxari and Malik [18]           RGB+Flow        -      53.3   -      -
Wang et al. [79]                  RGB+Flow        -      56.4   -      -
Weinzaepfel et al. [83]           RGB+Flow        63.1   60.7   -      -
Saha et al. [60]                  RGB+Flow        72.6   71.5   43.3   40.0
Peng and Schmid [54]              RGB+Flow        74.1   73.1   -      -
Singh et al. [67]                 RGB+Flow        73.8   72.0   44.5   41.6
Kalogeiton et al. [26]            RGB+Flow        74.2   73.7   52.1   44.8
Ye, Yang and Tian [94]            RGB+Flow        75.8   73.8   -      -
Zolfaghari et al. [105]           RGB+Flow+Pose   78.2   73.4   -      -
Hou, Chen and Shah [21]           RGB             78.4   76.9   -      -
Gu et al. [19]                    RGB+Flow        -      78.6   -      -
Sun et al. [72]                   RGB+Flow        -      80.1   -      -
Faster R-CNN + FPN (RGB) [38]     RGB             74.5   73.9   54.1   44.2
Faster R-CNN + FPN (Flow) [38]    Flow            77.9   76.1   56.1   46.3
Faster R-CNN + FPN [38]           RGB+Flow        79.1   78.5   57.2   47.6
PCSC (Ours)                       RGB+Flow        82.6   82.2   63.1   52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best, and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study (frame-level mAPs) for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage   PCSC w/o feature cooperation   PCSC
0       75.5                           78.1
1       76.1                           78.6
2       76.4                           78.8
3       76.7                           79.1
4       76.7                           79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.

Method                       Video mAP (%)
Faster R-CNN + FPN [38]      53.2
PCSC                         56.4
PCSC + AD                    61.0
PCSC + ASD (single branch)   58.4
PCSC + ASD                   60.7
PCSC + ASD + AD              61.9
PCSC + ASD + AD + T-PCSC     63.1
Table 3.7: Ablation study for our temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage   Video mAP (%)
0       61.9
1       62.8
2       63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and the flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, the improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
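The stage-wise output rule described above (NMS over the union of the detections from all stages up to the current one) can be sketched as follows; the names are ours and `nms` is supplied by the caller.

```python
def stage_output(per_stage_boxes, k, nms):
    """Output at stage k: apply NMS to the union of the detected boxes
    from stages 0, 1, ..., k (per_stage_boxes[i] holds stage i's boxes)."""
    union = [box for stage in per_stage_boxes[:k + 1] for box in stage]
    return nms(union)
```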
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off its key components, where video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization and the action segment detector (ASD) for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detector with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector on the UCF-101-24 dataset.

Stage   LoR (%)
1       61.1
2       71.9
3       95.2
4       95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and reaches the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages, to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector on the UCF-101-24 dataset.

Stage   MABO (%)
1       84.1
2       84.3
3       84.5
4       84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ     Stage 1   Stage 2   Stage 3   Stage 4
0.5   55.3      55.7      56.3      57.0
0.6   62.1      63.2      64.3      65.5
0.7   68.0      68.5      69.4      69.5
0.8   73.9      74.5      74.8      75.1
0.9   78.4      78.4      79.5      80.1
Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                   0     1     2     3     4
Detection speed (FPS)   3.1   2.9   2.8   2.6   2.4
Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                   0     1     2
Detection speed (VPS)   2.1   1.7   1.5
ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream, we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from the two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
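The LoR statistic can be sketched as follows (our own illustrative implementation, with region proposals given as (x1, y1, x2, y2) boxes):

```python
def large_overlap_ratio(rgb_props, flow_props, overlap=0.7):
    """Fraction of RGB-stream proposals whose maximum IoU (mIoU) with
    the flow-stream proposals is larger than `overlap`."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0
    hits = sum(1 for r in rgb_props
               if max((iou(r, f) for f in flow_props), default=0.0) > overlap)
    return hits / len(rgb_props)
```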
To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure its localization and classification performance. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be clearly observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.
Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, one can trade off localization performance against speed by choosing an appropriate number of stages.
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b)-(d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,
and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missing detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missing detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).
Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases that our method cannot handle (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance could be reduced after the temporal refinement process by our TBR module, the instance could not be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot work well for this challenging case either. Therefore, how to reduce false detections in the background frames is still an open issue, which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4
PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of work, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of work is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, while the temporal action localization performance is gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] applied convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams for complementary information exchange between both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, many temporal action localization methods employ the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames, and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream R-CNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize actions for each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which were then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on a given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details. However, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level
interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine the complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach
In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.
The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.
Base Feature Extraction: For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.
Anchor-based Segment Proposal Generation: This module provides the anchor-based knowledge for the subsequent cooperation stage, and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $P^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $P^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$/$P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Sec. 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$/$S_0^{Flow}$ and the predicted action categories.
Frame-based Segment Proposal Generation: The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
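The pairing-and-NMS step described above can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names, the confidence threshold, and the use of the product of the two boundary probabilities as the proposal score are our own assumptions.

```python
def temporal_iou(a, b):
    """Temporal IoU between two segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def frame_based_proposals(start_prob, end_prob, thresh=0.5, nms_tiou=0.7):
    """Pair confident start/end indices into proposals, then apply NMS.

    start_prob / end_prob: per-snippet starting/ending probabilities.
    Returns (t_start, t_end, score) tuples; the score is assumed to be
    the product of the two boundary probabilities.
    """
    starts = [t for t, p in enumerate(start_prob) if p >= thresh]
    ends = [t for t, p in enumerate(end_prob) if p >= thresh]
    cands = [(ts, te, start_prob[ts] * end_prob[te])
             for ts in starts for te in ends if te > ts]  # valid durations only
    # Non-maximum suppression: greedily keep the best-scored proposal and
    # drop remaining candidates overlapping it by more than nms_tiou.
    cands.sort(key=lambda c: c[2], reverse=True)
    keep = []
    for c in cands:
        if all(temporal_iou(c[:2], k[:2]) <= nms_tiou for k in keep):
            keep.append(c)
    return keep
```

For instance, a confident start at $t = 2$ (plus a weaker one at $t = 3$) and a confident end at $t = 7$ yield two heavily overlapping candidates, which NMS reduces to the single best-scored proposal $(2, 7)$.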
Progressive Cross-granularity Cooperation Framework: Given the two-stream features and proposals produced by the preceding two proposal generation modules, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

[Figure 4.1 appears here.]

Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $S^{RGB}_{n-2}$ and features $P^{RGB}_{n-2}$, $Q^{Flow}_{n-2}$ are denoted as $S^{RGB}_0$, $P^{RGB}_0$ and $Q^{Flow}_0$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Sec. 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
4.2.2 Progressive Cross-Granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously from the two-granularity scheme (i.e., the anchor-based and frame-based temporal localization methods) and from the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as RGB-stream AFC and flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ together with the flow-stream frame-based feature $Q^{Flow}_{n-2}$ as the input features. This AFC stage outputs the RGB-stream action segments $S^{RGB}_n$ and anchor-based feature $P^{RGB}_n$, and also updates the flow-stream frame-based features as $Q^{Flow}_n$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that, within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since the rich cooperations are performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages; thus, the proposed progressive localization strategy can gradually improve the localization performance.
4.2.3 Anchor-frame Cooperation Module
The Anchor-frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods can not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization: As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P^{RGB}_0$, $P^{Flow}_0$ and $Q^{RGB}_0$, $Q^{Flow}_0$, as well as the proposals $S^{RGB}_0$, $S^{Flow}_0$, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)): Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P^{Flow}_{n-1}$ to the frame-based features $Q^{Flow}_{n-2}$, both from the flow stream. The message passing operation is defined as

$$Q^{Flow}_n = \mathcal{M}_{anchor \rightarrow frame}(P^{Flow}_{n-1}) \oplus Q^{Flow}_{n-2}, \qquad (4.1)$$

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q^{Flow}_n$ can generate more accurate action starting and ending probability distributions (a.k.a. more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and it can thus generate better frame-based proposals $Q^{Flow}_n$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q^{RGB}_n = \mathcal{M}_{anchor \rightarrow frame}(P^{RGB}_{n-1}) \oplus Q^{RGB}_{n-2}$, which flips the superscripts from Flow to RGB. $Q^{RGB}_n$ also leads to better frame-based RGB-stream proposals $Q^{RGB}_n$.
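Concretely, since a $1 \times 1$ temporal convolution acts as a per-time-step linear map over channels, Eq. (4.1) can be sketched with plain matrix products. This is a toy illustration with random weights; whether an activation also follows the second convolution is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 8, 16                       # feature channels and temporal length
W1 = rng.standard_normal((C, C))   # first 1x1 conv as a channel-mixing matrix
W2 = rng.standard_normal((C, C))   # second 1x1 conv

def message_anchor_to_frame(p):
    """M_{anchor->frame}: two stacked 1x1 convolutions with ReLU."""
    return np.maximum(W2 @ np.maximum(W1 @ p, 0.0), 0.0)

P_flow_prev = rng.standard_normal((C, T))  # anchor-based features P^Flow_{n-1}
Q_flow_old = rng.standard_normal((C, T))   # frame-based features  Q^Flow_{n-2}

# Eq. (4.1): element-wise addition of the passed message and the old features.
Q_flow_new = message_anchor_to_frame(P_flow_prev) + Q_flow_old
```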
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)): We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $Q^{Flow}_n$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q^{Flow}_n$. We then combine $Q^{Flow}_n$ together with the two-stream anchor-based proposals $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $P^{RGB}_n = S^{RGB}_{n-2} \cup S^{Flow}_{n-1} \cup Q^{Flow}_n$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $P^{Flow}_n = S^{Flow}_{n-2} \cup S^{RGB}_{n-1} \cup Q^{RGB}_n$.
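The three-way union above can be sketched as follows. This is a simplified stand-in: the thesis describes a plain set union, and merging exact duplicate segments while keeping the higher score is our own assumption.

```python
def combine_proposals(s_main_prev2, s_other_prev1, q_frame_new):
    """Union of the three proposal sets (t_start, t_end, score) at stage n.

    Exact duplicates (identical boundaries) are merged, keeping the
    higher score; overlapping-but-distinct proposals are all retained.
    """
    best = {}
    for ts, te, score in s_main_prev2 + s_other_prev1 + q_frame_new:
        if (ts, te) not in best or score > best[(ts, te)]:
            best[(ts, te)] = score
    return sorted((ts, te, s) for (ts, te), s in best.items())
```

For example, combining an anchor-based proposal that also appears in the other stream with a new frame-based proposal keeps one copy of the duplicate (at its higher score) plus the distinct segments.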
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)): In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated by a message passing operation similar to Eq. (4.1), i.e., $P^{RGB}_n = \mathcal{M}_{flow \rightarrow rgb}(P^{Flow}_{n-1}) \oplus P^{RGB}_{n-2}$ for the RGB stream and $P^{Flow}_n = \mathcal{M}_{rgb \rightarrow flow}(P^{RGB}_{n-1}) \oplus P^{Flow}_{n-2}$ for the flow stream.
Detection Head: The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $P^{RGB}_n$/$P^{Flow}_n$ and the updated anchor-based features $P^{RGB}_n$/$P^{Flow}_n$. To increase the perceptual aperture of action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (fixed to 10 in this work), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which additionally provides performance gains independent of our cooperation schemes.
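The pooling scheme above can be sketched as follows. This is a simplified stand-in, not the SOI pooling of [5]: max-pooling over equal sub-intervals, the bin count, and the helper names are our own assumptions; only the "segment plus two boundary-context poolings, then concatenate" layout follows the text.

```python
import numpy as np

def soi_pool(feat, ts, te, bins=4):
    """Max-pool a (C, T) feature map over `bins` equal sub-intervals of
    [ts, te), yielding a fixed-length (C * bins) vector."""
    edges = np.linspace(ts, te, bins + 1)
    parts = []
    for b in range(bins):
        lo = int(np.floor(edges[b]))
        hi = max(int(np.ceil(edges[b + 1])), lo + 1)  # keep each bin non-empty
        parts.append(feat[:, lo:hi].max(axis=1))
    return np.concatenate(parts)

def proposal_feature(feat, ts, te, context=10, bins=4):
    """Segment feature plus two extra poolings around the starting and
    ending indices (the interval is fixed to 10, as in the chapter)."""
    T = feat.shape[1]
    seg = soi_pool(feat, ts, te, bins)
    start_ctx = soi_pool(feat, max(ts - context, 0), min(ts + context, T), bins)
    end_ctx = soi_pool(feat, max(te - context, 0), min(te + context, T), bins)
    return np.concatenate([seg, start_ctx, end_ctx])
```

The concatenated vector (segment, start context, end context) is what would then feed the segment regressor and action classifier.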
4.2.4 Training Details
At each AFC stage, the training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: one is the cross-entropy loss for the action classification task, and the other is the smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
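The summed training objective can be sketched numerically as follows (toy values; any loss weighting and the exact label-assignment strategy of [46] are outside this sketch):

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one classification prediction."""
    return -np.log(probs[label] + 1e-12)

def smooth_l1(pred, target):
    """Smooth-L1 loss for the boundary regression task."""
    d = np.abs(pred - target)
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum())

# Anchor-based branch: action classification + boundary regression.
cls_loss = cross_entropy(np.array([0.1, 0.7, 0.2]), label=1)
reg_loss = smooth_l1(np.array([10.2, 25.6]), np.array([10.0, 26.0]))

# Frame-based branch: cross-entropy on frame-wise start/end labels
# (a single frame is shown for brevity).
frame_loss = cross_entropy(np.array([0.8, 0.2]), label=0)

# All losses are summed up and the model is trained jointly.
total_loss = cls_loss + reg_loss + frame_loss
```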
4.2.5 Discussions
Variants of the AFC Module: Besides using the aforementioned AFC module as illustrated in Fig. 4.1(b-c), one can also simplify the structure by either removing the entire frame-based branch or the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block for fusing the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages: Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features from the gradually absorbed cross-stream information and cross-granularity knowledge, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals: The number of proposals does not rapidly increase as the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. On the other hand, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.
4.3 Experiment
4.3.1 Datasets and Setup
Datasets: We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1,010 and 1,574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.
- UCF-101-24 consists of 3,207 untrimmed videos from 24 action classes, with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions in each frame and generate action tubes by linking the detection results across frames. We treat these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details: We use a two-stream I3D [4] model pre-trained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and to 64 for the detection head. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics: We evaluate our model by the mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
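The tIoU criterion can be written directly from its definition (a minimal sketch; segments are (start, end) pairs in seconds or frame indices, and the helper names are ours):

```python
def tiou(a, b):
    """Temporal intersection-over-union of two segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_seg, pred_label, gt_seg, gt_label, delta):
    """A detection counts as correct if its tIoU with a ground-truth
    instance exceeds the threshold and its action label matches."""
    return pred_label == gt_label and tiou(pred_seg, gt_seg) > delta
```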
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14: We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5,
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ    0.1    0.2    0.3    0.4    0.5
SCNN [63]           47.7   43.5   36.3   28.7   19.0
REINFORCE [95]      48.9   44.0   36.0   26.4   17.1
CDC [62]            -      -      40.1   29.4   23.3
SMS [98]            51.0   45.2   36.5   27.8   17.8
TURN [15]           54.0   50.9   44.1   34.9   25.6
R-C3D [85]          54.5   51.5   44.8   35.6   28.9
SSN [103]           66.0   59.4   51.9   41.0   29.8
BSN [37]            -      -      53.5   45.0   36.9
MGG [46]            -      -      53.9   46.8   37.4
BMN [35]            -      -      56.0   47.4   38.8
TAL-Net [5]         59.8   57.1   53.2   48.5   42.8
P-GCN [99]          69.5   67.8   63.6   57.8   49.1
Ours                71.2   68.9   65.1   59.5   51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two paradigms (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method can still achieve an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.
ActivityNet v1.3: We also compare the mAPs at different tIoU thresholds δ = 0.5, 0.75, 0.95 on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP as well, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 6.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, in which case our average mAP (34.91%) is better than those of BSN and BMN.

Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ    0.5     0.75    0.95    Average
CDC [62]            45.30   26.00   0.20    23.80
TCN [9]             36.44   21.15   3.90    -
R-C3D [85]          26.80   -       -       -
SSN [103]           39.12   23.48   5.49    23.98
TAL-Net [5]         38.23   18.30   1.30    20.22
P-GCN [99]          42.90   28.14   2.47    26.99
Ours                44.31   29.85   5.47    28.85

Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ    0.5     0.75    0.95    Average
BSN [37]            46.45   29.96   8.02    30.03
P-GCN [99]          48.26   33.16   3.27    31.11
BMN [35]            50.07   34.78   8.29    33.85
Ours                52.04   35.92   7.97    34.91
UCF-101-24: To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5, 0.75, and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ   0.2    0.5    0.75   0.5:0.95
LTT [83]             46.8   -      -      -
MRTSR [54]           73.5   32.1   2.7    7.3
MSAT [60]            66.6   36.4   7.9    14.4
ROAD [67]            73.5   46.3   15.0   20.4
ACT [26]             76.5   49.2   19.7   23.4
TAD [19]             -      59.9   -      -
PCSC [69]            84.3   61.0   23.0   27.8
Ours                 84.9   63.5   23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation: We compare the performance among five AFC-related variants and our complete method: (1) "w/o CS-MP": we remove the "Cross-stream Message Passing" block from our AFC module; (2) "w/o CG-MP": we remove the "Cross-granularity Message Passing" block from our AFC module; (3) "w/o PC": we remove the "Proposal Combination" block from our AFC module, in which $Q^{Flow}_n$ and $Q^{RGB}_n$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) "w/o CS": we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) "w/o CG": we remove the entire frame-based branch; in this case, only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ (for the RGB-stream AFC module) and $P^{RGB}_{n-1}$ and $P^{Flow}_{n-2}$ (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions for the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage        0       1       2       3       4
AFC          44.75   46.92   48.14   48.37   48.22
w/o CS       44.21   44.96   45.47   45.48   45.37
w/o CG       43.55   45.95   46.71   46.65   46.67
w/o CS-MP    44.75   45.34   45.51   45.53   45.49
w/o CG-MP    44.75   46.32   46.97   46.85   46.79
w/o PC       42.14   43.21   45.14   45.74   45.77
In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC modules. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $P^{RGB}_0 \cup P^{Flow}_0$ for "w/o CG" and $Q^{RGB}_0 \cup Q^{Flow}_0$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results as the number of stages increases, since these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage  Comparison                LoR (%)
1      P^RGB_1 vs P^Flow_0        67.1
2      P^Flow_2 vs P^RGB_1        79.2
3      P^RGB_3 vs P^Flow_2        89.5
4      P^Flow_4 vs P^RGB_3        97.1
when the number of stages increases, and the results become stable after 2 or 3 stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages for our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets P^RGB_1 at Stage 1 and P^Flow_2 at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in P^Flow_2 among all proposals in P^Flow_2. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P^RGB_1 is higher than a pre-defined threshold 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P^Flow_2) is calculated by comparing it with all proposals in the other proposal set (say P^RGB_1) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that two sets of proposals from different streams are more similar to each other.
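The LoR computation described above can be sketched in a few lines. This is our own illustration (not the thesis code); the 0.7 max-tIoU threshold follows the text, and proposals are represented as (start, end) pairs in seconds.

```python
# Illustrative sketch of the Large-over-Ratio (LoR) score defined above.

def tiou(a, b):
    """Temporal intersection-over-union between two segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current, preceding, thresh=0.7):
    """Ratio of proposals in `current` whose max-tIoU against all
    proposals in `preceding` exceeds `thresh`."""
    if not current:
        return 0.0
    similar = sum(
        1 for p in current
        if max(tiou(p, q) for q in preceding) > thresh
    )
    return similar / len(current)

p_flow_2 = [(0.0, 5.0), (10.0, 20.0), (30.0, 31.0)]  # toy proposal sets
p_rgb_1 = [(0.5, 5.5), (11.0, 19.0), (50.0, 60.0)]
print(lor_score(p_flow_2, p_rgb_1))  # 2 of 3 proposals are similar
```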
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities between the RGB-stream proposals and the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities between the flow-stream proposals and the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. This large LoR score at Stage 4 indicates that the proposals in P^Flow_4 and those in P^RGB_3 are very similar even though they are generated for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal
action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated from our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
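A minimal sketch (our own, for illustration) of the AR@AN metric just described: recall is computed at the eleven tIoU thresholds from 0.5 to 1.0 (step 0.05) and averaged, keeping at most AN proposals per video.

```python
# Illustrative AR@AN computation; proposals are (start, end) segments
# assumed to be ranked by confidence.

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(gt_segments, proposals, an):
    props = proposals[:an]  # keep the top-AN proposals
    thresholds = [0.5 + 0.05 * i for i in range(11)]  # 0.5, 0.55, ..., 1.0
    recalls = []
    for t in thresholds:
        hit = sum(1 for g in gt_segments
                  if any(tiou(g, p) >= t for p in props))
        recalls.append(hit / len(gt_segments))
    return sum(recalls) / len(recalls)

gt = [(2.0, 6.0), (10.0, 12.0)]
props = [(2.0, 6.0), (9.0, 12.5)]
print(average_recall(gt, props, an=100))
```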
In Fig. 4.2, we compare the AR@AN results of the action segments generated from the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated from the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, can lead to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates from our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.
[Figure 4.3: timeline of detected segments. Ground truth: 97.7s-104.2s; anchor-based method: 95.4s-104.6s; frame-based method: missed; PCG-TAL at Stage 1: 96.4s-104.6s; at Stage 2: 97.0s-104.5s; at Stage 3: 97.3s-104.3s]
Figure 4.3: Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and our
PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low prediction probabilities of being the starting or ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets: VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals, and the proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology
5.2.1 Overview
We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely we have V = {V^clip_k}_{k=1}^K, where V^clip_k indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^N, where s_n indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (i.e., the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes b_t for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature F^clip_k ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens [CLS] and [SEP] before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
Multi-modal Feature Learning. Given the visual input feature F^clip_k and the textual input tokens E, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules in the visual branch of ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature F^tv ∈ R^{T×HW×C} and the visual-guided textual feature, respectively.
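The pooling, combination, decomposition, replication, and add-&-norm steps of the STCD module can be sketched at the tensor-shape level. This is an assumption-laden illustration of ours, not the thesis implementation: the text-guided multi-head attention is stubbed out as an identity so that only the shape plumbing is shown.

```python
# Shape-level sketch of the STCD module's combine/decompose steps.
import numpy as np

def stcd(visual, attention=lambda x: x):
    """visual: (T, HW, C) clip feature. Returns a (T, HW, C) intermediate
    visual feature. `attention` stands in for the text-guided multi-head
    attention of the real module."""
    T, HW, C = visual.shape
    temporal = visual.mean(axis=1)                 # spatial pooling  -> (T, C)
    spatial = visual.mean(axis=0)                  # temporal pooling -> (HW, C)
    combined = np.concatenate([temporal, spatial], axis=0)  # (T+HW, C)
    attended = attention(combined)                 # text-guided in the model
    t_feat, s_feat = attended[:T], attended[T:]    # decomposition
    # Replicate the temporal feature HW times and the spatial feature T
    # times via broadcasting, then add both to the input visual feature.
    out = visual + t_feat[:, None, :] + s_feat[None, :, :]
    mu = out.mean(axis=-1, keepdims=True)          # layer normalization
    sigma = out.std(axis=-1, keepdims=True) + 1e-6
    return (out - mu) / sigma

f = stcd(np.random.rand(4, 6, 8))
print(f.shape)  # (4, 6, 8)
```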
Spatial Localization. Given the text-guided visual feature F^tv, our SVG-net predicts the bounding box b_t for each frame. We first reshape F^tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap Â ∈ R^{8H×8W}, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center, and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
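The center-and-size decoding described above can be illustrated in a few lines. This is our own sketch of a CenterNet-style decoding step, not the thesis code: pick the highest-probability center in the heatmap and turn the regressed width and height at that position into box corners.

```python
# Illustrative decoding of one bounding box from a center heatmap and a
# per-position (width, height) regression map.
import numpy as np

def decode_box(heatmap, wh):
    """heatmap: (8H, 8W) center probabilities; wh: (8H, 8W, 2) regressed
    (width, height) per position. Returns (x1, y1, x2, y2)."""
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = wh[cy, cx]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

hm = np.zeros((16, 16))
hm[5, 7] = 0.9                     # peak at row 5, column 7
wh = np.full((16, 16, 2), 4.0)     # a 4 x 4 box everywhere
print(decode_box(hm, wh))  # (5.0, 3.0, 9.0, 7.0)
```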
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending
frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F^tv to produce the global text-guided visual feature F^gtv ∈ R^{T×C} for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token [CLS]) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from F^gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score ô^obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold; namely, the bounding boxes with smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t^0_s, t^0_e). The initial temporal boundaries (t^0_s, t^0_e) and the predicted bounding boxes b_t form the initial tube prediction result B^init = {b_t}_{t=t^0_s}^{t^0_e}.
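The smoothing-and-thresholding procedure above can be sketched as follows. This is an illustration of ours under assumed hyper-parameters (a median-filter window of 5 frames and a threshold of 0.5, neither of which is specified in the text).

```python
# Illustrative selection of the initial temporal boundaries: median-filter
# the per-frame relevance scores, threshold them, and keep the longest
# surviving run of frames.
import statistics

def initial_boundaries(scores, window=5, thresh=0.5):
    half = window // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    smoothed = [statistics.median(padded[i:i + window])
                for i in range(len(scores))]
    keep = [s >= thresh for s in smoothed]
    best, start = None, None
    for i, k in enumerate(keep + [False]):   # sentinel closes the last run
        if k and start is None:
            start = i
        elif not k and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best  # (t_s, t_e) frame indices, or None if nothing survives

scores = [0.1, 0.2, 0.9, 0.8, 0.9, 0.9, 0.1, 0.7, 0.1, 0.1]
print(initial_boundaries(scores))  # (2, 6)
```

Note how the median filter suppresses the isolated high score at frame 7, so only the long central run survives the threshold.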
Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height, as well as the relevance label, of the ground-truth bounding box as (x̄, ȳ), w̄, h̄ and ō, respectively. When the frame contains a ground-truth bounding box, we have ō = 1; otherwise, we have ō = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap Ā for each frame by using the Gaussian kernel
ā_xy = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where ā_xy is the value of Ā ∈ R^{8H×8W} at the spatial location (x, y), and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:
L_SVG = λ1 L_c(Ā, Â) + λ2 L_rel(ô^obj, ō) + λ3 L_size(ŵ_xy, ĥ_xy, w̄, h̄)    (5.1)

where Â ∈ R^{8H×8W} is the predicted heatmap, ô^obj is the predicted relevance score, and ŵ_xy and ĥ_xy are the predicted width and height of the bounding box centered at the position (x̄, ȳ). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ1 = λ2 = 1 and λ3 = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
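The Gaussian ground-truth heatmap that L_c is trained against can be sketched directly from the kernel above. This is our own illustration; the bandwidth σ is a placeholder, since in the text it is adapted to the object size [28].

```python
# Illustrative ground-truth center heatmap: each position (x, y) gets
# exp(-((x - x_bar)^2 + (y - y_bar)^2) / (2 * sigma^2)).
import numpy as np

def center_heatmap(h, w, cx, cy, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

A_bar = center_heatmap(8, 8, cx=3, cy=5)
print(A_bar[5, 3])  # 1.0 exactly at the ground-truth center
```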
5.2.3 Temporal Boundary Refinement
The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because the overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.
As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., B^init) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
t̂_s and t̂_e. The frame-based branch then continues to refine the boundaries t̂_s and t̂_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., t_s and t_e) can be used as a better anchor for the anchor-based branch at the next stage. The iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.
Given an anchor with the temporal boundaries (t^0_s, t^0_e) and the length l = t^0_e − t^0_s, we temporally extend the anchor before and after the starting and the ending frames by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t^0_s − l/4, t^0_e + l/4] and generate the anchor feature F̄ by stacking the corresponding features along the temporal dimension from the pooled video feature F^pool at the 20 sampled frames, where the pooled video feature F^pool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.
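The extension-and-sampling step can be sketched as follows. This is a minimal illustration of ours: extend [t_s, t_e] by l/4 on both sides and sample 20 frame positions uniformly within the extended duration.

```python
# Illustrative anchor extension and uniform frame sampling.

def sample_anchor_frames(t_s, t_e, num=20):
    """Extend the anchor [t_s, t_e] by a quarter of its length on each
    side, then sample `num` uniformly spaced frame positions."""
    l = t_e - t_s
    lo, hi = t_s - l / 4.0, t_e + l / 4.0
    step = (hi - lo) / (num - 1)
    return [lo + i * step for i in range(num)]

frames = sample_anchor_frames(40, 80)
print(round(frames[0], 6), round(frames[-1], 6), len(frames))  # 30.0 90.0 20
```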
The anchor feature F̄, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t^0_s and t^0_e and generate the new starting and ending frame positions t̂^1_s and t̂^1_e.
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being the starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t_s^1 generated from the previous anchor-based branch, we define the starting duration as D_s = [t_s^1 - D/2 + 1, t_s^1 + D/2], where D is the length of the starting duration. We then construct the frame-level feature F_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F_pool. We then take the frame-level feature F_s and the textual input token E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t_s^1. Accordingly, we can produce the ending frame position t_e^1. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t_s^1 and t_e^1 are generated, they can be used as the input for the anchor-based branch at the next stage.
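The per-frame selection in the frame-based branch can be sketched as follows; here a simple linear map with weights (w, b) stands in for the text-guided ViLBERT feature followed by the kernel-size-1 1D convolution, so the function is an illustrative assumption rather than the actual network.

```python
import numpy as np

def refine_boundary(f_pool, t_ref, w, b, D=16):
    """Sketch of the frame-based refinement of one boundary.

    f_pool: (T, C) pooled video feature; t_ref: boundary position from
    the anchor-based branch; (w, b): weights of a per-frame linear
    scorer (a kernel-size-1 1D convolution is exactly such a map).
    D: length of the starting/ending duration (hypothetical value).
    """
    # Duration D_s = [t_ref - D/2 + 1, t_ref + D/2] around the boundary.
    frames = np.arange(t_ref - D // 2 + 1, t_ref + D // 2 + 1)
    frames = np.clip(frames, 0, len(f_pool) - 1)
    feats = f_pool[frames]            # frame-level feature, shape (D, C)
    scores = feats @ w + b            # per-frame score for this boundary
    # Select the frame with the highest score as the refined boundary.
    return int(frames[np.argmax(scores)])
```

Separate (w, b) pairs would be kept for the starting and the ending boundary, since the thesis states that the two 1D convolutions do not share parameters.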
Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

L_TBR = λ4 L_anc(∆t, ∆t*) + λ5 L_s(p_s, p_s*) + λ6 L_e(p_e, p_e*),   (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and the ending frames, respectively. In L_anc, ∆t and ∆t* are the predicted offsets and the ground-truth offsets, respectively. In L_s, p_s is the predicted probability of being the starting frame for the frames within the starting duration, while p_s* is the ground-truth label for starting frame position prediction. Similarly, we use p_e and p_e* for the loss L_e. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.
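The training objective above combines one offset-regression term with two boundary-classification terms. A minimal NumPy sketch, assuming a smooth-L1 loss as a stand-in for the exact regression loss of [14] and unit loss weights:

```python
import numpy as np

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for one prediction.
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

def tbr_loss(pred_off, gt_off, s_logits, s_idx, e_logits, e_idx,
             lambdas=(1.0, 1.0, 1.0)):
    """Sketch of L_TBR = l4*L_anc + l5*L_s + l6*L_e (weights set to 1
    as in the thesis; smooth-L1 is an assumed stand-in for L_anc)."""
    l4, l5, l6 = lambdas
    d = np.abs(pred_off - gt_off)
    l_anc = float(np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))
    return (l4 * l_anc
            + l5 * cross_entropy(s_logits, s_idx)    # starting frame
            + l6 * cross_entropy(e_logits, e_idx))   # ending frame
```

With perfect offsets and confident correct boundary predictions, the loss approaches zero, as expected for a sum of non-negative terms.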
5.3 Experiment
5.3.1 Experiment Setup
Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.
- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p and the temporal length of each clip K are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and S_U is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
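Under the definition above, vIoU averages the per-frame IoU r_t over the intersected frames S_I and normalizes by the union of frames S_U. A small sketch, assuming each tube is stored as a mapping from frame index to a box (x1, y1, x2, y2):

```python
def v_iou(det_tube, gt_tube):
    """Sketch of vIoU = (1/|S_U|) * sum over t in S_I of r_t."""
    def iou(a, b):
        # Overlap of two axis-aligned boxes (x1, y1, x2, y2).
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    s_i = det_tube.keys() & gt_tube.keys()   # intersected frames S_I
    s_u = det_tube.keys() | gt_tube.keys()   # union of frames S_U
    return sum(iou(det_tube[t], gt_tube[t]) for t in s_i) / len(s_u)
```

m_vIoU then averages v_iou over all testing samples, and vIoU@R counts the fraction of samples with v_iou greater than R.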
Baseline Methods
- STGRN [102]: This is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
- STGVT [76]: This is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods
We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed Temporal Boundary Refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we achieve more than 2% further improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study
In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.
Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.
It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

                              Declarative Sentence Grounding          Interrogative Sentence Grounding
Method                        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)     m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*                  19.75      25.77        14.60           18.32      21.10        12.83
Vanilla (ours)                21.49      27.69        15.77           20.22      23.17        13.82
SVG-net (ours)                22.87      28.31        16.31           21.57      24.09        14.33
SVG-net + TBR-net (ours)      25.21      30.99        18.21           23.67      26.26        16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                        m_vIoU   vIoU@0.3   vIoU@0.5
STGVT [76]*                   18.15    26.81      9.48
Vanilla (ours)                17.94    25.59      8.75
SVG-net (ours)                19.45    27.78      10.12
SVG-net + TBR-net (ours)      22.41    30.31      12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.
From the results, we have the following observations: (1) for the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages; (2) the performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes; (3) the results also show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3; (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                              Declarative Sentence Grounding        Interrogative Sentence Grounding
Methods              Stages   m_vIoU   vIoU@0.3   vIoU@0.5          m_vIoU   vIoU@0.3   vIoU@0.5
SVG-net              -        22.87    28.31      16.31             21.57    24.09      14.33
SVG-net + IFB-net    1        23.41    28.84      16.69             22.12    24.48      14.57
                     2        23.82    28.97      16.85             22.55    24.69      14.71
                     3        23.77    28.91      16.87             22.61    24.72      14.79
SVG-net + IAB-net    1        23.27    29.57      16.39             21.78    24.93      14.31
                     2        23.74    29.98      16.52             22.21    25.42      14.59
                     3        23.85    30.15      16.57             22.43    25.71      14.47
SVG-net + TBR-net    1        24.32    29.92      17.22             22.69    25.37      15.29
                     2        24.95    30.75      17.02             23.34    26.03      15.83
                     3        25.21    30.99      18.21             23.67    26.26      16.01
our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary
In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which produces spatio-temporal object tubes for a given query sentence and consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continually grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our works and discuss potential future directions for video localization.
6.1 Conclusions
The contributions of this thesis are summarized as follows:
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which produces spatio-temporal object tubes for a given query sentence and consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work
A potential future direction for video localization is to unify the spatial video localization and the temporal video localization into
one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.
Bibliography
[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.
[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT '98). Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.
[3] F. Caba Heilbron, V. Escorcia, B. Ghanem and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.
[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.
[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.
[7] Z. Chen, L. Ma, W. Luo and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.
[8] G. Chéron, A. Osokin, I. Laptev and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.
[9] X. Dai, B. Singh, G. Zhang, L. S. Davis and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.
[10] A. Dave, O. Russakovsky and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
[12] C. Feichtenhofer, A. Pinz and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.
[13] J. Gao, K. Chen and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
[14] J. Gao, C. Sun, Z. Yang and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.
[15] J. Gao, Z. Yang, K. Chen, C. Sun and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.
[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.
[17] R. Girshick, J. Donahue, T. Darrell and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[18] G. Gkioxari and J. Malik. "Finding action tubes". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
[20] K. He, X. Zhang, S. Ren and J. Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[21] R. Hou, C. Chen and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In: The IEEE International Conference on Computer Vision (ICCV). 2017.
[22] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger. "Densely connected convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.
[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid and M. J. Black. "Towards understanding action recognition". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.
[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.
[28] H Law and J Deng ldquoCornernet Detecting objects as paired key-pointsrdquo In Proceedings of the European Conference on Computer Vi-sion (ECCV) 2018 pp 734ndash750
[29] C Lea M D Flynn R Vidal A Reiter and G D Hager ldquoTem-poral convolutional networks for action segmentation and detec-tionrdquo In proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2017 pp 156ndash165
[30] J Lei L Yu T L Berg and M Bansal ldquoTvqa+ Spatio-temporalgrounding for video question answeringrdquo In arXiv preprint arXiv190411574(2019)
[31] D Li Z Qiu Q Dai T Yao and T Mei ldquoRecurrent tubelet pro-posal and recognition networks for action detectionrdquo In Proceed-ings of the European conference on computer vision (ECCV) 2018pp 303ndash318
[32] G Li N Duan Y Fang M Gong D Jiang and M Zhou ldquoUnicoder-VL A Universal Encoder for Vision and Language by Cross-ModalPre-Trainingrdquo In AAAI 2020 pp 11336ndash11344
[33] L H Li M Yatskar D Yin C-J Hsieh and K-W Chang ldquoVisu-alBERT A Simple and Performant Baseline for Vision and Lan-guagerdquo In Arxiv 2019
[34] Y Liao S Liu G Li F Wang Y Chen C Qian and B Li ldquoA Real-Time Cross-modality Correlation Filtering Method for ReferringExpression Comprehensionrdquo In Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognition 2020 pp 10880ndash10889
[35] T Lin X Liu X Li E Ding and S Wen ldquoBMN Boundary-MatchingNetwork for Temporal Action Proposal Generationrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 3889ndash3898
[36] T Lin X Zhao and Z Shou ldquoSingle shot temporal action de-tectionrdquo In Proceedings of the 25th ACM international conference onMultimedia 2017 pp 988ndash996
[37] T Lin X Zhao H Su C Wang and M Yang ldquoBsn Boundarysensitive network for temporal action proposal generationrdquo In
Bibliography 111
Proceedings of the European Conference on Computer Vision (ECCV)2018 pp 3ndash19
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference 2016 (BMVC), York, UK, September 19-22, 2016. 2016.
[61] P. Sharma, N. Ding, S. Goodman and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In: International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar and C. Schmid. "Actor-centric Relation Network". In: The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi and W. Ouyang. "FishNet: A versatile backbone for image, region and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. “Going deeper with convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. “LXMERT: Learning Cross-Modality Encoder Representations from Transformers”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. “Human-centric Spatio-Temporal Video Grounding With Visual Transformers”. In: arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. “Learning spatiotemporal features with 3D convolutional networks”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. “Attention is all you need”. In: Advances in neural information processing systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. “Actionness Estimation Using Hybrid Fully Convolutional Networks”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. “UntrimmedNets for weakly supervised action recognition and detection”. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. “Temporal segment networks: Towards good practices for deep action recognition”. In: European conference on computer vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. van den Hengel. “Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. “Learning to track for spatio-temporal action localization”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das, and K. Saenko. “R-C3D: Region convolutional 3D network for temporal activity detection”. In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. “Spatio-temporal person retrieval via natural language queries”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li, and Y. Yu. “Cross-modal relationship inference for grounding referring expressions”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li, and Y. Yu. “Dynamic graph attention for referring expression comprehension”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. “STEP: Spatio-Temporal Progressive Learning for Video Action Detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. “Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts”. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang, and J. Luo. “Improving One-stage Visual Grounding by Recursive Sub-query Construction”. In: arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. “A fast and accurate one-stage approach to visual grounding”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao, and R. Nevatia. “Spatio-temporal action detection with cascade proposal and location anticipation”. In: arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang, and Y. Tian. “Discovering spatio-temporal action tubes”. In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. “MAttNet: Modular attention network for referring expression comprehension”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. “Temporal action localization with pyramid of score distribution features”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. “Temporal action localization by structured maximal sums”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. “Graph Convolutional Networks for Temporal Action Localization”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. “Collaborative and Adversarial Network for Unsupervised domain adaptation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. “Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding”. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. “Where Does It Exist? Spatio-Temporal Video Grounding for Multi-Form Sentences”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. “Temporal action detection with structured segment networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang, and P. Krähenbühl. “Objects as Points”. In: arXiv preprint arXiv:1904.07850 (2019).
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. “Chained multi-stream networks exploiting pose, motion and appearance for action classification and detection”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
Deep Learning for Spatial and Temporal Video Localization
by
Rui SU
B.E., South China University of Technology; M.Phil., University of Queensland
A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of
Doctor of Philosophy
at
University of Sydney
June 2021
COPYRIGHT © 2021 BY RUI SU
ALL RIGHTS RESERVED
Declaration
I, Rui SU, declare that this thesis titled “Deep Learning for Spatial and Temporal Video Localization”, which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.
Signed
Date June 17 2021
For Mama and Papa
Acknowledgements
First, I would like to thank my supervisor Prof Dong Xu for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which have helped me to become an independent researcher.
I would also like to specially thank Dr Wanli Ouyang and Dr Luping Zhou, who offered me valuable suggestions and inspiration in our collaborated works. Through the useful discussions with them, I worked out many difficult problems and was also encouraged to explore unknowns in the future.
Last but not least I sincerely thank my parents and my wife fromthe bottom of my heart for everything
Rui SU
University of Sydney
June 17 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. “Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. “PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization”. IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. “Improving action localization by progressive cross-stream cooperation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv
1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11
2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102
6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104
Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation Module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions at the spatial domain and the temporal domain, respectively. 43
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance “Tennis Swing”. In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect the bounding boxes with the class label “Tennis Swing” for (b) and “Soccer Juggling” for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect the action bounding boxes with the class label “Basketball”, and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to “Detection Head” to generate the updated segments. 66
4.2 AR@AN (%) for the action segments generated from various methods at different number of stages on the THUMOS14 dataset. 80
4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different number of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second to the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class “Throw Discus”, with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the “Multi-head attention” and “Add & Norm” blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5. 48
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here “AD” is for actionness detector and “ASD” is for action segment detector. 49
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5. 50
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds. 51
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset. 52
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here “AD” is for the actionness detector, while “ASD” is for the action segment detector. 52
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset. 52
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset. 54
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset. 55
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset. 55
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different number of stages. 55
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different number of stages. 55
4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds. 74
4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used. 75
4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied. 75
4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds. 76
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset. 77
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset. 78
5.1 Results of different methods on the VidSTG dataset. “*” indicates the results are quoted from its original work. 99
5.2 Results of different methods on the HC-STVG dataset. “*” indicates the results are quoted from the original work. 100
5.3 Results of our method and its variants when using different number of stages on the VidSTG dataset. 101
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of the videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
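To make the output formats of these tasks concrete, the sketch below encodes an action tube and a temporal segment as simple Python data structures. The class and field names are our own illustrative choices, not part of any released codebase; a spatio-temporal grounding result can reuse the same tube structure with the queried object as its label.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class ActionTube:
    """Spatio-temporal localization output: a labelled sequence of
    per-frame bounding boxes linked across time."""
    label: str
    boxes: List[Tuple[int, BBox]]  # (frame index, bounding box)

@dataclass
class TemporalSegment:
    """Temporal localization output: only the start/end times."""
    label: str
    start: float  # starting time in seconds
    end: float    # ending time in seconds
```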
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatial-temporal Action Localization
Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, the two streams can help each other by providing region proposals to each other.
For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture human actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segment them out from the entire video clip. Data association methods based on the spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
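As a concrete, deliberately simplified illustration of such overlap-plus-score data association, the sketch below greedily extends each tube with the detection that maximizes a linking score combining spatial IoU and class confidence. This is our own toy version, not the exact linking procedure used in the cited works.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def link_boxes_into_tubes(frame_detections, score_weight=1.0):
    """Greedily link per-frame detections into tubes.
    frame_detections: list over frames; each entry is a list of (box, score)."""
    tubes = [[det] for det in frame_detections[0]]
    for dets in frame_detections[1:]:
        used = set()
        for tube in tubes:
            last_box, _ = tube[-1]
            best, best_s = None, -1.0
            for j, (box, score) in enumerate(dets):
                if j in used:
                    continue
                # linking score: spatial overlap plus weighted class confidence
                s = iou(last_box, box) + score_weight * score
                if s > best_s:
                    best, best_s = j, s
            if best is not None:
                tube.append(dets[best])
                used.add(best)
        # unmatched detections start new tubes
        tubes += [[det] for j, det in enumerate(dets) if j not in used]
    return tubes
```

A full system would additionally smooth the per-tube class scores and trim low-actionness frames to obtain the temporal extent, which is exactly the step the proposed temporal boundary refinement targets.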
Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
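The stage-wise alternation described above can be sketched as follows. This toy Python version illustrates only the proposal-level cooperation (taking the union of the latest proposals from both streams before refining one of them); the detection head and the feature-level message passing are abstracted into a hypothetical `refine` callback, so it is a sketch of the control flow rather than the actual implementation.

```python
def pcsc(rgb_proposals, flow_proposals, refine, num_stages=4):
    """Toy sketch of Progressive Cross-stream Cooperation (PCSC).

    At each stage, one stream (alternating RGB/flow) is refined using
    the union of the latest proposals from both streams.
    `refine(stream_name, proposals)` is a hypothetical stand-in for the
    per-stream detection head (plus feature-level message passing).
    """
    latest = {"rgb": set(rgb_proposals), "flow": set(flow_proposals)}
    for stage in range(num_stages):
        helped = "rgb" if stage % 2 == 0 else "flow"
        merged = latest["rgb"] | latest["flow"]      # proposal-level cooperation
        latest[helped] = set(refine(helped, merged))  # update the helped stream
    return latest
```

With an identity `refine`, both streams converge to the union of the initial proposal sets after two stages, which mirrors how each stream progressively benefits from the other's proposals.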
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on “confusing” samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
Our contributions are briefly summarized as follows
bull For spatial localization we propose the Progressive Cross-streamCooperation (PCSC) framework to iteratively use both features andregion proposals from one stream to help learn better action de-tection models for another stream which includes a new messagepassing approach and a simple region proposal combination strat-egy
• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to the object detection methods like [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which at first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate the proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.
[Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods. The figure also shows qualitative comparisons of predicted segments (P-GCN vs. ours) against the ground truth for several action instances (e.g., Frisbee Catch, Golf Swing, Throw Discus).]
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produce coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered as complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
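The frame-based scheme described above can be sketched as follows: per-frame actionness scores are thresholded and runs of consecutive above-threshold frames become segment proposals (a minimal sketch; the threshold value and the function name are illustrative assumptions):

```python
def actionness_to_proposals(scores, threshold=0.5):
    """Group consecutive frames whose actionness score exceeds the
    threshold into (start, end) segment proposals (end is inclusive)."""
    proposals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                       # a segment opens here
        elif s < threshold and start is not None:
            proposals.append((start, i - 1))  # the segment just closed
            start = None
    if start is not None:                    # segment runs to the last frame
        proposals.append((start, len(scores) - 1))
    return proposals

scores = [0.1, 0.7, 0.9, 0.8, 0.2, 0.6, 0.6, 0.1]
print(actionness_to_proposals(scores))  # [(1, 3), (5, 6)]
```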
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy will improve the frame-based features and filter out false starting/ending point pairs with invalid durations, and thus enhance the frame-based methods to generate better proposals having higher overlapping areas with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy will collect complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, it is suboptimal to trivially repeat the aforementioned cross-granularity cooperation process, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with the cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
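The proposal-level cooperation can be illustrated by pooling scored segments from both granularities and suppressing near-duplicates with temporal non-maximum suppression (a hedged sketch with hand-written scores; the actual framework uses learned scores and a more elaborate combination strategy):

```python
def t_iou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def combine_proposals(anchor_props, frame_props, iou_thr=0.7):
    """Merge scored (start, end, score) proposals from both granularities,
    then remove near-duplicates with greedy temporal NMS."""
    pool = sorted(anchor_props + frame_props, key=lambda p: -p[2])
    kept = []
    for p in pool:
        if all(t_iou(p[:2], k[:2]) < iou_thr for k in kept):
            kept.append(p)
    return kept

merged = combine_proposals([(0.0, 10.0, 0.9)],
                           [(0.5, 9.5, 0.8), (20.0, 25.0, 0.7)])
print(merged)  # [(0.0, 10.0, 0.9), (20.0, 25.0, 0.7)]
```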
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and to also update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme and the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons often perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, in which significant progress has been made in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate, because these approaches regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produce more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of these two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of a video clip and a textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved for various vision-language tasks [47, 70, 33].
Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame-based (resp. anchor-based) branch as the input for the anchor-based (resp. frame-based) branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at the time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work stimulated research interest in using deep learning methods to solve computer vision problems.
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with smaller kernel sizes. For example, a stack of two 3×3 convolutional layers can achieve the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
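The receptive-field equivalence and the parameter saving described above can be checked with a short calculation (a sketch assuming stride-1 convolutions with equal input and output channel counts, biases ignored):

```python
def stacked_rf(kernel_sizes):
    """Effective receptive field of a stack of stride-1 convolutions:
    rf = 1 + sum(k - 1) over the stacked kernel sizes."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_params(kernel_sizes, channels):
    """Number of weights in the stack, assuming `channels` input and
    output channels for every layer and ignoring biases."""
    return sum(k * k * channels * channels for k in kernel_sizes)

print(stacked_rf([3, 3]))     # 5, same as a single 5x5 layer
print(stacked_rf([3, 3, 3]))  # 7, same as a single 7x7 layer
```

With 64 channels, three 3×3 layers use 3·9·64·64 = 110,592 weights, versus 49·64·64 = 200,704 for one 7×7 layer, matching the parameter reduction claimed for VGG-Net.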
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with smaller kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly, because the gradients can no longer be reliably propagated through very deep networks and training becomes unstable. In order to address this degradation issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
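The shortcut connection can be sketched as a residual block, where the layer output is added to its input (a minimal sketch with toy vectors; the `transform` argument stands in for the stacked convolutional layers of an actual residual block):

```python
def residual_block(x, transform):
    """y = x + F(x): the identity shortcut gives gradients a direct path
    back to shallow layers, which keeps very deep stacks trainable."""
    return [xi + fi for xi, fi in zip(x, transform(x))]

# Toy transform standing in for a pair of convolutional layers.
halve = lambda v: [0.5 * xi for xi in v]
y = residual_block([2.0, 4.0], halve)  # [3.0, 6.0]
```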
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.
2.1.2 Object Detection
The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by using deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were the candidates which potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head, to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.
To make the convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image into the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time were largely reduced.
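The ROI pooling operation can be sketched as follows for a single-channel 2D feature map (a simplified sketch: Fast RCNN pools each channel independently, and the helper below assumes the ROI is given in feature-map cell coordinates with the ROI at least as large as the output grid):

```python
def roi_pool(feature_map, roi, out_size=2):
    """Max-pool the rectangular region `roi` = (x0, y0, x1, y1) of a 2D
    feature map (inclusive coordinates) into an out_size x out_size grid."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0 + 1, x1 - x0 + 1
    pooled = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            # Cell ranges of the current pooling bin inside the ROI.
            ys = range(y0 + (by * h) // out_size, y0 + ((by + 1) * h) // out_size)
            xs = range(x0 + (bx * w) // out_size, x0 + ((bx + 1) * w) // out_size)
            row.append(max(feature_map[y][x] for y in ys for x in xs))
        pooled.append(row)
    return pooled

fm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(roi_pool(fm, (0, 0, 3, 3)))  # [[6, 8], [14, 16]]
```

Whatever the proposal size, the output is always `out_size` × `out_size`, which is what lets a fully connected detection head follow the pooling layer.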
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of the object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors at each location of the generated feature map was used to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the pooled features based on the ROI pooling operation on the generated feature map.
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S×S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require region feature extraction through the time-consuming ROI pooling operation.
Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
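The decoding step can be sketched as follows: keep heat-map locations that are local peaks above a score threshold, and read off the predicted box size at each kept location (a simplified sketch; CenterNet additionally predicts sub-pixel center offsets, which are omitted here):

```python
def decode_centers(heatmap, sizes, score_thr=0.5):
    """Turn a center-point heatmap plus per-point (w, h) predictions into
    boxes (cx, cy, w, h, score), keeping 3x3 local maxima above the
    threshold."""
    H, W = len(heatmap), len(heatmap[0])
    boxes = []
    for y in range(H):
        for x in range(W):
            s = heatmap[y][x]
            if s < score_thr:
                continue
            neighbours = [heatmap[ny][nx]
                          for ny in range(max(0, y - 1), min(H, y + 2))
                          for nx in range(max(0, x - 1), min(W, x + 2))]
            if s == max(neighbours):  # keep only local peaks
                w, h = sizes[y][x]
                boxes.append((x, y, w, h, s))
    return boxes

heat = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.2], [0.1, 0.2, 0.1]]
sizes = [[(2, 2)] * 3 for _ in range(3)]
print(decode_centers(heat, sizes))  # [(1, 1, 2, 2, 0.9)]
```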
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks can further improve the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights on how to effectively combine information from different complementary modalities.
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as the input to the two-stream network, and then a consensus was applied to the output of each stream. Finally, the outputs from the two streams were combined to consider the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization can help prevent the framework from overfitting the training samples.
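The segment sampling scheme can be sketched as follows, using the middle frame of each segment as a deterministic stand-in for the random per-segment sampling used at training time (function name and variant are illustrative):

```python
def sample_segment_frames(num_frames, num_segments):
    """Split a video of num_frames into num_segments equal segments and
    take the middle frame of each, as in Temporal Segment Networks
    (deterministic variant of the random per-segment sampling)."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

print(sample_segment_frames(90, 3))  # [15, 45, 75]
```

One frame per segment covers the whole video span while avoiding the redundancy of processing every consecutive frame.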
2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics, and demonstrated that this pre-training can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information on human body parts can be incorporated. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.
As computing optical flow from consecutive RGB images is very time-consuming, it is hard to achieve real-time performance for the two-stream based spatio-temporal action localization methods. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60] in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed the ACT-detector to exploit the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes forming a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form a representation of the whole segment. This representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than action tubes linked from frame-level detections, because they were produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors over iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to predict the action class of each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating this pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a C3D-based classification network to predict labels. They then used the trained classification model to initialize another C3D model as a localization model, which took the proposals and their overlap scores as inputs, and proposed a new loss function to reduce the classification error.
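The multi-scale sliding-window scheme used by these early works can be sketched in a few lines. The window sizes and overlap ratio below are illustrative defaults rather than the settings used in [97] or [63], and the function name is ours:

```python
def sliding_window_proposals(num_frames, window_sizes=(16, 32, 64), overlap=0.5):
    """Generate multi-scale temporal segment proposals (start, end) by
    sliding windows of several lengths over the video with a fixed
    overlap ratio between consecutive windows."""
    proposals = []
    for w in window_sizes:
        stride = max(1, int(w * (1.0 - overlap)))
        for start in range(0, max(num_frames - w, 0) + 1, stride):
            proposals.append((start, start + w))
    return proposals

# A 128-frame video yields candidate segments at three temporal scales.
props = sliding_window_proposals(128)
```

Each proposal would then be scored by a classifier (e.g., a C3D-based one), which is why the recall/efficiency trade-off is governed by the chosen window sizes and stride.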
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features from the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting and ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolution layers were used to reduce the spatial and temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce the resolution, while deconvolutional layers were employed in the temporal domain to restore it. As a result, the output of CDC has the same length as the input in the temporal domain, so the proposed framework was able to model spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced an effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict a binary actionness score for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the temporal action localization performance over other state-of-the-art methods.
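The snippet-grouping step can be illustrated with a minimal sketch. The single fixed threshold and simple run-grouping rule below are a simplification; the actual method additionally uses tolerance and multi-threshold schemes:

```python
def group_actionness(scores, threshold=0.5):
    """Group consecutive snippets whose actionness score is at least
    `threshold` into temporal proposals, returned as (start, end)
    snippet indices with `end` exclusive."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                     # a high-actionness run begins
        elif s < threshold and start is not None:
            proposals.append((start, t))  # the run ends before snippet t
            start = None
    if start is not None:                 # run extends to the last snippet
        proposals.append((start, len(scores)))
    return proposals
```

For example, the score sequence [0.1, 0.8, 0.9, 0.2, 0.7, 0.6] yields two proposals, one per contiguous high-scoring run.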
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict wrongly low actionness scores for frames within action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generated a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers were then used to refine the locations of the anchors and predict their action classes.
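The anchor-decoding step performed by such boundary regressors can be sketched as follows. The (center shift, log-length) offset parameterization is the common convention for 1-D anchors, not necessarily the exact one used in [9, 15, 85, 5], and the function name is ours:

```python
import math

def decode_temporal_anchors(centers, lengths, offsets):
    """Apply predicted (dc, dl) offsets to 1-D temporal anchors given
    as (center, length) pairs and return refined (start, end) segments."""
    decoded = []
    for c, l, (dc, dl) in zip(centers, lengths, offsets):
        new_c = c + dc * l          # shift the center, scaled by anchor length
        new_l = l * math.exp(dl)    # rescale the length in log space
        decoded.append((new_c - new_l / 2.0, new_c + new_l / 2.0))
    return decoded
```

A zero offset leaves the anchor unchanged, e.g. an anchor centered at frame 8 with length 16 decodes to the segment (0, 16).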
Lin et al. [36] introduced a single-shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detector (SSD) [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method achieved results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN achieved a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster R-CNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster R-CNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that extreme variations in action duration could be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated from pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision-and-language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs, which reason about the spatial relationships within each frame, and a dynamic temporal subgraph, which exploits the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, from which spatio-temporal tubes were generated by a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationship between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector, in their case to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames and generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One drawback of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for these detectors to generate proposals for such categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage their complementary information are also discussed. Finally, we give a review of the new vision-and-language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks, and correspondingly design effective algorithms.
Specifically, in Chapter 3 we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream, two-stage object detection framework and exchange complementary information between the two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train the other stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples, which helps learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message-passing approach to pass information from one stream to another, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level, and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatial-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, including the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot MultiBox Detector (SSD) in [67, 26].
Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames around one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent the key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatial-temporal action localization.
Third, temporal localization aims to form action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, using cues such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late-fusion strategy: they are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late-fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
An overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the set of region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
[Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve the action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules, which pass messages from one stream to another: the flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).]
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.
In this chapter, the bounding boxes produced by the RPN are called region proposals, while the bounding boxes produced by the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset comes from the detection boxes of the same stream (e.g., RGB) at the previous stage. The second subset comes from the detection boxes of the other stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from the other stream (e.g., flow) are used for the current stream (e.g., RGB).
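The cross-stream combination described above can be sketched as follows. The two NMS thresholds are illustrative placeholders rather than the values used in our experiments, and the function names are ours:

```python
import numpy as np

def nms(boxes, scores, thresh):
    """Standard greedy NMS on (x1, y1, x2, y2) boxes; returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the top-scoring box and the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= thresh]
    return keep

def combine_proposals(own_boxes, own_scores, cross_boxes, cross_scores,
                      own_thresh=0.7, cross_thresh=0.3):
    """Refine one stream's proposal set: keep its own boxes under a
    standard NMS threshold, and add the other stream's detection boxes
    filtered with a lower (stricter) NMS threshold."""
    kept_own = [own_boxes[i] for i in nms(own_boxes, own_scores, own_thresh)]
    kept_cross = [cross_boxes[i] for i in nms(cross_boxes, cross_scores, cross_thresh)]
    return np.array(kept_own + kept_cross)
```

Using a lower threshold for the cross-stream boxes discards more of their near-duplicates, so only the most distinctive complementary boxes are added.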
Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals produced by the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
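The alternating update can be sketched as a simple loop. The toy detection heads below stand in for the learned mapping G(.), the odd/even stage convention follows Fig. 3.1 (flow helps RGB at Stages 1 and 3, RGB helps flow at Stages 2 and 4), and boxes are represented as hashable tuples so the union is a plain set union:

```python
def pcsc_stages(p_rgb, p_flow, head_rgb, head_flow, num_stages=4):
    """Alternate region-proposal cooperation between the two streams:
    at odd stages the flow detections enlarge the RGB proposal set,
    at even stages the RGB detections enlarge the flow proposal set,
    and the detection head G(.) is re-applied to the refined set."""
    b_rgb, b_flow = head_rgb(p_rgb), head_flow(p_flow)  # initial detections
    for t in range(1, num_stages + 1):
        if t % 2 == 1:  # flow helps RGB (Stages 1 and 3)
            p_rgb = sorted(set(b_rgb) | set(b_flow))    # union of box sets
            b_rgb = head_rgb(p_rgb)
        else:           # RGB helps flow (Stages 2 and 4)
            p_flow = sorted(set(b_flow) | set(b_rgb))
            b_flow = head_flow(p_flow)
    return b_rgb, b_flow
```

In the real framework each head also rescores and regresses its boxes; here the point is only the proposal-set bookkeeping across stages.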
With our approach, the diversity of region proposals for one stream is enhanced by using the complementary boxes from the other stream, which helps reduce the number of missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-stream feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. First, as a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to the selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Second, in co-training [2], the learnt classifiers are directly used for predicting the labels of testing data in the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled via the top-down pathway and merged with the corresponding features from the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2$, $C^i_3$, $C^i_4$ and $C^i_5$, where $i \in \{RGB, Flow\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2$, $U^i_3$, $U^i_4$ and $U^i_5$.
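As a self-contained illustration of the top-down pathway with lateral connections, the sketch below uses nearest-neighbour upsampling and randomly initialized 1x1 lateral convolutions. Real FPNs learn these weights and typically add a 3x3 smoothing convolution, both omitted here; the function name is ours:

```python
import numpy as np

def fpn_top_down(c_maps, out_channels=8, seed=0):
    """Build top-down feature maps U_l from bottom-up maps C_l (finest
    first, coarsest last): upsample the coarser top-down map by 2x and
    add a 1x1-conv lateral projection of the corresponding C_l."""
    rng = np.random.default_rng(seed)
    u, outputs = None, []
    for c in reversed(c_maps):                    # from C5 down to C2
        w = rng.standard_normal((out_channels, c.shape[0])) * 0.1
        lateral = np.einsum('oc,chw->ohw', w, c)  # 1x1 lateral convolution
        if u is not None:
            # nearest-neighbour 2x upsampling of the coarser map, then merge
            u = lateral + u.repeat(2, axis=1).repeat(2, axis=2)
        else:
            u = lateral                           # coarsest level: no upsampling
        outputs.append(u)
    return outputs[::-1]                          # [U2, U3, U4, U5]
```

Each top-down map thus carries high-level semantics from the coarser levels while keeping the resolution of its bottom-up counterpart.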
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information with, and benefit from, each other. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other for feature refinement.
We pass the messages between the feature maps in the bottom-up pathway of the two streams. Denote by l the index for the set of feature maps {C^i_2, C^i_3, C^i_4, C^i_5}, l ∈ {2, 3, 4, 5}. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features C^RGB_l with the help of the features C^Flow_l as follows:

    C^RGB_l = f_θ(C^Flow_l) ⊕ C^RGB_l,    (3.1)

where ⊕ denotes the element-wise addition of the feature maps and f_θ(·) is the mapping function (parameterized by θ) of our message-passing module. The function f_θ(C^Flow_l) nonlinearly extracts the message from the feature C^Flow_l and then uses the extracted message to improve the features C^RGB_l.

The output of f_θ(·) must produce feature maps with the same number of channels and resolution as C^RGB_l and C^Flow_l. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function. The first 1×1 convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps C^RGB_l in the bottom-up pathway are refined, the corresponding feature maps U^RGB_l in the top-down pathway are generated accordingly.
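The message-passing step of Eqn. (3.1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: the array shapes, weight names and the expression of a 1×1 convolution as a per-location channel-wise linear map are all assumptions made for the example.

```python
import numpy as np

def conv1x1(x, w, b):
    # x: (C_in, H, W); w: (C_out, C_in); b: (C_out,).
    # A 1x1 convolution is a linear map over channels at every spatial location.
    return np.einsum('oc,chw->ohw', w, x) + b[:, None, None]

def message_passing(c_flow, w1, b1, w2, b2, c_rgb):
    """Sketch of Eqn (3.1): C_rgb <- f_theta(C_flow) (+) C_rgb, where f_theta
    is two stacked 1x1 convolutions with a ReLU in between."""
    m = np.maximum(conv1x1(c_flow, w1, b1), 0.0)  # reduce channels, then ReLU
    m = conv1x1(m, w2, b2)                        # restore the channel dimension
    return m + c_rgb                              # element-wise addition
```

With the second convolution's weights set to zero, the extracted message vanishes and the RGB features pass through unchanged, which is a convenient sanity check for the residual form of Eqn. (3.1).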
The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the subsequent message-passing stages.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Denote the image-level feature map sets for the RGB and flow streams by I^RGB and I^Flow, respectively. They are used to extract the features F^RGB_t and F^Flow_t by ROI pooling at each stage t, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI-feature F^Flow_t of the flow stream is used to help improve the ROI-feature F^RGB_t of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature F^RGB_t is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI-feature F^RGB_t of the RGB stream is used to help improve the ROI-feature F^Flow_t of the flow stream, as illustrated in Fig. 3.1 (c). The message passing between ROI-features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] and stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample used to train the RPN in our PCSC method is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, and negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region proposal-level cooperation process.
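The IoU-based label assignment above can be sketched as follows. This is an illustrative NumPy version under assumed conventions: boxes are [x1, y1, x2, y2] arrays, and a proposal whose best IoU is exactly 0.5 is treated as negative (the text leaves that boundary case open).

```python
import numpy as np

def iou(box, gts):
    # box: (4,) as [x1, y1, x2, y2]; gts: (N, 4) ground-truth boxes.
    x1 = np.maximum(box[0], gts[:, 0]); y1 = np.maximum(box[1], gts[:, 1])
    x2 = np.minimum(box[2], gts[:, 2]); y2 = np.minimum(box[3], gts[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_b + area_g - inter)

def assign_labels(proposals, gts, thresh=0.5):
    """Label a proposal 1 if its IoU with any ground-truth box exceeds
    thresh, and 0 otherwise (the RPN training rule described above)."""
    return np.array([1 if iou(p, gts).max() > thresh else 0 for p in proposals])
```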
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
Figure 3.2: Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal-level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature-level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: the Pre-processing module, the Initial Segment Generation module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) module and the Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7×7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. It uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level, in order to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundary of the action tubes. These "confusing samples" are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of the actionness belonging to class i.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). A median filter over multiple frames is then employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from the action tube, so that the refined action tubes have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
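The test-time refinement above (median smoothing followed by thresholding) can be sketched as a small NumPy routine. This is an illustrative sketch, not the thesis code; the window and threshold defaults follow the implementation details given later in this chapter (window 17, threshold 0.01), and edge-padding at the tube boundaries is an assumption.

```python
import numpy as np

def refine_tube(scores, window=17, thresh=0.01):
    """Smooth per-frame actionness scores with a median filter, then keep
    only the frames whose smoothed score reaches the threshold."""
    half = window // 2
    padded = np.pad(scores, half, mode='edge')  # replicate scores at the ends
    smoothed = np.array([np.median(padded[i:i + window])
                         for i in range(len(scores))])
    keep = smoothed >= thresh
    return keep, smoothed
```

Frames filtered out at the head and tail of a tube shift its start and end points, which is exactly how the refined temporal boundaries arise.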
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the Faster-RCNN framework [5], which includes a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.
(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
(b) Segment feature extraction. After generating the segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length l_s, start point t_start and end point t_end, we extract three
types of features: F_center, F_start and F_end. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features F_center ∈ R^(D×l_t), where l_t is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features F_start (resp. F_end) from the segments centered at t_start (resp. t_end) with a length of 2l_s/5, to represent the starting (resp. ending) information. We then concatenate F_start, F_center and F_end to construct the segment features F_seg.
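The construction of F_seg can be sketched as below. This is a simplified NumPy illustration under assumptions not fixed by the text: temporal ROI-Pooling is implemented here as max-pooling into equal bins, and spans are clipped to the feature length.

```python
import numpy as np

def temporal_roi_pool(feats, t_start, t_end, out_len):
    # feats: (D, T). Max-pool the span [t_start, t_end) into out_len bins.
    D, T = feats.shape
    edges = np.linspace(max(t_start, 0), min(t_end, T), out_len + 1)
    pooled = np.empty((D, out_len))
    for j in range(out_len):
        lo = int(np.floor(edges[j]))
        hi = max(int(np.ceil(edges[j + 1])), lo + 1)  # keep each bin non-empty
        pooled[:, j] = feats[:, lo:hi].max(axis=1)
    return pooled

def segment_features(feats, t_start, t_end, out_len):
    """Concatenate F_start, F_center and F_end: the proposal span plus
    boundary spans of length 2*l_s/5 centered at each endpoint."""
    ls = t_end - t_start
    w = 2 * ls / 5
    f_center = temporal_roi_pool(feats, t_start, t_end, out_len)
    f_start = temporal_roi_pool(feats, t_start - w / 2, t_start + w / 2, out_len)
    f_end = temporal_roi_pool(feats, t_end - w / 2, t_end + w / 2, out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)
```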
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that predicting actionness for each frame helps extract discriminative features, which eventually helps refine the boundaries of the segments.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detection method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
3.4 Experimental results
We introduce our experimental setup and datasets in Section 3.4.1, compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with the spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets, and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on the three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
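The COCO-style aggregation is simply a mean over ten IoU thresholds. In the sketch below, `map_at_iou` is a hypothetical callable that returns the mAP at one threshold; how that per-threshold mAP is computed is outside this snippet.

```python
import numpy as np

def coco_style_map(map_at_iou):
    """Average the mAP over the 10 IoU thresholds 0.5, 0.55, ..., 0.95,
    as in the COCO evaluation metric."""
    thresholds = np.linspace(0.5, 0.95, 10)
    return float(np.mean([map_at_iou(t) for t in thresholds]))
```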
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and for each class we average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these per-class scores over all classes.
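The MABO computation described above can be sketched as follows. The input layout is an assumption made for the example: for each class, a list of arrays, one array per ground-truth box, holding the IoUs of all detected boxes against that ground truth.

```python
import numpy as np

def mabo(per_class_ious):
    """Per class, average the best overlap per ground-truth box (ABO);
    MABO is the mean of the per-class ABO values."""
    abos = []
    for arrays in per_class_ious.values():
        abos.append(np.mean([np.max(a) for a in arrays]))
    return float(np.mean(abos))
```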
To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is decreased by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of the action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as [20, 80], [80, 160], [160, 320] and [320, ∞), i.e., we have M = 4, and the corresponding feature lengths l_t for these length ranges are set to 15, 30, 60 and 120, respectively. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages, with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.
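Routing a segment proposal to one of the M = 4 detection-head branches by its length can be sketched directly from the ranges and feature lengths listed above. The half-open boundary convention and the fallback for lengths below 20 are assumptions; the text does not specify them.

```python
def branch_for_length(length,
                      ranges=((20, 80), (80, 160), (160, 320), (320, float('inf'))),
                      feat_lens=(15, 30, 60, 120)):
    """Return (branch index m, pooled feature length l_t) for a segment
    proposal of the given temporal length."""
    for m, (lo, hi) in enumerate(ranges):
        if lo <= length < hi:
            return m, feat_lens[m]
    return 0, feat_lens[0]  # assumed: lengths below 20 use the first branch
```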
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ = 0.5.

Methods | Streams | mAPs
Weinzaepfel, Harchaoui and Schmid [83] | RGB+Flow | 35.8
Peng and Schmid [54] | RGB+Flow | 65.7
Ye, Yang and Tian [94] | RGB+Flow | 67.0
Kalogeiton et al. [26] | RGB+Flow | 69.5
Gu et al. [19] | RGB+Flow | 76.3
Faster R-CNN + FPN (RGB) [38] | RGB | 73.2
Faster R-CNN + FPN (Flow) [38] | Flow | 64.7
Faster R-CNN + FPN [38] | RGB+Flow | 75.5
PCSC (Ours) | RGB+Flow | 79.2
Table 3.1 shows the frame-level mAPs of different methods on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.

Methods | Streams | δ=0.2 | δ=0.5 | δ=0.75 | δ=0.5:0.95
Weinzaepfel et al. [83] | RGB+Flow | 46.8 | - | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 47.6 | - | - | -
Peng and Schmid [54] | RGB+Flow | 73.5 | 32.1 | 2.7 | 7.3
Saha et al. [60] | RGB+Flow | 66.6 | 36.4 | 0.8 | 14.4
Yang, Gao and Nevatia [93] | RGB+Flow | 73.5 | 37.8 | - | -
Singh et al. [67] | RGB+Flow | 73.5 | 46.3 | 15.0 | 20.4
Ye, Yang and Tian [94] | RGB+Flow | 76.2 | - | - | -
Kalogeiton et al. [26] | RGB+Flow | 76.5 | 49.2 | 19.7 | 23.4
Li et al. [31] | RGB+Flow | 77.9 | - | - | -
Gu et al. [19] | RGB+Flow | - | 59.9 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 77.5 | 51.3 | 14.2 | 21.9
Faster R-CNN + FPN (Flow) [38] | Flow | 72.7 | 44.2 | 7.4 | 17.2
Faster R-CNN + FPN [38] | RGB+Flow | 80.1 | 53.2 | 15.9 | 23.7
PCSC + AD [69] | RGB+Flow | 84.3 | 61.0 | 23.0 | 27.8
PCSC + ASD + AD + T-PCSC (Ours) | RGB+Flow | 84.7 | 63.1 | 23.6 | 28.9
per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they perform worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC module) is applied to refine the action tubes generated by our PCSC method. Our preliminary
work [69], which only uses the actionness detector module for temporal boundary refinement, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with the other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain the actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ = 0.5.

Methods | Streams | mAPs
Peng and Schmid [54] | RGB+Flow | 58.5
Ye, Yang and Tian [94] | RGB+Flow | 63.2
Kalogeiton et al. [26] | RGB+Flow | 65.7
Hou, Chen and Shah [21] | RGB | 61.3
Gu et al. [19] | RGB+Flow | 73.3
Sun et al. [72] | RGB+Flow | 77.9
Faster R-CNN + FPN (RGB) [38] | RGB | 64.7
Faster R-CNN + FPN (Flow) [38] | Flow | 68.4
Faster R-CNN + FPN [38] | RGB+Flow | 70.2
PCSC (Ours) | RGB+Flow | 74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Methods | Streams | δ=0.2 | δ=0.5 | δ=0.75 | δ=0.5:0.95
Gkioxari and Malik [18] | RGB+Flow | - | 53.3 | - | -
Wang et al. [79] | RGB+Flow | - | 56.4 | - | -
Weinzaepfel et al. [83] | RGB+Flow | 63.1 | 60.7 | - | -
Saha et al. [60] | RGB+Flow | 72.6 | 71.5 | 43.3 | 40.0
Peng and Schmid [54] | RGB+Flow | 74.1 | 73.1 | - | -
Singh et al. [67] | RGB+Flow | 73.8 | 72.0 | 44.5 | 41.6
Kalogeiton et al. [26] | RGB+Flow | 74.2 | 73.7 | 52.1 | 44.8
Ye, Yang and Tian [94] | RGB+Flow | 75.8 | 73.8 | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 78.2 | 73.4 | - | -
Hou, Chen and Shah [21] | RGB | 78.4 | 76.9 | - | -
Gu et al. [19] | RGB+Flow | - | 78.6 | - | -
Sun et al. [72] | RGB+Flow | - | 80.1 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 74.5 | 73.9 | 54.1 | 44.2
Faster R-CNN + FPN (Flow) [38] | Flow | 77.9 | 76.1 | 56.1 | 46.3
Faster R-CNN + FPN [38] | RGB+Flow | 79.1 | 78.5 | 57.2 | 47.6
PCSC (Ours) | RGB+Flow | 82.6 | 82.2 | 63.1 | 52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best, behind only a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage | PCSC w/o feature cooperation | PCSC
0 | 75.5 | 78.1
1 | 76.1 | 78.6
2 | 76.4 | 78.8
3 | 76.7 | 79.1
4 | 76.7 | 79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.

Methods | Video mAP (%)
Faster R-CNN + FPN [38] | 53.2
PCSC | 56.4
PCSC + AD | 61.0
PCSC + ASD (single branch) | 58.4
PCSC + ASD | 60.7
PCSC + ASD + AD | 61.9
PCSC + ASD + AD + T-PCSC | 63.1
Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage | Video mAP (%)
0 | 61.9
1 | 62.8
2 | 63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative version of our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, such improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
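The per-stage combination step above relies on standard greedy NMS over the union of detected boxes. A minimal NumPy sketch of that operation, under the usual [x1, y1, x2, y2] box convention (an illustration, not the thesis code):

```python
import numpy as np

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop all
    remaining boxes whose IoU with it exceeds thresh."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= thresh]  # discard heavily overlapping boxes
    return keep
```

Running it on the pooled boxes from Stages 0 through t yields the deduplicated output reported for stage t.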
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off its key components, and report the video-level mAPs at the IoU threshold δ = 0.5. Specifically, we investigate five configurations for temporal boundary refinement. In "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization and the action segment detection method for temporal localization. "PCSC + ASD (single branch)" is the "PCSC + ASD" configuration with the multi-branch architecture in the action segment detection method replaced by a single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Chapter 3. Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage    LoR (%)
1        61.1
2        71.9
3        95.2
4        95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and reaches the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefit of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 when using our temporal PCSC method at different stages, in order to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
3.4 Experimental results
Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage    MABO (%)
1        84.1
2        84.3
3        84.5
4        84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ      Stage 1   Stage 2   Stage 3   Stage 4
0.5    55.3      55.7      56.3      57.0
0.6    62.1      63.2      64.3      65.5
0.7    68.0      68.5      69.4      69.5
0.8    73.9      74.5      74.8      75.1
0.9    78.4      78.4      79.5      80.1
Table 3.11: Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                    0      1      2      3      4
Detection speed (FPS)    3.1    2.9    2.8    2.6    2.4
Table 3.12: Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                    0      1      2
Detection speed (VPS)    2.1    1.7    1.5
ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance of our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream, we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be clearly observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.
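The LoR and MABO statistics reported above can be computed along the following lines. This is a sketch only; the box format (x1, y1, x2, y2) is an assumption, and the 0.7 threshold comes from the LoR definition in the text.

```python
import numpy as np

def box_iou_matrix(a, b):
    """Pairwise IoU between two box arrays of shape [N, 4] and [M, 4],
    with boxes given as [x1, y1, x2, y2]."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def large_overlap_ratio(rgb_props, flow_props, thr=0.7):
    """LoR: fraction of RGB proposals whose maximum IoU (mIoU) over all
    flow proposals exceeds thr."""
    miou = box_iou_matrix(rgb_props, flow_props).max(axis=1)
    return float((miou > thr).mean())

def mabo(gt_boxes, pred_boxes):
    """MABO: mean, over ground-truth boxes, of the best overlap with any
    predicted box."""
    return float(box_iou_matrix(gt_boxes, pred_boxes).max(axis=1).mean())
```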
Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and the temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between localization performance and speed, which can be balanced by choosing an appropriate number of stages.
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b)-(d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,
and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4(a)) or false detections (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).
Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process of our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame even though there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases in which our method does not work (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced by the temporal refinement process of our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) do not work well for this challenging case either. Therefore, how to reduce false detections in background frames is still an open issue,
which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB or flow) by leveraging the information from the other stream at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of work, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of work is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance is gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the video analysis community. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] have applied convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited RGB frames and optical flow maps to extract appearance and motion features, and then combined these two streams to exchange complementary information from both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, many temporal action localization methods employ the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories, with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form action tubes. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on accurately detecting the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action of each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on a given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal detail; however, these segments often suffer from a low recall rate. In the second category, as the proposals are generated from predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level
interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain to our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at both the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach
In the following sections, we first briefly introduce our overall temporal action localization framework (in Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (in Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (in Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.
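Segments $(t_s, t_e)$ are compared throughout with temporal IoU (e.g., at thresholds such as δ = 0.5). A minimal helper, included here as an illustrative sketch of the standard definition:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two segments given as (t_s, t_e) tuples."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```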
The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.
Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.
Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage, and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $P^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $P^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$, $S_0^{Flow}$ and the predicted action categories.
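The segment-of-interest pooling referred to above is essentially a 1D analogue of RoI pooling: a variable-length temporal span of a $C \times T$ feature map is pooled into a fixed number of bins. A minimal sketch (the bin count, max-pooling choice, and rounding scheme are assumptions, not the exact operation from [5]):

```python
import numpy as np

def soi_pooling(features, t_start, t_end, out_len=8):
    """1D segment-of-interest pooling: max-pool a C x T feature map over
    out_len equal temporal bins covering [t_start, t_end).
    Degenerate (empty) segments are not handled in this sketch."""
    C, T = features.shape
    t_start = max(0, int(t_start))
    t_end = min(T, int(t_end))
    edges = np.linspace(t_start, t_end, out_len + 1)
    out = np.zeros((C, out_len), dtype=features.dtype)
    for j in range(out_len):
        lo, hi = int(np.floor(edges[j])), int(np.ceil(edges[j + 1]))
        hi = max(hi, lo + 1)  # ensure at least one frame per bin
        out[:, j] = features[:, lo:min(hi, T)].max(axis=1)
    return out
```

The fixed-size output is what allows a single "Detection Head" to process proposals of arbitrary temporal length.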
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For ease of presentation, when $n = 1$, the initial proposals $S_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $S_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively applies the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-Granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims at simultaneously encouraging rich feature-level and proposal-level collaboration between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).
To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream; these are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $S_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, and updates the flow-stream frame-based features as $Q_n^{Flow}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
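The alternating input/output bookkeeping described above can be sketched as follows. Here `afc_stage` and the dictionary keys are hypothetical stand-ins for the full AFC module; only the stage wiring (which indices feed which stage) is illustrated, with stage-0 outputs standing in for missing predecessors as in the initialization described later.

```python
def progressive_cooperation(num_stages, init, afc_stage):
    """Alternate RGB-stream (odd n) and flow-stream (even n) AFC stages.
    `init` holds stage-0 outputs: proposals S and features P, Q per stream.
    `afc_stage(main, inputs)` is a stand-in callable returning (S_n, P_n, Q_n)."""
    S = {'RGB': {0: init['S_RGB']}, 'Flow': {0: init['S_Flow']}}
    P = {'RGB': {0: init['P_RGB']}, 'Flow': {0: init['P_Flow']}}
    Q = {'RGB': {0: init['Q_RGB']}, 'Flow': {0: init['Q_Flow']}}
    all_segments = [init['S_RGB'], init['S_Flow']]
    for n in range(1, num_stages + 1):
        main, other = ('RGB', 'Flow') if n % 2 == 1 else ('Flow', 'RGB')
        inputs = {
            'S_other': S[other][max(n - 1, 0)],  # e.g. S^Flow_{n-1}
            'S_main': S[main][max(n - 2, 0)],    # e.g. S^RGB_{n-2}
            'P_other': P[other][max(n - 1, 0)],
            'P_main': P[main][max(n - 2, 0)],
            'Q_other': Q[other][max(n - 2, 0)],  # e.g. Q^Flow_{n-2}
        }
        s_n, p_n, q_n = afc_stage(main, inputs)
        S[main][n] = s_n      # main-stream segments S_n
        P[main][n] = p_n      # main-stream anchor-based features P_n
        Q[other][n] = q_n     # updated other-stream frame-based features Q_n
        all_segments.append(s_n)
    return all_segments  # final results: NMS over the union of these
```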
In each AFC stage, which will be discussed in detail in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively pass the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that, within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since rich cooperations are performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages; thus, the proposed progressive localization strategy can gradually improve the localization performance.
4.2.3 Anchor-Frame Cooperation Module
The Anchor-Frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, named according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization As the AFC module is applied in a progressive mannerit uses the inputs from the preceding AFC stages However for the firststage (ie n = 1) some inputs do not have valid predecessors Insteadwe use the initial anchor-based and frame-based two-stream features
42 Our Approach 69
PRGB0 PFlow
0 and QRGB0 QFlow
0 as well as proposals SRGB0 SFlow
0 as theinput as shown in Sec 421 and Fig 41(a)
Single-stream Anchor-to-Frame Message Passing (ie "Cross-granularity Message Passing" in Fig 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features P^Flow_{n-1} to the frame-based features Q^Flow_{n-2}, both from the flow stream. The message passing operation is defined as

Q^Flow_n = M_{anchor->frame}(P^Flow_{n-1}) ⊕ Q^Flow_{n-2},    (4.1)

where ⊕ is the element-wise addition operation and M_{anchor->frame}(·) is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked 1×1 convolutional layers with ReLU non-linear activation functions. The improved features Q^Flow_n can generate more accurate action starting and ending probability distributions (ie more accurate action durations), as they absorb the temporal duration-aware anchor-based features, and therefore generate better frame-based proposals Q^Flow_n as described in Sec 4.2.1. For the flow-stream AFC module (see Fig 4.1(c)), the message passing process is similarly defined as Q^RGB_n = M_{anchor->frame}(P^RGB_{n-1}) ⊕ Q^RGB_{n-2}, which flips the superscripts from Flow to RGB; Q^RGB_n also leads to better frame-based RGB-stream proposals Q^RGB_n.
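To make the message passing operation of Eq (4.1) concrete, the following pure-Python sketch (with made-up toy weights and two channels; the channel sizes and weights are assumptions, not the thesis configuration) treats each 1×1 convolution as an independent linear map applied at every frame, followed by ReLU and the element-wise addition:

```python
def relu(x):
    return [max(v, 0.0) for v in x]

def linear(w, b, x):
    # one 1x1 convolution step: a linear map over channels at a single frame
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def message_passing(p_frames, q_frames, w1, b1, w2, b2):
    """Sketch of Eq. (4.1): Q_n = M(P_{n-1}) (+) Q_{n-2}, where M is two
    stacked 1x1 convolutions with ReLU and (+) is element-wise addition."""
    out = []
    for p_t, q_t in zip(p_frames, q_frames):
        m_t = relu(linear(w2, b2, relu(linear(w1, b1, p_t))))
        out.append([m + q for m, q in zip(m_t, q_t)])
    return out

# toy example: 2 frames, 2 channels, hand-picked weights (assumptions)
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # identity map
w2, b2 = [[1.0, 1.0], [0.0, 1.0]], [0.0, 0.0]
p = [[1.0, -2.0], [3.0, 4.0]]   # anchor-based features, e.g. P^Flow_{n-1}
q = [[0.0, 1.0], [1.0, 0.0]]    # frame-based features, e.g. Q^Flow_{n-2}
print(message_passing(p, q, w1, b1, w2, b2))  # [[1.0, 1.0], [8.0, 4.0]]
```

Because the 1×1 convolutions act per frame, the temporal length of the features is preserved, which is what allows the element-wise addition with Q^Flow_{n-2}.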
Cross-stream Frame-to-Anchor Proposal Cooperation (ie "Proposal Combination" in Fig 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals Q^Flow_n by using the "Frame-based Segment Proposal Generation" method described in Sec 4.2.1, based on the improved frame-based features Q^Flow_n. We then combine Q^Flow_n together with the two-stream anchor-based proposals S^Flow_{n-1} and S^RGB_{n-2} to form an updated set of anchor-based proposals at stage n, ie P^RGB_n = S^RGB_{n-2} ∪ S^Flow_{n-1} ∪ Q^Flow_n. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, ie the anchor-based flow-stream proposals are updated as P^Flow_n = S^Flow_{n-2} ∪ S^RGB_{n-1} ∪ Q^RGB_n.
Cross-stream Anchor-to-Anchor Message Passing (ie "Cross-stream Message Passing" in Fig 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated, similarly to Equation (4.1), by a message passing operation, ie P^RGB_n = M_{flow->rgb}(P^Flow_{n-1}) ⊕ P^RGB_{n-2} for the RGB stream and P^Flow_n = M_{rgb->flow}(P^RGB_{n-1}) ⊕ P^Flow_{n-2} for the flow stream.
Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals P^RGB_n, P^Flow_n and the updated anchor-based features P^RGB_n, P^Flow_n. To enlarge the perceptual aperture around action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (fixed to 10 in this paper), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
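As a rough illustration of the boundary-extended pooling described above (not the thesis implementation; SOI pooling is simplified here to plain average pooling, and the function names are hypothetical), the sketch below pools the proposal itself plus two windows around its start and end indices, then concatenates the three pooled vectors:

```python
def soi_pool(features, start, end):
    """Average-pool per-frame feature vectors over [start, end] (a
    simplified stand-in for segment-of-interest pooling)."""
    seg = features[max(start, 0): min(end, len(features) - 1) + 1]
    return [sum(f[c] for f in seg) / len(seg) for c in range(len(seg[0]))]

def proposal_feature(features, start, end, margin=10):
    """Pool the proposal itself plus two windows of half-width `margin`
    around its start and end indices, then concatenate the three pooled
    vectors (the concatenated feature is what the segment regressor and
    action classifier consume)."""
    center = soi_pool(features, start, end)
    left = soi_pool(features, start - margin, start + margin)
    right = soi_pool(features, end - margin, end + margin)
    return center + left + right

# toy example: 4 frames, 1 channel
feats = [[1.0], [2.0], [3.0], [4.0]]
print(proposal_feature(feats, 1, 2, margin=1))  # [2.5, 2.0, 3.0]
```

The two extra windows deliberately straddle the proposal boundaries, which is what gives the regressor context about frames just outside the segment.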
4.2.4 Training Details
At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: the cross-entropy loss for the action classification task and the smooth-L1 loss for the boundary regression task. The training loss in each frame-based branch is also a cross-entropy loss, which indicates the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
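The summed loss can be sketched as follows; this is a minimal illustration assuming already-normalised probabilities and omitting any loss-weighting terms, and the function names are illustrative rather than from the thesis:

```python
import math

def smooth_l1(pred, target):
    """Smooth-L1 loss for boundary regression: quadratic for small errors,
    linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def cross_entropy(probs, label):
    """Cross-entropy given already-normalised class probabilities."""
    return -math.log(probs[label])

def afc_stage_loss(cls_probs, cls_label, bnd_preds, bnd_targets,
                   frame_probs, frame_labels):
    """Sum of the anchor-based losses (action classification + smooth-L1
    boundary regression) and the frame-based frame-wise starting/ending
    classification loss; the model is trained jointly on this sum."""
    loss = cross_entropy(cls_probs, cls_label)
    loss += sum(smooth_l1(p, t) for p, t in zip(bnd_preds, bnd_targets))
    loss += sum(cross_entropy(p, l) for p, l in zip(frame_probs, frame_labels))
    return loss
```

Note the smooth-L1 transition at |error| = 1: below it the loss is quadratic (gentle gradients near the target boundary), above it linear (robust to badly misplaced boundaries).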
4.2.5 Discussions
Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, and in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating RGB and flow features as lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block for fusing the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features from the gradually absorbed cross-stream information and cross-granularity knowledge, and thus significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. Nor does the number of proposals decrease, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.
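The greedy temporal NMS that keeps the proposal count bounded can be sketched as below; the 0.7 suppression threshold is an illustrative assumption, not the thesis setting:

```python
def tiou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, threshold=0.7):
    """Greedy NMS over (start, end, score) segments: keep the highest-scoring
    segment, drop remaining segments overlapping it above the threshold, and
    repeat; redundant segments disappear while confident ones survive."""
    rest = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    while rest:
        best = rest.pop(0)
        kept.append(best)
        rest = [s for s in rest if tiou(best[:2], s[:2]) <= threshold]
    return kept

segs = [(0, 10, 0.9), (1, 10, 0.8), (20, 30, 0.7)]
print(temporal_nms(segs))  # [(0, 10, 0.9), (20, 30, 0.7)]
```

Here the second segment (tIoU 0.9 with the first) is suppressed, while the non-overlapping third segment is kept, which is exactly why stacking stages does not blow up the proposal set.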
4.3 Experiment

4.3.1 Datasets and Setup
Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.
- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We treat these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model by the mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlapping area, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
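The correctness criterion can be sketched directly; this is a minimal illustration (function and label names are hypothetical) of the matching rule on the temporal axis:

```python
def tiou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(det_seg, det_label, gt_segs, gt_labels, delta=0.5):
    """A detection counts as correct when some ground-truth instance with
    the same label overlaps it with tIoU larger than the threshold delta."""
    return any(lbl == det_label and tiou(det_seg, g) > delta
               for g, lbl in zip(gt_segs, gt_labels))

print(is_correct((0, 10), "run", [(2, 10)], ["run"]))   # True  (tIoU = 0.8)
print(is_correct((0, 10), "jump", [(2, 10)], ["run"]))  # False (wrong label)
```

For UCF-101-24 the same rule applies with the st-IoU (the temporal overlap additionally weighted by per-frame box IoU) in place of the tIoU.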
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5,
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
SCNN [63] | 47.7 | 43.5 | 36.3 | 28.7 | 19.0
REINFORCE [95] | 48.9 | 44.0 | 36.0 | 26.4 | 17.1
CDC [62] | - | - | 40.1 | 29.4 | 23.3
SMS [98] | 51.0 | 45.2 | 36.5 | 27.8 | 17.8
TURN [15] | 54.0 | 50.9 | 44.1 | 34.9 | 25.6
R-C3D [85] | 54.5 | 51.5 | 44.8 | 35.6 | 28.9
SSN [103] | 66.0 | 59.4 | 51.9 | 41.0 | 29.8
BSN [37] | - | - | 53.5 | 45.0 | 36.9
MGG [46] | - | - | 53.9 | 46.8 | 37.4
BMN [35] | - | - | 56.0 | 47.4 | 38.8
TAL-Net [5] | 59.8 | 57.1 | 53.2 | 48.5 | 42.8
P-GCN [99] | 69.5 | 67.8 | 63.6 | 57.8 | 49.1
Ours | 71.2 | 68.9 | 65.1 | 59.5 | 51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two paradigms (ie the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = 0.5, 0.75, 0.95 on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging all mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 6.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and
Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ | 0.5 | 0.75 | 0.95 | Average
CDC [62] | 45.30 | 26.00 | 0.20 | 23.80
TCN [9] | 36.44 | 21.15 | 3.90 | -
R-C3D [85] | 26.80 | - | - | -
SSN [103] | 39.12 | 23.48 | 5.49 | 23.98
TAL-Net [5] | 38.23 | 18.30 | 1.30 | 20.22
P-GCN [99] | 42.90 | 28.14 | 2.47 | 26.99
Ours | 44.31 | 29.85 | 5.47 | 28.85
Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ | 0.5 | 0.75 | 0.95 | Average
BSN [37] | 46.45 | 29.96 | 8.02 | 30.03
P-GCN [99] | 48.26 | 33.16 | 3.27 | 31.11
BMN [35] | 50.07 | 34.78 | 8.29 | 33.85
Ours | 52.04 | 35.92 | 7.97 | 34.91
BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using extra external labels as in [35] and proposals from BMN, in which case its average mAP (34.91%) is better than those of BSN and BMN.
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5, 0.75, and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ | 0.2 | 0.5 | 0.75 | 0.5:0.95
LTT [83] | 46.8 | - | - | -
MRTSR [54] | 73.5 | 32.1 | 2.7 | 7.3
MSAT [60] | 66.6 | 36.4 | 7.9 | 14.4
ROAD [67] | 73.5 | 46.3 | 15.0 | 20.4
ACT [26] | 76.5 | 49.2 | 19.7 | 23.4
TAD [19] | - | 59.9 | - | -
PCSC [69] | 84.3 | 61.0 | 23.0 | 27.8
Ours | 84.9 | 63.5 | 23.7 | 29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performances among five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which Q^Flow_n and Q^RGB_n are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features as lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, in which case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P^Flow_{n-1} and P^RGB_{n-2} (for the RGB-stream AFC module) and P^RGB_{n-1} and P^Flow_{n-2} (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (ie the aggregated features) in "w/o CS". The discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage | 0 | 1 | 2 | 3 | 4
AFC | 44.75 | 46.92 | 48.14 | 48.37 | 48.22
w/o CS | 44.21 | 44.96 | 45.47 | 45.48 | 45.37
w/o CG | 43.55 | 45.95 | 46.71 | 46.65 | 46.67
w/o CS-MP | 44.75 | 45.34 | 45.51 | 45.53 | 45.49
w/o CG-MP | 44.75 | 46.32 | 46.97 | 46.85 | 46.79
w/o PC | 42.14 | 43.21 | 45.14 | 45.74 | 45.77
In Table 4.5, the results for stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P^RGB_0 ∪ P^Flow_0 for "w/o CG" and Q^RGB_0 ∪ Q^Flow_0 for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results as the number of stages increases, since these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage | Comparisons | LoR (%)
1 | P^RGB_1 vs S^Flow_0 | 67.1
2 | P^Flow_2 vs P^RGB_1 | 79.2
3 | P^RGB_3 vs P^Flow_2 | 89.5
4 | P^Flow_4 vs P^RGB_3 | 97.1
when the number of stages increases, and the results become stable after 2 or 3 stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages for our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets P^RGB_1 at Stage 1 and P^Flow_2 at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in P^Flow_2 among all proposals in P^Flow_2. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P^RGB_1 is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P^Flow_2) is calculated by comparing it with all proposals in the other proposal set (say P^RGB_1) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
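The LoR computation described above can be sketched in a few lines (the function names are illustrative; proposals are (start, end) pairs):

```python
def tiou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current, preceding, threshold=0.7):
    """Large-over-Ratio: the fraction of proposals in `current` whose
    max-tIoU against all proposals in `preceding` exceeds the threshold."""
    similar = sum(1 for p in current
                  if max(tiou(p, q) for q in preceding) > threshold)
    return similar / len(current)

current = [(0.0, 10.0), (20.0, 30.0)]    # e.g. proposals in P^Flow_2
preceding = [(0.0, 9.0)]                 # e.g. proposals in P^RGB_1
print(lor_score(current, preceding))     # 0.5
```

In the toy example, the first proposal has max-tIoU 0.9 with the preceding set (similar) and the second has 0.0 (not similar), giving an LoR of 0.5.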
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in P^Flow_4 and those in P^RGB_3 are very similar, even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
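The averaging of the eleven recall rates can be sketched as follows (an illustrative implementation, with proposals and ground truths as (start, end) pairs; for AR@AN the proposal list is truncated to AN proposals per video before this call):

```python
def tiou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(proposals, ground_truths, thresholds):
    """For each tIoU threshold, the fraction of ground-truth segments matched
    by at least one proposal; the per-threshold recalls are then averaged."""
    recalls = []
    for t in thresholds:
        hit = sum(1 for g in ground_truths
                  if any(tiou(p, g) >= t for p in proposals))
        recalls.append(hit / len(ground_truths))
    return sum(recalls) / len(recalls)

# eleven thresholds from 0.5 to 1.0 with an interval of 0.05, as in the text
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(11)]
props = [(0.0, 10.0)]
gts = [(0.0, 10.0), (20.0, 30.0)]
print(average_recall(props, gts, thresholds))  # 0.5
```

In the toy example, the single proposal matches one of the two ground-truth segments perfectly at every threshold, so each of the eleven recall rates is 0.5 and so is their average.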
In Fig 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown as the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset. Panels (a)-(d) show AR@50, AR@100, AR@200 and AR@500, respectively.
Figure 4.3: Qualitative comparisons of the results from two baseline methods (ie the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig 4.3, we show the visualization results generated by either the anchor-based method [5] or the frame-based method [37], and by our PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate an action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy that encourages collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.
5.1 Background

5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals; the proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using a pre-trained object detector. For example, Liao et al [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et
al [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopts a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology

5.2.1 Overview
We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely V = {V^clip_k}_{k=1}^{K}, where V^clip_k indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^{N}, where s_n indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (ie the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.
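The problem formulation above maps onto simple data structures; the following sketch is purely illustrative (the names `ObjectTube` and `split_into_clips` are hypothetical, not from the thesis):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectTube:
    """Sketch of the STVG output B: one 4-d box (top-left and bottom-right
    coordinates) per frame between the frame indexes t_s and t_e."""
    t_s: int
    t_e: int
    boxes: List[Tuple[float, float, float, float]]

    def __post_init__(self):
        # one bounding box b_t for every frame t in [t_s, t_e]
        assert len(self.boxes) == self.t_e - self.t_s + 1

def split_into_clips(num_frames: int, clip_len: int) -> List[range]:
    """Split an untrimmed video of K*T frames into K non-overlapping clips
    of T frames each, as in the formulation V = {V^clip_k}."""
    return [range(i, i + clip_len) for i in range(0, num_frames, clip_len)]

clips = split_into_clips(6, 2)             # K = 3 clips of T = 2 frames
tube = ObjectTube(t_s=1, t_e=3, boxes=[(0, 0, 5, 5)] * 3)
print(len(clips), list(clips[1]))          # 3 [2, 3]
```

The invariant checked in `__post_init__` (one box per frame of the tube) is exactly what makes B a spatio-temporal tube rather than a loose set of detections.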
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes b_t for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature F^clip_k ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens, [CLS] and [SEP], before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
Multi-modal Feature Learning. Given the visual input feature F^clip_k and the textual input tokens E, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
The original ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules for the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as the key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed into the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature F^tv ∈ R^{T×HW×C} and the visual-guided textual feature, respectively.
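The tensor bookkeeping of the STCD module can be sketched as follows. This is a simplified numpy illustration of the pooling, combination, decomposition, replication and normalization steps only; the attention blocks, learned parameters and the textual branch are omitted, and the function names are ours:

```python
import numpy as np

def stcd_combine(F):
    """F: input visual feature of shape (T, HW, C).
    Returns the combined visual feature of shape (T + HW, C)."""
    temporal_feat = F.mean(axis=1)   # spatial average pooling  -> (T, C)
    spatial_feat = F.mean(axis=0)    # temporal average pooling -> (HW, C)
    return np.concatenate([temporal_feat, spatial_feat], axis=0)

def stcd_decompose(G, F):
    """G: text-guided combined feature of shape (T + HW, C); F: input visual
    feature of shape (T, HW, C). Returns the intermediate visual feature."""
    T, HW, C = F.shape
    g_temporal, g_spatial = G[:T], G[T:]   # (T, C) and (HW, C)
    # replicate HW times and T times respectively, then add to the input
    out = F + g_temporal[:, None, :] + g_spatial[None, :, :]
    # normalize over the channel dimension (layer-norm style, no learned params)
    mu = out.mean(axis=-1, keepdims=True)
    sd = out.std(axis=-1, keepdims=True) + 1e-6
    return (out - mu) / sd
```

The key point the shapes make explicit is that the combined feature (T + HW) × C is much smaller than the full T × HW × C feature, while the decomposition-and-replication step restores the full spatio-temporal resolution.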
Spatial Localization. Given the text-guided visual feature F^tv, our SVG-net predicts the bounding box b_t at each frame. We first reshape F^tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap Â ∈ R^{8H×8W}, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center, and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
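The center-and-size decoding step described above can be sketched as follows (a simplified numpy illustration; in the actual model the heatmap and the size map are produced by the two detection heads, and the names `heatmap`/`size_map` are ours):

```python
import numpy as np

def decode_box(heatmap, size_map):
    """heatmap: (8H, 8W) center probabilities; size_map: (8H, 8W, 2) holding
    the predicted (width, height) at every location.
    Returns the predicted box as (x1, y1, x2, y2)."""
    # location of the highest center probability
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = size_map[cy, cx]
    # convert center + size to top-left and bottom-right corners
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```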
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F^tv to produce the global text-guided visual feature F^gtv ∈ R^{T×C} for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token [CLS]) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from F^gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score ô^obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold, namely, the bounding boxes with the smoothed relevance scores less than the threshold are removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t_s^0, t_e^0). The initial temporal boundaries (t_s^0, t_e^0) and the predicted bounding boxes b_t form the initial tube prediction result B^init = {b_t}_{t=t_s^0}^{t_e^0}.
Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then select the video clips which contain at least one frame with a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height, as well as the relevance label, of the ground-truth bounding box as (x̄, ȳ), w̄, h̄ and ō, respectively. When the frame contains a ground-truth bounding box, we have ō = 1; otherwise, we have ō = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap A for each frame by using the Gaussian kernel
a_xy = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where a_xy is the value of A ∈ R^{8H×8W} at the spatial location (x, y), and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:

L_SVG = λ1 L_c(Â, A) + λ2 L_rel(ô^obj, ō) + λ3 L_size(ŵ_xy, ĥ_xy, w̄, h̄),   (5.1)

where Â ∈ R^{8H×8W} is the predicted heatmap, ô^obj is the predicted relevance score, and ŵ_xy and ĥ_xy are the predicted width and height of the bounding box centered at the position (x, y). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ1 = λ2 = 1 and λ3 = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
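The construction of the ground-truth center heatmap A can be sketched as follows (a numpy illustration; the adaptive choice of σ from the object size [28] is replaced here by an explicit `sigma` argument, and the function name is ours):

```python
import numpy as np

def center_heatmap(height, width, cx, cy, sigma):
    """Ground-truth heatmap A of shape (height, width) with the Gaussian bump
    a_xy = exp(-((x - cx)^2 + (y - cy)^2) / (2 sigma^2)) around the GT center."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

By construction the heatmap peaks at 1 exactly at the ground-truth center and decays smoothly, which is what the focal loss L_c is trained against.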
5.2.3 Temporal Boundary Refinement
The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect the temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.
As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., B^init) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
t_s and t_e. The frame-based branch then continues to refine the boundaries t_s and t_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.
Given an anchor with the temporal boundaries (t_s^0, t_e^0) and the length l = t_e^0 − t_s^0, we temporally extend the anchor before and after the starting and the ending frames by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t_s^0 − l/4, t_e^0 + l/4] and generate the anchor feature F̄ by stacking the corresponding features along the temporal dimension from the pooled video feature F^pool at the 20 sampled frames, where the pooled video feature F^pool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature F̄, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t_s^0 and t_e^0 and generate the new starting and ending frame positions t_s^1 and t_e^1.
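The anchor extension and uniform sampling step can be sketched as follows (a numpy illustration; the sampled positions are kept fractional here and would be rounded to valid frame indices in practice, and the function name is ours):

```python
import numpy as np

def extend_and_sample(t_s, t_e, num_samples=20):
    """Extend the anchor [t_s, t_e] by l/4 on both sides (l = t_e - t_s) and
    uniformly sample `num_samples` frame positions in the extended duration."""
    l = t_e - t_s
    return np.linspace(t_s - l / 4.0, t_e + l / 4.0, num_samples)
```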
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the possibility of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t_s^1 generated from the previous anchor-based branch, we define the starting duration as D_s = [t_s^1 − D/2 + 1, t_s^1 + D/2], where D is the length of the starting duration. We then construct the frame-level feature F̄_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F^pool. We then take the frame-level feature F̄_s and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t_s^1. Accordingly, we can produce the ending frame position t_e^1. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t_s^1 and t_e^1 are generated, they can be used as the input for the anchor-based branch at the next stage.
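The local refinement performed by the frame-based branch can be sketched as follows (a numpy illustration; `probs` stands for the per-frame starting probabilities produced by the 1D convolution, and the function name is ours):

```python
import numpy as np

def refine_start(t_s, probs, D):
    """probs[i] is the starting probability of frame t_s - D//2 + 1 + i inside
    the starting duration D_s = [t_s - D/2 + 1, t_s + D/2] of length D.
    Returns the refined starting frame position (argmax over the window)."""
    assert len(probs) == D
    return t_s - D // 2 + 1 + int(np.argmax(probs))
```

The ending position is refined in the same way with a separate, non-shared predictor, as described above.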
Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

L_TBR = λ4 L_anc(Δt̂, Δt) + λ5 L_s(p̂_s, p_s) + λ6 L_e(p̂_e, p_e),   (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and the ending frames, respectively. In L_anc, Δt̂ and Δt are the predicted offsets and the ground-truth offsets, respectively. In L_s, p̂_s is the predicted probability of being the starting frame for the frames within the starting duration, while p_s is the ground-truth label for starting frame position prediction. Similarly, we use p̂_e and p_e for the loss L_e. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.
5.3 Experiment

5.3.1 Experiment Setup
Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
-VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.
-HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. The dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p and the temporal length of each clip are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented using PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames
between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and S_U is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
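The vIoU computation can be sketched as follows (a numpy illustration, representing a tube as a mapping from frame index to a box (x1, y1, x2, y2); the function names are ours):

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def viou(det_tube, gt_tube):
    """det_tube, gt_tube: dicts {frame index: box}.
    vIoU = (1/|S_U|) * sum of per-frame IoUs r_t over the intersected frames S_I."""
    s_i = set(det_tube) & set(gt_tube)   # frames present in both tubes
    s_u = set(det_tube) | set(gt_tube)   # frames present in either tube
    return sum(box_iou(det_tube[t], gt_tube[t]) for t in s_i) / len(s_u)
```

Note that even a perfect per-frame detector is penalized by vIoU when the temporal extents of the two tubes disagree, since the sum runs over S_I but is normalized by |S_U|.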
Baseline Methods
-STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
-STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
-Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
-SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods
We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance: when compared with the initial tube generation results from our SVG-net, we achieve more than 2% further improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study
In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.
Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.
It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from the original work.

                              Declarative Sentence Grounding        Interrogative Sentence Grounding
  Method                      m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)   m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
  STGRN* [102]                19.75      25.77        14.60         18.32      21.10        12.83
  Vanilla (ours)              21.49      27.69        15.77         20.22      23.17        13.82
  SVG-net (ours)              22.87      28.31        16.31         21.57      24.09        14.33
  SVG-net + TBR-net (ours)    25.21      30.99        18.21         23.67      26.26        16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

  Method                      m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
  STGVT* [76]                 18.15      26.81        9.48
  Vanilla (ours)              17.94      25.59        8.75
  SVG-net (ours)              19.45      27.78        10.12
  SVG-net + TBR-net (ours)    22.41      30.31        12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.
From the results, we have the following observations: (1) for the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages; (2) the performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes; (3) the results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3; (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                               Declarative Sentence Grounding      Interrogative Sentence Grounding
  Method             Stages   m_vIoU  vIoU@0.3  vIoU@0.5          m_vIoU  vIoU@0.3  vIoU@0.5
  SVG-net            -        22.87   28.31     16.31             21.57   24.09     14.33
  SVG-net + IFB-net  1        23.41   28.84     16.69             22.12   24.48     14.57
                     2        23.82   28.97     16.85             22.55   24.69     14.71
                     3        23.77   28.91     16.87             22.61   24.72     14.79
  SVG-net + IAB-net  1        23.27   29.57     16.39             21.78   24.93     14.31
                     2        23.74   29.98     16.52             22.21   25.42     14.59
                     3        23.85   30.15     16.57             22.43   25.71     14.47
  SVG-net + TBR-net  1        24.32   29.92     17.22             22.69   25.37     15.29
                     2        24.95   30.75     17.02             23.34   26.03     15.83
                     3        25.21   30.99     18.21             23.67   26.26     16.01
our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary
In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our work and discuss potential directions for video localization in the future.
6.1 Conclusions
The contributions of this thesis are summarized as follows
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., the RGB/flow stream) by leveraging the information from another stream (i.e., the flow/RGB stream) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
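The alternating refinement idea behind PCSC can be sketched as follows. This is a minimal, illustrative Python sketch under our own simplifying assumptions, not the actual PCSC implementation: all functions (`merge_proposals`, `fuse_features`, `detect`) are toy stand-ins that only mimic the proposal-level and feature-level cooperation between the two streams across stages.

```python
# Toy sketch of progressive cross-stream cooperation (NOT the thesis code):
# at each stage, one stream refines its proposals and features using the
# other stream, and odd/even stages alternate which stream is refined.

def merge_proposals(own, other):
    # Proposal-level cooperation: pool region proposals from both streams.
    return sorted(set(own) | set(other))

def fuse_features(own, other, w=0.5):
    # Feature-level cooperation: simple weighted message passing.
    return [w * a + (1 - w) * b for a, b in zip(own, other)]

def detect(proposals, features):
    # Stand-in detection head: score every proposal with a pooled feature.
    score = sum(features) / len(features)
    return [(p, score) for p in proposals]

def progressive_cooperation(rgb, flow, num_stages=4):
    # Even stages refine the RGB stream with help from flow,
    # odd stages refine the flow stream with help from RGB.
    streams = {"rgb": rgb, "flow": flow}
    for stage in range(num_stages):
        cur, other = ("rgb", "flow") if stage % 2 == 0 else ("flow", "rgb")
        streams[cur]["proposals"] = merge_proposals(
            streams[cur]["proposals"], streams[other]["proposals"])
        streams[cur]["features"] = fuse_features(
            streams[cur]["features"], streams[other]["features"])
        streams[cur]["detections"] = detect(
            streams[cur]["proposals"], streams[cur]["features"])
    return streams

rgb = {"proposals": [1, 3], "features": [0.2, 0.8], "detections": []}
flow = {"proposals": [2, 3], "features": [0.6, 0.4], "detections": []}
out = progressive_cooperation(rgb, flow)
print(out["rgb"]["proposals"])  # → [1, 2, 3]
```

After a few stages, both streams end up operating on the union of the proposal sets while their features are progressively blended, which is the intuition behind why each stream benefits from the other.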
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy that encourages collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3, and UCF-101-24 datasets show the effectiveness of our proposed framework.
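To illustrate the complementarity that the AFC module exploits, the following toy Python sketch (our own illustration, not the PCG-TAL code) contrasts the two proposal-generation paradigms: anchor-based proposals cover the timeline with predefined windows whose extents are complete but coarse, while frame-based proposals group high-actionness frames into segments whose boundaries are precise but possibly fragmented.

```python
# Toy contrast of the two temporal-proposal paradigms (illustrative only).

def anchor_proposals(num_frames, scales=(8, 16)):
    # Anchor-based: predefined windows slid over the timeline
    # with 50% overlap; complete coverage, but coarse boundaries.
    props = []
    for s in scales:
        for start in range(0, num_frames - s + 1, s // 2):
            props.append((start, start + s))
    return props

def frame_proposals(actionness, thresh=0.5):
    # Frame-based: group consecutive frames whose per-frame actionness
    # score exceeds the threshold; precise but possibly fragmented.
    props, start = [], None
    for t, a in enumerate(actionness):
        if a > thresh and start is None:
            start = t
        elif a <= thresh and start is not None:
            props.append((start, t))
            start = None
    if start is not None:
        props.append((start, len(actionness)))
    return props

actionness = [0.1] * 4 + [0.9] * 6 + [0.1] * 6  # one action in frames 4-9
print(frame_proposals(actionness))  # → [(4, 10)]
print(anchor_proposals(16))         # → [(0, 8), (4, 12), (8, 16), (0, 16)]
```

In this toy example, the frame-based scheme recovers the exact segment (4, 10), while no fixed anchor matches it precisely; conversely, if the actionness scores were noisy and dipped mid-action, the frame-based segment would fragment while anchors would still cover the full extent. Fusing the two, as the AFC module does at both the proposal and feature levels, hedges against both failure modes.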
• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. In addition, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net, so that SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work
A potential direction for future video localization research is to unify spatial video localization and temporal video localization into one framework. As described in Chapter 3, most existing works generate the spatial and the temporal action localization results with two separate networks. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress in handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert framework introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed to combine these two branches so as to generate the spatial and the temporal visual grounding results simultaneously.
Bibliography
[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.
[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, Madison, Wisconsin, USA. ACM, 1998, pp. 92–100.
[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.
[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.
[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.
[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.
[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.
[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.
[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.
[13] J. Gao, K. Chen, and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.
[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.
[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[18] G. Gkioxari and J. Malik. "Finding action tubes". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In: The IEEE International Conference on Computer Vision (ICCV). 2017.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.
[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.
[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS Challenge: Action recognition with a large number of classes. 2014.
[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.
[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.
[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).
[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.
[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In: AAAI. 2020, pp. 11336–11344.
[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In: arXiv preprint (2019).
[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.
[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.
[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.
[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.
[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In: International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In: The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "FishNet: A versatile backbone for image, region, and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
Deep Learning for Spatial and Temporal Video Localization
by
Rui SU
BE, South China University of Technology; MPhil, University of Queensland
A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of
Doctor of Philosophy
at
University of Sydney
June 2021
COPYRIGHT © 2021 BY RUI SU
ALL RIGHTS RESERVED
Declaration
I, Rui Su, declare that this thesis titled "Deep Learning for Spatial and Temporal Video Localization", which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.
Signed
Date June 17 2021
For Mama and Papa
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research skills, including how to write high-quality research papers, which have helped me to become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through the useful discussions with them, I worked out many difficult problems and am also encouraged to explore the unknown in the future.
Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.
Rui SU
University of Sydney
June 17, 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv
1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11
2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102
6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104
Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33
32 Overview of two modules in our temporal boundary re-finement framework (a) the Initial Segment GenerationModule and (b) the Temporal Progressive Cross-streamCooperation (T-PCSC) Module The Initial Segment Gen-eration module combines the outputs from Actionness De-tector and Action Segment Detector to generate the initialsegments for both RGB and flow streams Our T-PCSCframework takes these initial segments as the segment pro-posals and iteratively improves the temporal action de-tection results by leveraging complementary informationof two streams at both feature and segment proposal lev-els (two stages are used as an example for better illustra-tion) The segment proposals and features from the flowstream help improve action segment detection results forthe RGB stream at Stage 1 while the segment proposalsand features from the RGB stream help improve action de-tection results for the flow stream at Stage 2 Each stagecomprises of the cooperation module and the detectionhead The segment proposal level cooperation module re-fines the segment proposals P i
t by combining the most re-cent segment proposals from the two streams while thesegment feature level cooperation module improves thefeatures Fi
t where the superscript i isin RGB Flow de-notes the RGBFlow stream and the subscript t denotesthe stage number The detection head is used to estimatethe action segment location and the actionness probability 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and Stage 3 (resp. the flow-stream AFC modules at Stage 2 and Stage 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset
4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At Stage 1, the input object tube is generated by our SVG-net.
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages
4.1 Comparison (mAP, %) on the THUMOS14 dataset using different tIoU thresholds
4.2 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used
4.3 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied
4.4 Comparison (mAPs, %) on the UCF-101-24 dataset using different st-IoU thresholds
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset
5.1 Results of different methods on the VidSTG dataset. Marked results are quoted from the original work
5.2 Results of different methods on the HC-STVG dataset. Marked results are quoted from the original work
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the content of videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted increasing attention from both academia and industry in recent decades.
In this thesis we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects in the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatio-temporal Action Localization
Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video while the output is the corresponding action tube (i.e., a set of bounding boxes). Due to its wide spectrum of applications, this task has attracted increasing research interest. Cluttered backgrounds, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level for recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either the appearance or the motion clue alone may succeed or fail to detect region proposals in different scenarios; however, the two clues can help each other by providing region proposals to each other.
For example, region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as region proposals to improve the action detection results based on the appearance information, and vice versa. This underpins our work in this chapter: we fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement of the existing temporal refinement methods, which is our second motivation.
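As a concrete illustration, such data association can be sketched as a greedy per-frame selection that trades off each detection's class score against its spatial overlap (IoU) with the previously linked box. This is only a simplified sketch of the general strategy used in the cited works; the function names and the weighting of the two terms are our own illustrative choices:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def link_tube(frame_detections, overlap_weight=1.0):
    """Greedily link per-frame detections into one action tube.

    frame_detections: list over frames; each entry is a list of
    (box, class_score) candidates for that frame.
    """
    # start the tube from the highest-scoring detection in the first frame
    tube = [max(frame_detections[0], key=lambda d: d[1])]
    for candidates in frame_detections[1:]:
        prev_box = tube[-1][0]
        # linking score = class score + weighted spatial overlap with the
        # previously linked box (the weighting varies across methods)
        tube.append(max(candidates,
                        key=lambda d: d[1] + overlap_weight * iou(prev_box, d[0])))
    return tube
```

The sketch also makes the weakness discussed above visible: the link score says nothing about where the action starts or ends in time, which is exactly why a separate temporal refinement step is needed.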
Based on the first motivation, in this chapter we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
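To give a rough idea of what feature-level message passing looks like, the sketch below adds a learned (here random) linear transform of the sender stream's features to the receiver stream's features in a residual fashion. The transform, the ReLU nonlinearity and the residual form are illustrative assumptions, not the exact module proposed in Chapter 3:

```python
import numpy as np

def pass_message(receiver_feat, sender_feat, weight):
    """Improve the receiver stream's features with a learned linear
    transform of the sender stream's features (residual-style fusion).

    receiver_feat, sender_feat: (num_regions, feat_dim) arrays.
    weight: (feat_dim, feat_dim) learned transform (random here).
    """
    message = np.maximum(sender_feat @ weight, 0.0)  # transform + ReLU
    return receiver_feat + message                   # residual combination

# toy example: pass a message from the flow stream to the RGB stream
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((5, 16))
flow_feat = rng.standard_normal((5, 16))
weight = rng.standard_normal((16, 16)) * 0.1
improved_rgb = pass_message(rgb_feat, flow_feat, weight)
```

In the actual framework the transform is trained end-to-end and the roles of sender and receiver alternate between streams across stages.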
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across action boundaries. Our approach works better than an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues, detect accurate temporal boundaries, and further improve spatio-temporal localization results at the video level.
Our contributions are briefly summarized as follows:
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method that learns class-specific actionness detectors and an action segment detector, which applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
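The quality of a predicted temporal segment is commonly measured by its temporal IoU (tIoU) with a ground-truth segment, the same thresholded criterion used in the experimental comparisons reported in this thesis. A minimal sketch:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two 1D segments given as (start, end) times."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    # union = sum of lengths minus their intersection
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction is typically counted as correct only when its tIoU with a ground-truth instance of the same class exceeds a chosen threshold (e.g., 0.5).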
Similar to object detection methods like [58], prior methods [9, 15, 85, 5] have handled this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
For the first issue, as shown in Figure 1.2, we observe that proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.
Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detects accurate boundaries but suffers from low recall rates. Prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information about the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes can be considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
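The two granularities can be caricatured in a few lines of code: an anchor-based method refines pre-defined (start, end) anchors with regressed offsets, while a frame-based method groups consecutive frames whose actionness scores exceed a threshold. Both functions below are simplified illustrations under our own conventions (a center offset plus a length scale for the regression; plain thresholding for actionness), not the exact schemes of the cited methods:

```python
def anchor_proposals(anchors, offsets):
    """Anchor-based: shift/scale pre-defined (start, end) anchors by
    regressed (center_offset, length_scale) pairs."""
    proposals = []
    for (s, e), (dc, dl) in zip(anchors, offsets):
        center = (s + e) / 2.0 + dc      # regressed center shift
        length = (e - s) * dl            # regressed length scale
        proposals.append((center - length / 2.0, center + length / 2.0))
    return proposals

def frame_proposals(actionness, threshold=0.5):
    """Frame-based: group consecutive frames whose actionness score
    exceeds the threshold into segment proposals."""
    proposals, start = [], None
    for t, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = t                    # a segment begins
        elif score < threshold and start is not None:
            proposals.append((start, t)) # a segment ends
            start = None
    if start is not None:                # close a segment running to the end
        proposals.append((start, len(actionness)))
    return proposals
```

The caricature also explains the complementary failure modes: the anchor route inherits the anchor granularity (coarse boundaries, high recall), while the frame route depends on every in-segment frame scoring above threshold (sharp boundaries, but a single low-scoring frame splits or drops a proposal).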
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which helps the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module, which improves the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also updates the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream and flow-stream AFC modules are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance relies heavily on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because these approaches regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair consisting of a video clip and a textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both the SVG-net and the TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].
Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47] named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input features. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes of reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame-based (resp. anchor-based) branch as the input for the anchor-based (resp. frame-based) branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer that learns cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize the related work on the spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named the Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters
(weights) used in the convolutional neural network can easily cause the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to alleviate the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition (ILSVRC-2012) at that time, which was a huge improvement over the second best (26.2%). This work sparked research interest in using deep learning methods to solve computer vision problems.
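The two ingredients above, the ReLU activation and the gradient-descent update, can be illustrated with a minimal sketch on a single toy weight (not AlexNet itself; the loss, learning rate, and data here are made up for illustration):

```python
# ReLU and its derivative: the gradient is exactly 1 for positive inputs,
# which avoids the vanishing gradients caused by the sigmoid's flat tails.
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# One gradient-descent step on a single weight w for the toy squared loss
# L(w) = (relu(w * x) - y)^2, with the gradient derived by the chain rule.
def sgd_step(w, x, y, lr=0.1):
    pre = w * x                                  # pre-activation
    pred = relu(pre)                             # activation
    grad = 2.0 * (pred - y) * relu_grad(pre) * x # dL/dw
    return w - lr * grad

# Repeated updates drive the loss toward zero (w -> 2 for x = 1, y = 2).
w = 0.5
for _ in range(50):
    w = sgd_step(w, x=1.0, y=2.0)
print(round(w, 3))
```

Real training replaces this scalar update with the same rule applied to every convolutional weight, with gradients computed by back-propagation.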
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
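The receptive-field equivalence and the parameter saving can be checked with a small sketch (stride-1 convolutions assumed; the channel count 256 is an arbitrary example, and biases are ignored):

```python
def stacked_receptive_field(kernel_sizes):
    """Effective receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_params(kernel_sizes, channels):
    """Weight count for a stack of square conv layers, each mapping
    `channels` input channels to `channels` output channels."""
    return sum(k * k * channels * channels for k in kernel_sizes)

# Two 3x3 layers match one 5x5; three 3x3 layers match one 7x7.
assert stacked_receptive_field([3, 3]) == stacked_receptive_field([5])
assert stacked_receptive_field([3, 3, 3]) == stacked_receptive_field([7])

# With 256 channels, three 3x3 layers use far fewer weights than one 7x7.
print(conv_params([3, 3, 3], 256), conv_params([7], 256))
```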
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional
neural network is built by stacking such inception modules to achieve high image classification performance.
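The key structural idea, parallel branches whose outputs are concatenated along the channel axis, can be sketched as follows. This is a toy stand-in: each "branch" is just a random per-channel linear map that changes the channel count while preserving the spatial size, which is what same-padded 1×1/3×3/5×5 branches do in a real inception module; the branch channel counts are arbitrary:

```python
import numpy as np

def branch(x, out_channels, rng):
    """Toy branch: a channel-mixing linear map, spatial size unchanged."""
    c = x.shape[0]
    weights = rng.standard_normal((out_channels, c))
    return np.einsum('oc,chw->ohw', weights, x)

def inception_module(x, branch_channels=(64, 128, 32), seed=0):
    rng = np.random.default_rng(seed)
    outputs = [branch(x, oc, rng) for oc in branch_channels]
    # All branches share the spatial size, so their outputs concatenate
    # along the channel axis to form the module output.
    return np.concatenate(outputs, axis=0)

x = np.zeros((16, 8, 8))          # (channels, height, width)
y = inception_module(x)
print(y.shape)                    # channels = 64 + 128 + 32, spatial size kept
```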
VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients can accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
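The shortcut connection itself is a one-line idea: the block outputs F(x) + x, so gradients also flow through the identity path unchanged. A minimal sketch (the transform here is an arbitrary placeholder, not a real learned residual branch):

```python
import numpy as np

def residual_block(x, transform):
    """Shortcut connection: output = F(x) + x.
    During back-propagation the identity path passes gradients through
    unchanged, so deep stacks of such blocks remain trainable."""
    return transform(x) + x

# If the learned transform is near zero, the block reduces to the identity,
# which makes adding more blocks harmless at worst.
x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, transform=lambda v: 0.1 * v)
print(y)
```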
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, with each dense block having shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
2.1.2 Object Detection
The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head, to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to a high computational cost and long training and inference times.
To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed only the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference times were largely reduced.
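The ROI pooling operation can be sketched in a few lines: crop the proposal's region out of the shared feature map, divide it into a fixed grid of bins, and max-pool each bin, so proposals of different shapes yield same-sized features. This simplified version assumes a single-channel feature map and integer ROI coordinates already projected onto the feature map:

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Pool the region `roi = (x0, y0, x1, y1)` of a 2D feature map
    into a fixed (out_size x out_size) grid via per-bin max pooling."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    pooled = np.zeros((out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)  # bin row edges
    xs = np.linspace(0, w, out_size + 1).astype(int)  # bin column edges
    for i in range(out_size):
        for j in range(out_size):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_max_pool(fmap, roi=(1, 1, 5, 5))
print(pooled)
```

A real implementation pools every channel of the feature map the same way and handles sub-pixel proposal coordinates.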
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined
anchors was used at each location of the generated feature map to generate region proposals with a separate network. The final detection results were obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
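Pre-defined anchor generation can be sketched as follows: every feature-map location is projected back onto the image (via the downsampling stride), and boxes of several scales and aspect ratios are centered there. The stride, scales, and ratios below are illustrative values, not the exact Faster RCNN configuration:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """For each feature-map cell, emit one (x0, y0, x1, y1) box per
    (scale, ratio) pair, centered on the cell's image-space position.
    Each box has area scale**2 and width/height ratio `r`."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

anchors = generate_anchors(4, 4)
print(anchors.shape)   # 4 * 4 locations x 2 scales x 3 ratios = 96 anchors
```

The RPN then scores each anchor as object/background and regresses offsets from the anchor to the nearby ground-truth box.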
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and a detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicted the class probabilities and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require region feature extraction through the time-consuming ROI pooling operation.
Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which instead predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
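Decoding such anchor-free predictions into boxes can be sketched as follows: keep heat-map points that are local maxima above a threshold, and read the predicted size at each kept point. This is a simplified CenterNet-style decoder on toy arrays; class channels and center-offset regression are omitted, and the local-maximum check stands in for the usual 3×3 max-pooling suppression:

```python
import numpy as np

def decode_centers(heatmap, sizes, threshold=0.5):
    """Turn a center-point heat map plus per-point (h, w) size predictions
    into (x0, y0, x1, y1, score) boxes at thresholded local maxima."""
    boxes = []
    h, w = heatmap.shape
    for y in range(h):
        for x in range(w):
            score = heatmap[y, x]
            if score < threshold:
                continue
            # keep only local maxima within a 3x3 neighbourhood
            window = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if score < window.max():
                continue
            bh, bw = sizes[y, x]
            boxes.append((x - bw / 2, y - bh / 2,
                          x + bw / 2, y + bh / 2, score))
    return boxes

heatmap = np.zeros((5, 5))
heatmap[2, 3] = 0.9                    # one confident center at (x=3, y=2)
sizes = np.full((5, 5, 2), 2.0)        # predicted height = width = 2 everywhere
boxes = decode_centers(heatmap, sizes)
print(boxes)
```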
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and
randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to consider the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit the action recognition performance. Besides, using a pre-trained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
2D convolutional neural networks have proven effective on the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared with the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics, and demonstrated that doing so can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
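The inflation operation itself is simple to sketch: a 2D k×k kernel is repeated along a new temporal axis of depth T and divided by T, so on a temporally constant input the inflated 3D filter initially produces the same response as the original 2D filter. A minimal single-kernel sketch (real I3D inflates all kernels of all layers):

```python
import numpy as np

def inflate_kernel(kernel_2d, t=3):
    """Inflate a 2D conv kernel to 3D: repeat it t times along a new
    temporal axis and divide by t, preserving the response on a
    temporally constant ("boring video") input."""
    return np.repeat(kernel_2d[None, :, :], t, axis=0) / t

k2d = np.ones((3, 3))
k3d = inflate_kernel(k2d, t=3)
print(k3d.shape)             # (3, 3, 3): temporal depth added
print(k3d.sum(), k2d.sum())  # equal sums: responses match on static input
```

This is what lets I3D start from 2D weights pre-trained on ImageNet rather than from scratch.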
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame, and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be exploited. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then, the appearance and motion scores were combined to leverage the complementary information from these
two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to remove the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
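The core of such linking schemes, greedily extending a tube with the detection in the next frame that best trades off class score against spatial overlap with the tube's last box, can be sketched as follows. This toy version handles a single tube and a single class, and the score-plus-IoU objective is one common simplification, not the exact algorithm of any cited paper:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tube(detections_per_frame):
    """detections_per_frame: list of frames, each a list of (box, score).
    Greedily build one action tube frame by frame."""
    tube = [max(detections_per_frame[0], key=lambda d: d[1])]
    for frame in detections_per_frame[1:]:
        prev_box = tube[-1][0]
        # pick the detection maximizing class score + overlap with the tube
        tube.append(max(frame, key=lambda d: d[1] + iou(prev_box, d[0])))
    return tube

frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.3)],
    [((1, 1, 11, 11), 0.6), ((50, 50, 60, 60), 0.8)],
]
tube = link_tube(frames)
print([d[0] for d in tube])   # follows the overlapping detection
```

Note how the second frame's spatially distant detection is rejected despite its higher score, because it has zero overlap with the tube so far.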
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes to form a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form a representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than action tubes linked from frame-level detections, because they were produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions of each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] formulated proposal generation as a binary classification task by using sliding windows and C3D [77] to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. Then, they used the trained classification model to initialize another C3D model as a localization model, which took the proposals and their overlap scores as input, and proposed a new loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting and ending times can be selected, which were then used to determine the temporal boundaries of actions for proposal generation.
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce
the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared with other state-of-the-art methods.
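The grouping step shared by this family of frame-based methods, thresholding per-snippet actionness scores and merging consecutive high-scoring snippets into segment proposals, can be sketched as follows (a bare threshold-and-group pass; real methods such as [103] add tolerance to short score dips and multi-threshold grouping):

```python
def group_proposals(actionness, threshold=0.5):
    """Group consecutive snippets with actionness >= threshold into
    half-open temporal proposals [start, end)."""
    proposals, start = [], None
    for t, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = t                       # a segment begins
        elif score < threshold and start is not None:
            proposals.append((start, t))    # the segment ends before t
            start = None
    if start is not None:                   # segment runs to the video end
        proposals.append((start, len(actionness)))
    return proposals

scores = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.6, 0.1]
print(group_proposals(scores))   # two proposals: snippets 1-3 and 5-6
```

The paragraph below points out the weakness visible even in this sketch: a single wrongly low score inside a true action instance splits or truncates the proposal, hurting recall.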
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations and predict the action classes of the anchors.
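The anchor-based paradigm can be sketched in two steps: lay out multi-scale temporal segment anchors over the video, then match them to ground-truth segments by temporal IoU (the matched anchors would subsequently be refined by boundary regressors and classified). The scales, stride, and video length below are toy values:

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_anchors(video_len, scales=(2, 4, 8), stride=1):
    """Centered anchors of several temporal scales, clipped to the video."""
    return [(max(0, c - s / 2), min(video_len, c + s / 2))
            for s in scales for c in range(0, video_len, stride)]

anchors = temporal_anchors(16)
gt = (4.0, 9.0)                                     # ground-truth segment
best = max(anchors, key=lambda a: temporal_iou(a, gt))
print(best, round(temporal_iou(best, gt), 2))
```

Because the anchor grid covers the whole video at several scales, most ground-truth instances get a reasonably overlapping anchor (high recall), but, as discussed below, the fixed grid limits how precise the raw boundaries can be before regression.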
Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detector (SSD) [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the
prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict the offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated by pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision and language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes with a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects, while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector, in this case to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link the proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals of these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related work on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then, we review the one-stage, two-stage, and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related work on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision and language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between the two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train the other stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to the other stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatio-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification, and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].
Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames around one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.
Third, temporal localization aims to form action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, using cues such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding window [54], dynamic programming [67, 60], tubelet linking [26], and thresholding-based refinement [8].
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late fusion strategy: they are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
[Figure 3.1 appears here; only its caption is recoverable.]

Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundary is further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
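The linking step itself follows [67]. As a rough illustration only, a greedy linker that chains per-frame boxes by class score plus overlap with the previous box might look like the sketch below; the exact scoring rule of [67] differs, and `link_frames` is a hypothetical name.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_frames(dets_per_frame):
    """Greedily link per-frame detections into one tube.

    dets_per_frame: list over frames; each entry is a list of
    (box, score) pairs for one action class.  At each step we pick
    the detection maximizing score + IoU with the previous box,
    which rewards both confidence and spatial continuity.
    """
    tube, prev = [], None
    for dets in dets_per_frame:
        if not dets:
            continue
        if prev is None:
            box, score = max(dets, key=lambda d: d[1])
        else:
            box, score = max(dets, key=lambda d: d[1] + iou(prev, d[0]))
        tube.append((box, score))
        prev = box
    return tube
```

The continuity term keeps the tube from jumping between distant actors even when a far-away box has a slightly higher score.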
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.
In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.
The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).
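Greedy score-sorted non-maximum suppression underlies this filtering step. A standard sketch (not the thesis implementation) is shown below; note that a lower `iou_thresh` suppresses more boxes, which is how the borrowed cross-stream boxes are filtered more aggressively.

```python
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array in (x1, y1, x2, y2); scores: (N,).
    Returns indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the current top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-8)
        order = order[1:][iou <= iou_thresh]  # drop overlapping boxes
    return keep
```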
Mathematically, let $P_t^{(i)}$ and $B_t^{(i)}$ denote the set of region proposals and the set of detection boxes respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P_t^{(i)}$ is updated as $P_t^{(i)} = B_{t-2}^{(i)} \cup B_{t-1}^{(j)}$, and the detection box set $B_t^{(j)}$ is updated as $B_t^{(j)} = G(P_t^{(j)})$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P_0^{(i)}$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
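The alternating proposal-set update can be sketched with plain Python sets standing in for boxes; `detect` is a hypothetical stand-in for the detection head $G(\cdot)$, and the stage-0 sets are taken as each stream's RPN proposals.

```python
def pcsc_proposals(rgb0, flow0, detect, num_stages=4):
    """Sketch of the proposal-level cooperation
    P_t^{(i)} = B_{t-2}^{(i)} U B_{t-1}^{(j)}, so that Stage 1
    uses RGB_0 U Flow_0, Stage 2 uses Flow_0 U RGB_1, and so on.

    rgb0 / flow0: initial RPN proposals (any hashable box ids).
    detect: stand-in for the detection-head mapping G(.).
    Odd stages refine the RGB stream, even stages the flow stream.
    """
    boxes = {('rgb', 0): set(rgb0), ('flow', 0): set(flow0)}
    for t in range(1, num_stages + 1):
        stream, other = ('rgb', 'flow') if t % 2 else ('flow', 'rgb')
        # fall back to the stage-0 sets when an earlier stage is absent
        same_prev = boxes.get((stream, t - 2), boxes[(stream, 0)])
        other_prev = boxes.get((other, t - 1), boxes[(other, 0)])
        proposals = same_prev | other_prev          # cross-stream union
        boxes[(stream, t)] = set(detect(proposals))  # B_t = G(P_t)
    return boxes
```

With `detect` as the identity, the sketch just traces how the proposal sets grow by absorbing the other stream's boxes stage by stage.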
With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from another stream, which helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of testing data at the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples at the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f, and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C_2^i, C_3^i, C_4^i, C_5^i$, where $i \in \{RGB, Flow\}$ indicates the RGB and the flow streams respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U_2^i, U_3^i, U_4^i, U_5^i$.
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which are
insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other for feature refinement.
We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote $l \in \{2, 3, 4, 5\}$ as the index for the set of feature maps $\{C_2^i, C_3^i, C_4^i, C_5^i\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C_l^{RGB}$ with the help of the features $C_l^{Flow}$ as follows:

$$\hat{C}_l^{RGB} = f_\theta(C_l^{Flow}) \oplus C_l^{RGB} \qquad (3.1)$$

where $\oplus$ denotes element-wise addition of the feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C_l^{Flow})$ nonlinearly extracts the message from the feature $C_l^{Flow}$ and then uses the extracted message to improve the feature $C_l^{RGB}$.
The output of $f_\theta(\cdot)$ has to produce feature maps with the same number of channels and resolution as $C_l^{RGB}$ and $C_l^{Flow}$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design saves the number of parameters to be learnt in the module and exchanges messages by using a two-layer non-linear transform. Once the feature maps $C_l^{RGB}$ in the bottom-up pathway are refined, the corresponding feature maps $U_l^{RGB}$ in the top-down pathway are generated accordingly.
The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.
Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the ROI features $F_t^{RGB}$ and $F_t^{Flow}$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F_t^{Flow}$ of the flow stream is used to help improve the ROI feature $F_t^{RGB}$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}_t^{RGB}$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F_t^{RGB}$ of the RGB stream is used to help improve the ROI feature $F_t^{Flow}$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
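The IoU-based label assignment rule can be sketched directly; `box_iou` and `assign_labels` are illustrative helper names, not the thesis implementation.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-8)

def assign_labels(proposals, gt_boxes, thresh=0.5):
    """Label each proposal 1 (positive) if its IoU with any
    ground-truth box exceeds thresh, else 0 (negative).  The same
    rule applies to boxes borrowed from the assistant stream."""
    return [1 if max((box_iou(p, g) for g in gt_boxes), default=0.0) > thresh
            else 0
            for p in proposals]
```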
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
[Figure 3.2 appears here; only its caption is recoverable.]

Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P_t^i$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F_t^i$, where the superscript $i \in \{RGB, Flow\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the $7 \times 7$ feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., whether an action is happening) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These "confusing samples" are therefore useful for the actionness classifier to learn the subtle but critical features which better determine the beginning and the end of an action. The output of the actionness classifier is the probability of actionness belonging to class i.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from the action tube, and we can thus refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the
testing stage.
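The smoothing-and-trimming step can be sketched as follows; the window size and threshold are illustrative choices, and `refine_tube` is a hypothetical name.

```python
import numpy as np

def refine_tube(actionness, win=5, thresh=0.5):
    """Median-smooth per-frame actionness scores and drop frames
    whose smoothed score falls below thresh, refining the tube's
    temporal boundaries as described above.

    actionness: per-frame scores of one tube.
    Returns the indices of the frames kept in the tube.
    """
    scores = np.asarray(actionness, dtype=float)
    half = win // 2
    padded = np.pad(scores, half, mode='edge')   # replicate edge frames
    smoothed = np.array([np.median(padded[i:i + win])
                         for i in range(len(scores))])
    return [i for i, s in enumerate(smoothed) if s >= thresh]
```

The median filter suppresses isolated low (or high) scores, so a single noisy frame inside a confident run does not split the tube.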
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as input and follows the Faster-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation, and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.
(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
[Figure 3.3 appears here; only its caption is recoverable.]

Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context with temporal span s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
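The multi-scale anchor layout and the context requirement can be sketched as below; the centre-aligned placement and clipping are illustrative assumptions, and `temporal_anchors` is a hypothetical name.

```python
def temporal_anchors(num_positions, scales):
    """Generate centre-aligned temporal anchors (start, end, scale)
    at every temporal position for each scale, clipped to the tube
    length.  The required receptive field for a scale s is 2*s:
    the anchor itself plus s/2 of context on each side, matching
    the requirement stated above.

    Returns (anchors, fields) where fields maps scale -> required
    receptive-field length.
    """
    anchors, fields = [], {}
    for s in scales:
        fields[s] = 2 * s                      # s + 2 * (s / 2) of context
        for c in range(num_positions):
            start, end = c - s / 2.0, c + s / 2.0
            anchors.append((max(0.0, start),
                            min(float(num_positions), end), s))
    return anchors, fields
```

Because the required receptive field grows with the anchor scale, each of the K sub-networks needs its own dilation rate rather than sharing one set of temporal features.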
(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three
types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with a length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
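A one-dimensional analogue of ROI pooling makes the construction concrete; max-pooling over equal bins is an assumption (the pooling operator is not specified above), and both function names are hypothetical.

```python
import numpy as np

def temporal_roi_pool(feats, t0, t1, out_len):
    """Max-pool features (D, T) over out_len equal temporal bins
    spanning [t0, t1) -- a 1-D analogue of spatial ROI pooling."""
    D, T = feats.shape
    t0, t1 = max(0, t0), min(T, max(t1, t0 + 1))
    edges = np.linspace(t0, t1, out_len + 1)
    bins = []
    for k in range(out_len):
        a = int(np.floor(edges[k]))
        b = max(int(np.ceil(edges[k + 1])), a + 1)  # non-empty bin
        bins.append(feats[:, a:b].max(axis=1))
    return np.stack(bins, axis=1)                   # (D, out_len)

def segment_features(feats, t_start, t_end, out_len=4):
    """Concatenate F_start, F_center, F_end as described above:
    the centre segment plus two context windows of length 2*ls/5
    centred at the start and end points."""
    ls = t_end - t_start
    ctx = 2.0 * ls / 5.0
    f_center = temporal_roi_pool(feats, t_start, t_end, out_len)
    f_start = temporal_roi_pool(feats, int(t_start - ctx / 2),
                                int(t_start + ctx / 2), out_len)
    f_end = temporal_roi_pool(feats, int(t_end - ctx / 2),
                              int(t_end + ctx / 2), out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)
```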
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals resulting from the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that predicting the actionness for each frame helps extract discriminative features, which eventually helps refine the boundaries of the segments.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Fig. 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
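The two-stage procedure above can be summarized in the following pseudocode-style sketch, where `build_features`, `detect` and `nms` are caller-supplied stand-ins for the segment feature construction with cross-stream message passing, the detection head, and non-maximum suppression; none of these names come from the thesis.

```python
def t_pcsc_two_stage(rgb_segments, flow_segments, build_features, detect, nms,
                     iou_thr=0.5):
    """Two-stage T-PCSC sketch over initial per-stream segments."""
    # Stage 1: proposals from both streams; flow features refine RGB features.
    proposals = rgb_segments + flow_segments
    rgb_feats = build_features(proposals, src="flow", dst="rgb")
    stage1_out = detect(rgb_feats, proposals)
    # Stage 2: flow proposals plus stage-1 outputs; RGB features refine flow features.
    proposals = flow_segments + stage1_out
    flow_feats = build_features(proposals, src="rgb", dst="flow")
    stage2_out = detect(flow_feats, proposals)
    # Combine the per-stage outputs and remove redundancy with NMS.
    return nms(stage1_out + stage2_out, iou_thr)
```

With identity stubs for the three callables, the function simply merges and deduplicates the two streams' segments, which makes the data flow of the two stages easy to trace.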
3.4 Experimental results
We introduce our experimental setup and datasets in Section 3.4.1, compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes, which is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
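The two evaluation protocols can be made concrete as follows; `box_iou` and `coco_style_map` are our names for illustration, and the AP computation itself is abstracted into a callable.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# The 10 IoU thresholds used by the COCO metric: 0.5, 0.55, ..., 0.95.
COCO_THRESHOLDS = np.linspace(0.5, 0.95, 10)

def coco_style_map(ap_at_threshold):
    """Average the mAP over the 10 IoU thresholds, as in the COCO metric [40].
    `ap_at_threshold` maps an IoU threshold to an (m)AP value."""
    return float(np.mean([ap_at_threshold(t) for t in COCO_THRESHOLDS]))
```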
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these averaged IoU scores over all classes.
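A direct reading of this definition, with `best_ious_per_class` as an assumed input layout (per class, the best IoU achieved for each ground-truth box):

```python
import numpy as np

def mabo(best_ious_per_class):
    """Mean Average Best Overlap: average per class the best IoU of each
    ground-truth box, then average those per-class values over all classes."""
    per_class_abo = [float(np.mean(ious)) for ious in best_ious_per_class.values()]
    return float(np.mean(per_class_abo))
```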
To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is divided by 10 at the 5th epoch and again at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞], i.e., we have M = 4, and the corresponding feature lengths lt for the length ranges are set to 15, 30, 60, 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is divided by 10 at the 4th epoch and again at the 5th epoch.
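The actionness smoothing and thresholding step of the actionness detector can be sketched as follows, using the window size 17 and the threshold 0.01 stated above; the run-extraction logic and function names are our illustrative reconstruction, not the thesis code.

```python
import numpy as np

def median_smooth(scores, window=17):
    """Median-filter the per-frame actionness scores (window size 17 here)."""
    half = window // 2
    padded = np.pad(scores, half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(scores))])

def extract_segments(scores, threshold=0.01):
    """Return (start, end) frame index pairs of runs whose smoothed actionness
    score exceeds the predefined threshold (0.01 here)."""
    active = median_smooth(scores) > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments
```

For a score sequence that is high only on frames 20-44, the median filter trims the run slightly at both ends before the threshold turns it into a single (start, end) segment.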
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules, while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.
Method | Streams | mAP (%)
Weinzaepfel, Harchaoui and Schmid [83] | RGB+Flow | 35.8
Peng and Schmid [54] | RGB+Flow | 65.7
Ye, Yang and Tian [94] | RGB+Flow | 67.0
Kalogeiton et al. [26] | RGB+Flow | 69.5
Gu et al. [19] | RGB+Flow | 76.3
Faster R-CNN + FPN (RGB) [38] | RGB | 73.2
Faster R-CNN + FPN (Flow) [38] | Flow | 64.7
Faster R-CNN + FPN [38] | RGB+Flow | 75.5
PCSC (Ours) | RGB+Flow | 79.2
Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here, "AD" stands for actionness detector and "ASD" for action segment detector.
Method | Streams | δ=0.2 | δ=0.5 | δ=0.75 | 0.5:0.95
Weinzaepfel et al. [83] | RGB+Flow | 46.8 | - | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 47.6 | - | - | -
Peng and Schmid [54] | RGB+Flow | 73.5 | 32.1 | 0.27 | 0.73
Saha et al. [60] | RGB+Flow | 66.6 | 36.4 | 0.8 | 14.4
Yang, Gao and Nevatia [93] | RGB+Flow | 73.5 | 37.8 | - | -
Singh et al. [67] | RGB+Flow | 73.5 | 46.3 | 15.0 | 20.4
Ye, Yang and Tian [94] | RGB+Flow | 76.2 | - | - | -
Kalogeiton et al. [26] | RGB+Flow | 76.5 | 49.2 | 19.7 | 23.4
Li et al. [31] | RGB+Flow | 77.9 | - | - | -
Gu et al. [19] | RGB+Flow | - | 59.9 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 77.5 | 51.3 | 14.2 | 21.9
Faster R-CNN + FPN (Flow) [38] | Flow | 72.7 | 44.2 | 7.4 | 17.2
Faster R-CNN + FPN [38] | RGB+Flow | 80.1 | 53.2 | 15.9 | 23.7
PCSC + AD [69] | RGB+Flow | 84.3 | 61.0 | 23.0 | 27.8
PCSC + ASD + AD + T-PCSC (Ours) | RGB+Flow | 84.7 | 63.1 | 23.6 | 28.9
frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary
work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.
Method | Streams | mAP (%)
Peng and Schmid [54] | RGB+Flow | 58.5
Ye, Yang and Tian [94] | RGB+Flow | 63.2
Kalogeiton et al. [26] | RGB+Flow | 65.7
Hou, Chen and Shah [21] | RGB | 61.3
Gu et al. [19] | RGB+Flow | 73.3
Sun et al. [72] | RGB+Flow | 77.9
Faster R-CNN + FPN (RGB) [38] | RGB | 64.7
Faster R-CNN + FPN (Flow) [38] | Flow | 68.4
Faster R-CNN + FPN [38] | RGB+Flow | 70.2
PCSC (Ours) | RGB+Flow | 74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.
Method | Streams | δ=0.2 | δ=0.5 | δ=0.75 | 0.5:0.95
Gkioxari and Malik [18] | RGB+Flow | - | 53.3 | - | -
Wang et al. [79] | RGB+Flow | - | 56.4 | - | -
Weinzaepfel et al. [83] | RGB+Flow | 63.1 | 60.7 | - | -
Saha et al. [60] | RGB+Flow | 72.6 | 71.5 | 43.3 | 40.0
Peng and Schmid [54] | RGB+Flow | 74.1 | 73.1 | - | -
Singh et al. [67] | RGB+Flow | 73.8 | 72.0 | 44.5 | 41.6
Kalogeiton et al. [26] | RGB+Flow | 74.2 | 73.7 | 52.1 | 44.8
Ye, Yang and Tian [94] | RGB+Flow | 75.8 | 73.8 | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 78.2 | 73.4 | - | -
Hou, Chen and Shah [21] | RGB | 78.4 | 76.9 | - | -
Gu et al. [19] | RGB+Flow | - | 78.6 | - | -
Sun et al. [72] | RGB+Flow | - | 80.1 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 74.5 | 73.9 | 54.1 | 44.2
Faster R-CNN + FPN (Flow) [38] | Flow | 77.9 | 76.1 | 56.1 | 46.3
Faster R-CNN + FPN [38] | RGB+Flow | 79.1 | 78.5 | 57.2 | 47.6
PCSC (Ours) | RGB+Flow | 82.6 | 82.2 | 63.1 | 52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best and is only outperformed by a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
Stage | PCSC w/o feature cooperation | PCSC
0 | 75.5 | 78.1
1 | 76.1 | 78.6
2 | 76.4 | 78.8
3 | 76.7 | 79.1
4 | 76.7 | 79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here, "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.
Method | Video mAP (%)
Faster R-CNN + FPN [38] | 53.2
PCSC | 56.4
PCSC + AD | 61.0
PCSC + ASD (single branch) | 58.4
PCSC + ASD | 60.7
PCSC + ASD + AD | 61.9
PCSC + ASD + AD + T-PCSC | 63.1
Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
Stage | Video mAP (%)
0 | 61.9
1 | 62.8
2 | 63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method with or without the feature-level cooperation module improves as the number of stages increases. However, this improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
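The stage-combination rule can be sketched as follows; the greedy `nms` helper and the `stage_output` name are illustrative, and detections are assumed to be (box, score) pairs with boxes given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(detections, iou_thr=0.5):
    """Greedy NMS over (box, score) pairs, highest score first."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        remaining = [d for d in remaining if iou(d[0], best[0]) <= iou_thr]
    return keep

def stage_output(per_stage_detections, k, iou_thr=0.5):
    """Output at Stage k: NMS over the union of detections from Stages 0..k."""
    union = [d for stage in per_stage_detections[:k + 1] for d in stage]
    return nms(union, iou_thr)
```

With this rule, a high-scoring Stage-1 box that strongly overlaps a Stage-0 box suppresses the earlier, lower-scoring detection, which is how later stages refine the combined output.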
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, where video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage of our PCSC action detector method on the UCF-101-24 dataset.
Stage | LoR (%)
1 | 61.1
2 | 71.9
3 | 95.2
4 | 95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC is already able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results could be complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage of our PCSC action detector method on the UCF-101-24 dataset.
Stage | MABO (%)
1 | 84.1
2 | 84.3
3 | 84.5
4 | 84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
θ | Stage 1 | Stage 2 | Stage 3 | Stage 4
0.5 | 55.3 | 55.7 | 56.3 | 57.0
0.6 | 62.1 | 63.2 | 64.3 | 65.5
0.7 | 68.0 | 68.5 | 69.4 | 69.5
0.8 | 73.9 | 74.5 | 74.8 | 75.1
0.9 | 78.4 | 78.4 | 79.5 | 80.1
Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
Stage | 0 | 1 | 2 | 3 | 4
Detection speed (FPS) | 3.1 | 2.9 | 2.8 | 2.6 | 2.4
Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
Stage | 0 | 1 | 2
Detection speed (VPS) | 2.1 | 1.7 | 1.5
ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream, we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
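The LoR statistic can be computed as follows, assuming proposals are given as (x1, y1, x2, y2) boxes; the function name is ours for illustration.

```python
def large_overlap_ratio(rgb_proposals, flow_proposals, overlap=0.7):
    """LoR: fraction of RGB-stream proposals whose maximum IoU (mIoU) over all
    flow-stream proposals exceeds `overlap` (0.7 in this analysis)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union
    hits = sum(1 for r in rgb_proposals
               if max(iou(r, f) for f in flow_proposals) > overlap)
    return hits / len(rgb_proposals)
```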
To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, improvement after each stage can be clearly observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.
[Figure 3.4 appears here.]
Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and the temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the optimal number of stages.
[Figure 3.5 appears here.]
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b)-(d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,
and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).
Moreover, the effectiveness of our temporal refinement method is also demonstrated in Fig. 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame, although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases that our method cannot handle (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., the two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections from the background frames is still an open issue,
which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB or flow) by leveraging the information from the other stream at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, while the temporal action localization performance is gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and are both discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams to exchange complementary information from both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model the spatio-temporal information of videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, many temporal action localization methods simply employ the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames, and then temporally linked the detected boxes to form action tubes. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate the frame-wise or snippet-wise actionness or to predict the action starting or ending time, which were then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first class, the segments generated based on the frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
However, the works in either line cannot take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of the anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level
interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine the complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at both the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach
In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{is}, t_{ie})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{is}$ and $t_{ie}$ indicate the starting and ending time of each detected action $y_i$.
The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based and frame-based localization methods. We briefly introduce each component as follows.
Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ denotes the number of channels.
Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage, and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $\mathcal{P}^{RGB} = \{p^{RGB}_i \mid p^{RGB}_i = (t^{RGB}_{is}, t^{RGB}_{ie})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p^{Flow}_i \mid p^{Flow}_i = (t^{Flow}_{is}, t^{Flow}_{ie})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P^{RGB}_0$ and $P^{Flow}_0$ for each stream by applying two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S^{RGB}_0$ and $S^{Flow}_0$ and the predicted action categories.
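As a concrete illustration of how such predefined anchors can be laid out over a video, the following is a minimal NumPy sketch; the scale set and the feature-map length are illustrative assumptions, not values taken from TAL-Net or this chapter.

```python
import numpy as np

def generate_temporal_anchors(T, scales=(4, 8, 16, 32)):
    """Generate multi-scale temporal anchor segments over a feature
    sequence of length T. Each anchor is a (start, end) pair centred at a
    feature position; anchors crossing the video boundary are clipped."""
    anchors = []
    for t in range(T):
        for s in scales:
            start = max(0.0, t + 0.5 - s / 2.0)
            end = min(float(T), t + 0.5 + s / 2.0)
            anchors.append((start, end))
    return np.array(anchors, dtype=np.float32)

anchors = generate_temporal_anchors(100)  # 100 positions x 4 scales
```

The dense multi-scale layout is what allows anchor-based methods to cover most ground-truth instances, at the cost of coarser boundaries.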
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q^{RGB}_0$ and $Q^{Flow}_0$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}^{RGB}_0$ and $\mathcal{Q}^{Flow}_0$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
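The pairing-and-NMS step described above can be sketched as follows; the thresholds and the product-of-probabilities scoring are illustrative assumptions rather than the exact settings used in this chapter.

```python
import numpy as np

def pair_and_nms(start_prob, end_prob, score_thresh=0.5,
                 max_duration=64, iou_thresh=0.7):
    """Pair confident start/end indices into candidate segments with
    valid durations, then remove redundant pairs by greedy temporal NMS."""
    starts = np.where(start_prob > score_thresh)[0]
    ends = np.where(end_prob > score_thresh)[0]
    segs = []
    for s in starts:
        for e in ends:
            if s < e <= s + max_duration:  # keep only valid durations
                segs.append((int(s), int(e), start_prob[s] * end_prob[e]))
    # Greedy NMS: keep the highest-scoring segment, drop heavy overlaps.
    segs.sort(key=lambda x: -x[2])
    kept = []
    for s, e, sc in segs:
        ok = True
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if union > 0 and inter / union > iou_thresh:
                ok = False
                break
        if ok:
            kept.append((s, e, sc))
    return kept
```

The same routine applies to either stream, yielding the proposal sets denoted above.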
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated at the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $S^{RGB}_{n-2}$ and features $P^{RGB}_{n-2}$, $Q^{Flow}_{n-2}$ are denoted as $S^{RGB}_0$, $P^{RGB}_0$ and $Q^{Flow}_0$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-Granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).
To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ and the flow-stream frame-based feature $Q^{Flow}_{n-2}$ as the input features. This AFC stage outputs the RGB-stream action segments $S^{RGB}_n$ and anchor-based feature $P^{RGB}_n$, and also updates the flow-stream frame-based features as $Q^{Flow}_n$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively pass the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages will be faithfully propagated to the subsequent AFC stages; thus, the proposed progressive localization strategy can gradually improve the localization performance.
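The stage-wise bookkeeping described in this subsection can be sketched as follows, with the two AFC module types left as hypothetical stubs; the sketch only illustrates which inputs each stage consumes and which outputs it produces, following the odd/even stage alternation above.

```python
def run_pcg_tal(S, P, Q, afc, num_stages=4):
    """Progressive cooperation: odd stages run the RGB-stream AFC module,
    even stages the flow-stream one. S/P/Q map (stream, stage) to detected
    segments, anchor-based features and frame-based features; stage 0
    holds the initial proposals/features (used for both n-1 and n-2 at n=1)."""
    for n in range(1, num_stages + 1):
        cur, other = ('RGB', 'Flow') if n % 2 == 1 else ('Flow', 'RGB')
        # Each stage consumes segments/features from the two preceding stages.
        inputs = dict(
            seg_other=S[other, max(n - 1, 0)], seg_cur=S[cur, max(n - 2, 0)],
            feat_other=P[other, max(n - 1, 0)], feat_cur=P[cur, max(n - 2, 0)],
            frame_other=Q[other, max(n - 2, 0)],
        )
        S[cur, n], P[cur, n], Q[other, n] = afc[cur](**inputs)
    return S
```

Each stage improves one stream's segments and anchor features while updating the other stream's frame features, so successes propagate to later stages.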
4.2.3 Anchor-Frame Cooperation Module
The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods will not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P^{RGB}_0$, $P^{Flow}_0$ and $Q^{RGB}_0$, $Q^{Flow}_0$, as well as the proposals $S^{RGB}_0$, $S^{Flow}_0$, as the inputs, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P^{Flow}_{n-1}$ to the frame-based features $Q^{Flow}_{n-2}$, both from the flow stream. The message passing operation is defined as

$$Q^{Flow}_n = \mathcal{M}_{anchor \rightarrow frame}(P^{Flow}_{n-1}) \oplus Q^{Flow}_{n-2}, \qquad (4.1)$$

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is implemented as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q^{Flow}_n$ can generate more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can thus generate better frame-based proposals $\mathcal{Q}^{Flow}_n$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q^{RGB}_n = \mathcal{M}_{anchor \rightarrow frame}(P^{RGB}_{n-1}) \oplus Q^{RGB}_{n-2}$, which flips the superscripts from Flow to RGB. $Q^{RGB}_n$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}^{RGB}_n$.
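A minimal NumPy sketch of Eq. (4.1): on a $C \times T$ feature map, a $1 \times 1$ convolution is simply a channel-mixing matrix applied at every time step, so the two stacked $1 \times 1$ convolutions with a ReLU in between reduce to two matrix products. The channel width below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 8, 32  # channels x temporal length (illustrative sizes)

# Two stacked 1x1 convolutions: each is a CxC channel-mixing matrix
# shared across all time steps, with a ReLU in between.
W1 = rng.standard_normal((C, C)) * 0.1
W2 = rng.standard_normal((C, C)) * 0.1

def message(P):
    """M_{anchor->frame}(P): two stacked 1x1 convs with a ReLU in between."""
    return W2 @ np.maximum(W1 @ P, 0.0)

P_flow = rng.standard_normal((C, T))  # anchor-based features, stage n-1
Q_flow = rng.standard_normal((C, T))  # frame-based features, stage n-2

Q_new = message(P_flow) + Q_flow      # element-wise addition (Eq. 4.1)
```

The residual-style addition lets the frame-based features absorb duration-aware anchor information without discarding their own boundary sensitivity.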
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}^{Flow}_n$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1, based on the improved frame-based features $Q^{Flow}_n$. We then combine $\mathcal{Q}^{Flow}_n$ together with the two-stream anchor-based proposals $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ to form an updated set of proposals at stage $n$, i.e., $\mathcal{P}^{RGB}_n = S^{RGB}_{n-2} \cup S^{Flow}_{n-1} \cup \mathcal{Q}^{Flow}_n$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the flow-stream proposals are updated as $\mathcal{P}^{Flow}_n = S^{Flow}_{n-2} \cup S^{RGB}_{n-1} \cup \mathcal{Q}^{RGB}_n$.
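The proposal combination step is a plain set union; a minimal sketch, with proposals represented as (start, end) tuples and only exact duplicates removed (overlapping-but-distinct proposals are kept for the later NMS in the detection head):

```python
def combine_proposals(*proposal_sets):
    """Union of several temporal proposal sets, keeping first-appearance
    order and dropping exact duplicates; overlapping proposals are kept."""
    seen, combined = set(), []
    for proposals in proposal_sets:
        for seg in proposals:
            if seg not in seen:
                seen.add(seg)
                combined.append(seg)
    return combined

P_rgb = combine_proposals([(0, 8), (10, 20)],   # S^RGB_{n-2}
                          [(0, 8), (30, 45)],   # S^Flow_{n-1}
                          [(11, 19)])           # Q^Flow_n
```

Keeping near-duplicate proposals such as (10, 20) and (11, 19) is deliberate: pooling features over slightly different extents gives the detection head more chances to regress an accurate boundary.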
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to Equation (4.1) by a message passing operation, i.e., $P^{RGB}_n = \mathcal{M}_{flow \rightarrow rgb}(P^{Flow}_{n-1}) \oplus P^{RGB}_{n-2}$ for the RGB stream and $P^{Flow}_n = \mathcal{M}_{rgb \rightarrow flow}(P^{RGB}_{n-1}) \oplus P^{Flow}_{n-2}$ for the flow stream.
Detection Head. The detection head aims at refining the segment proposals and predicting the action category of each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}^{RGB}_n$/$\mathcal{P}^{Flow}_n$ and the updated anchor-based features $P^{RGB}_n$/$P^{Flow}_n$. To increase the perceptual aperture at the action boundaries, we add two more SOI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to exploit the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
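A minimal NumPy sketch of the boundary-aware pooling described above; the number of pooling bins and the use of average pooling are illustrative assumptions, while the boundary interval of 10 follows the text.

```python
import numpy as np

def soi_pool(feat, start, end, bins=4):
    """Segment-of-interest pooling: split [start, end) into `bins` equal
    sub-intervals and average-pool the C x T feature map inside each."""
    edges = np.linspace(start, end, bins + 1)
    pooled = [feat[:, int(edges[b]):max(int(edges[b]) + 1,
                                        int(edges[b + 1]))].mean(axis=1)
              for b in range(bins)]
    return np.concatenate(pooled)

def proposal_feature(feat, start, end, interval=10):
    """Pool the segment itself plus two boundary regions around its start
    and end (clipped to the valid temporal range), then concatenate."""
    T = feat.shape[1]
    seg = soi_pool(feat, start, end)
    left = soi_pool(feat, max(0, start - interval), min(T, start + interval))
    right = soi_pool(feat, max(0, end - interval), min(T, end + interval))
    return np.concatenate([seg, left, right])
```

The two extra boundary poolings expose context on both sides of each endpoint, which is what the regressor needs to move an inaccurate boundary.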
4.2.4 Training Details
At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
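The overall objective described above can be sketched as follows; the plain unweighted sum and the single-sample cross-entropy form are simplifying assumptions for illustration.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth-L1 (Huber, beta=1): quadratic near zero, linear elsewhere."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def stage_loss(cls_logits, cls_label, box_pred, box_target,
               start_logits, start_label, end_logits, end_label):
    # Anchor-based branch: classification CE + boundary smooth-L1;
    # frame-based branch: CE on the frame-wise start/end positions.
    return (cross_entropy(cls_logits, cls_label)
            + smooth_l1(box_pred, box_target)
            + cross_entropy(start_logits, start_label)
            + cross_entropy(end_logits, end_label))
```

Summing the stage losses and back-propagating through all AFC modules is what makes the whole framework trainable end-to-end.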
4.2.5 Discussions
Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] operates in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block for fusing the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features from the gradually absorbed cross-stream information and cross-granularity knowledge, and thus significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase as the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce the redundant segments while still preserving the highly confident temporal segments. Nevertheless, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from the complementary sources.
4.3 Experiments
4.3.1 Datasets and Setup
Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotations for the testing set are not publicly available.
- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes, with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and to 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model by the mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
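The tIoU criterion and the ActivityNet-style average-mAP protocol used in the following comparisons can be sketched as:

```python
import numpy as np

def tiou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Average mAP: mean of the mAPs at tIoU thresholds 0.5, 0.55, ..., 0.95.
thresholds = np.arange(0.5, 1.0, 0.05)

def average_map(map_at):
    """map_at: callable returning the mAP at a given tIoU threshold."""
    return float(np.mean([map_at(t) for t in thresholds]))
```

The st-IoU used on UCF-101-24 follows the same idea, with the per-frame spatial IoU additionally accumulated over the tube's temporal extent.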
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL method outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues. When using the tIoU threshold δ = 0.5,
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ   0.1   0.2   0.3   0.4   0.5
SCNN [63]          47.7  43.5  36.3  28.7  19.0
REINFORCE [95]     48.9  44.0  36.0  26.4  17.1
CDC [62]           -     -     40.1  29.4  23.3
SMS [98]           51.0  45.2  36.5  27.8  17.8
TURN [15]          54.0  50.9  44.1  34.9  25.6
R-C3D [85]         54.5  51.5  44.8  35.6  28.9
SSN [103]          66.0  59.4  51.9  41.0  29.8
BSN [37]           -     -     53.5  45.0  36.9
MGG [46]           -     -     53.9  46.8  37.4
BMN [35]           -     -     56.0  47.4  38.8
TAL-Net [5]        59.8  57.1  53.2  48.5  42.8
P-GCN [99]         69.5  67.8  63.6  57.8  49.1
Ours               71.2  68.9  65.1  59.5  51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two schemes (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider the proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.
ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and
Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ   0.5    0.75   0.95   Average
CDC [62]           45.30  26.00  0.20   23.80
TCN [9]            36.44  21.15  3.90   -
R-C3D [85]         26.80  -      -      -
SSN [103]          39.12  23.48  5.49   23.98
TAL-Net [5]        38.23  18.30  1.30   20.22
P-GCN [99]         42.90  28.14  2.47   26.99
Ours               44.31  29.85  5.47   28.85
Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ   0.5    0.75   0.95   Average
BSN [37]           46.45  29.96  8.02   30.03
P-GCN [99]         48.26  33.16  3.27   31.11
BMN [35]           50.07  34.78  8.29   33.85
Ours               52.04  35.92  7.97   34.91
BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, in which case its average mAP (34.91%) is better than those of BSN and BMN.
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75}, as well as the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ   0.2   0.5   0.75   0.5:0.95
LTT [83]             46.8  -     -      -
MRTSR [54]           73.5  32.1  2.7    7.3
MSAT [60]            66.6  36.4  7.9    14.4
ROAD [67]            73.5  46.3  15.0   20.4
ACT [26]             76.5  49.2  19.7   23.4
TAD [19]             -     59.9  -      -
PCSC [69]            84.3  61.0  23.0   27.8
Ours                 84.9  63.5  23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performances among five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which $\mathcal{Q}^{Flow}_n$ and $\mathcal{Q}^{RGB}_n$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, in which case only the two proposal sets based on the RGB and flow features generated by the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ (for the RGB-stream AFC module) and
Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage       0      1      2      3      4
AFC         44.75  46.92  48.14  48.37  48.22
w/o CS      44.21  44.96  45.47  45.48  45.37
w/o CG      43.55  45.95  46.71  46.65  46.67
w/o CS-MP   44.75  45.34  45.51  45.53  45.49
w/o CG-MP   44.75  46.32  46.97  46.85  46.79
w/o PC      42.14  43.21  45.14  45.74  45.77
P_{n-1}^RGB and P_{n-2}^Flow (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". Also, the discussions for the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for fair comparison.
In Table 4.5, the results for stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those for our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P_0^RGB ∪ P_0^Flow for "w/o CG" and Q_0^RGB ∪ Q_0^Flow for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison                 LoR (%)
1       P_1^RGB vs. P_0^Flow       67.1
2       P_2^Flow vs. P_1^RGB       79.2
3       P_3^RGB vs. P_2^Flow       89.5
4       P_4^Flow vs. P_3^RGB       97.1
when the number of stages increases, and the results become stable after 2 or 3 stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages for our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets P_1^RGB at Stage 1 and P_2^Flow at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in P_2^Flow among all proposals in P_2^Flow. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P_1^RGB is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P_2^Flow) is calculated by comparing it with all proposals in the other proposal set (say P_1^RGB) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
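The LoR metric described above can be sketched in a few lines. This is a minimal illustration of the definition, not the thesis code; the function and variable names are our own, and the 0.7 max-tIoU threshold is the one stated in the text.

```python
def tiou(a, b):
    """Temporal IoU between two segments a = (s1, e1) and b = (s2, e2)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def lor_score(curr_props, prev_props, thresh=0.7):
    """Large-over-Ratio: the fraction of proposals in curr_props whose
    max-tIoU against the preceding proposal set exceeds the threshold."""
    if not curr_props:
        return 0.0
    similar = sum(
        1 for p in curr_props
        if max(tiou(p, q) for q in prev_props) > thresh
    )
    return similar / len(curr_props)
```

For example, `lor_score([(0, 10), (20, 30)], [(0, 9), (100, 110)])` returns 0.5: the first proposal overlaps a preceding proposal with tIoU 0.9, while the second has no overlap.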
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. This large LoR score at Stage 4 indicates that the proposals in P_4^Flow and those in P_3^RGB are very similar, even though they are generated for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal
action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated from our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
In Fig. 4.2, we compare the AR@AN results of the action segments generated from the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated from the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments as used in the "SC" method can lead to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates from our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.
Figure 4.3: Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparisons, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and our
PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment, and it eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets: VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et
al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology
5.2.1 Overview
We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely V = {V_k^clip}_{k=1}^K, where V_k^clip indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^N, where s_n indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (i.e., the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.
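The clip notation above can be made concrete with a small helper. This is only an illustration of the notation (our own function name), assuming, as in the formulation, that the video length is a multiple of the clip length T.

```python
def split_into_clips(frames, T):
    """Split a sequence of KT frames into K non-overlapping clips of
    T frames each, matching V = {V_k^clip}_{k=1}^K in the text."""
    assert len(frames) % T == 0, "video length must be a multiple of T"
    return [frames[i:i + T] for i in range(0, len(frames), T)]
```

For instance, a 12-frame video with T = 4 yields K = 3 clips.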
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes b_t for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature F_k^clip ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
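The token construction above can be sketched as follows. This is an illustration only (our own function name), with word strings standing in for the word vectors produced by the embedding module.

```python
def build_textual_tokens(description):
    """Wrap the N word tokens of a description with the special tokens
    ['CLS'] and ['SEP'], yielding the N + 2 textual input tokens E."""
    words = description.split()
    return ["[CLS]"] + words + ["[SEP]"]
```

For the example query in Fig. 5.1, "a child in a hat hugs a toy" (N = 8), this yields 10 tokens.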
Multi-modal Feature Learning. Given the visual input feature F_k^clip and the textual input tokens E, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
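The key-value exchange in one co-attention layer can be sketched as below. This is a heavily simplified single-head version under our own naming (the actual ViLBERT layers are multi-head and include learned projections, residual connections and layer normalization, which we omit).

```python
import numpy as np

def attention(query, key, value):
    """Single-head scaled dot-product attention."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value

def co_attention_exchange(visual, textual):
    """Key-value exchange in a co-attention layer: the visual branch
    attends to textual keys/values (text-guided visual feature) and the
    textual branch attends to visual keys/values (visual-guided textual
    feature)."""
    text_guided_visual = attention(visual, textual, textual)
    visual_guided_textual = attention(textual, visual, visual)
    return text_guided_visual, visual_guided_textual
```

Each output keeps the token count of its own branch while mixing in information from the other modality.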
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using the pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules in the visual branch of ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature F_tv ∈ R^{T×HW×C} and the visual-guided textual feature, respectively.
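The combine and decompose steps of the STCD module can be sketched with plain array operations. This is only a shape-level illustration under our own naming; the attention step between the two functions is omitted, and a plain L2 normalization stands in for the Norm block, whose exact form the text does not specify.

```python
import numpy as np

def stcd_combine(F):
    """Combine step: F has shape (T, HW, C).
    Spatial average pooling  -> temporal feature (T, C)
    Temporal average pooling -> spatial feature (HW, C)
    Concatenation            -> combined visual feature (T + HW, C)."""
    temporal = F.mean(axis=1)   # pool over the HW spatial positions
    spatial = F.mean(axis=0)    # pool over the T frames
    return np.concatenate([temporal, spatial], axis=0)

def stcd_decompose(combined, F):
    """Decompose step: split the (T + HW, C) text-guided feature into a
    temporal part (T, C) and a spatial part (HW, C), replicate them HW
    and T times via broadcasting, add to the input visual feature, and
    normalize to obtain the (T, HW, C) intermediate visual feature."""
    T, HW, C = F.shape
    temporal, spatial = combined[:T], combined[T:]
    out = F + temporal[:, None, :] + spatial[None, :, :]
    return out / (np.linalg.norm(out, axis=-1, keepdims=True) + 1e-6)
```

Broadcasting with `[:, None, :]` and `[None, :, :]` realizes the HW-fold and T-fold replication described above without materializing the copies.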
Spatial Localization. Given the text-guided visual feature F_tv, our SVG-net predicts the bounding box b_t at each frame. We first reshape F_tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap A ∈ R^{8H×8W}, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
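The decoding step at the end of the paragraph can be sketched as follows. This is a minimal CenterNet-style decoding under our own naming, assuming the size head predicts (width, height) at every heatmap position.

```python
import numpy as np

def decode_box(heatmap, sizes):
    """Decode one bounding box per frame from the two detection heads.
    heatmap: (8H, 8W) centre probabilities.
    sizes:   (8H, 8W, 2) predicted (width, height) at every position.
    Returns (x1, y1, x2, y2) in heatmap coordinates."""
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = sizes[cy, cx]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

The argmax picks the most probable centre; the size regressed at that same position converts the centre into corner coordinates.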
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending
frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F_tv to produce the global text-guided visual feature F_gtv ∈ R^{T×C} for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from F_gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score o_obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t_s^0, t_e^0). The initial temporal boundaries (t_s^0, t_e^0) and the predicted bounding boxes b_t form the initial tube prediction result B_init = {b_t}_{t=t_s^0}^{t_e^0}.
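The smoothing-thresholding-longest-run procedure above can be sketched as follows. This is an illustration only: the function names are ours, and the 0.5 threshold and window size 5 are assumed values, since the text does not specify them.

```python
import numpy as np

def median_filter(scores, k=5):
    """Simple 1-D median filter with edge padding (odd window size k)."""
    pad = k // 2
    padded = np.pad(scores, pad, mode="edge")
    return np.array([np.median(padded[i:i + k]) for i in range(len(scores))])

def initial_temporal_boundaries(scores, thresh=0.5, k=5):
    """Smooth the per-frame relevance scores, keep frames above the
    threshold, and return the longest contiguous run as (t_s^0, t_e^0)."""
    smoothed = median_filter(np.asarray(scores, dtype=float), k)
    keep = smoothed >= thresh
    best, start = None, None
    for t, on in enumerate(list(keep) + [False]):  # sentinel closes a final run
        if on and start is None:
            start = t
        elif not on and start is not None:
            if best is None or (t - 1 - start) > (best[1] - best[0]):
                best = (start, t - 1)
            start = None
    return best
```

The median filter suppresses isolated high scores, so a single spurious frame does not create a second starting/ending pair.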
Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as (x̄, ȳ), w̄, h̄ and ō, respectively. When the frame contains a ground-truth bounding box, we have ō = 1; otherwise, we have ō = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap Ā for each frame by using the Gaussian kernel
ā_xy = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where ā_xy is the value of Ā ∈ R^{8H×8W} at the spatial location (x, y), and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:

L_SVG = λ₁ L_c(A, Ā) + λ₂ L_rel(o_obj, ō) + λ₃ L_size(w_xy, h_xy, w̄, h̄)    (5.1)

where A ∈ R^{8H×8W} is the predicted heatmap, o_obj is the predicted relevance score, and w_xy and h_xy are the predicted width and height of the bounding box centered at the position (x̄, ȳ). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ₁ = λ₂ = 1 and λ₃ = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
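The ground-truth heatmap Ā in the focal-loss term can be generated directly from the Gaussian kernel above. This is a minimal sketch with our own function name; the adaptive choice of σ from the object size follows [28] and is treated here as a given input.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """Ground-truth centre heatmap:
    a_xy = exp(-((x - x̄)^2 + (y - ȳ)^2) / (2 σ^2)).
    shape  = (8H, 8W), the heatmap resolution;
    center = (x̄, ȳ), the ground-truth bounding box centre;
    sigma  = bandwidth, adaptively set from the object size."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

The peak value 1 sits exactly at the ground-truth centre and decays smoothly with distance, which is what allows the focal loss to tolerate near-centre predictions.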
5.2.3 Temporal Boundary Refinement
The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect the temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.
As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., B_init) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
Chapter 5. STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
The frame-based branch then continues to refine the boundaries t_s and t_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
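The alternation between the two branches can be sketched as follows (a minimal Python sketch; `refine_boundaries`, `anchor_branch`, and `frame_branch` are illustrative placeholders, not the thesis implementation):

```python
# Minimal sketch of the multi-stage refinement loop in TBR-net (illustrative
# only): each stage applies the anchor-based branch and then the frame-based
# branch, and the refined boundaries become the anchor for the next stage.

def refine_boundaries(t_s, t_e, anchor_branch, frame_branch, num_stages=3):
    for _ in range(num_stages):
        t_s, t_e = anchor_branch(t_s, t_e)  # segment-level regression
        t_s, t_e = frame_branch(t_s, t_e)   # local frame-level selection
    return t_s, t_e
```

In the thesis, three stages are used, since the results saturate afterwards.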
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries (t_s^0, t_e^0) and the length l = t_e^0 - t_s^0, we temporally extend the anchor before the starting frame and after the ending frame by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t_s^0 - l/4, t_e^0 + l/4] and generate the anchor feature F by stacking, along the temporal dimension, the features of the pooled video feature F_pool at the 20 sampled frames, where the pooled video feature F_pool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature F, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which an average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t_s^0 and t_e^0 and generate the new starting and ending frame positions t_s^1 and t_e^1.
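The anchor feature construction can be illustrated with a short NumPy sketch (our own simplification: the helper name `build_anchor_feature`, the rounding of sampled positions, and the array shapes are assumptions; F_pool is taken to have one feature row per frame):

```python
import numpy as np

def build_anchor_feature(f_pool, t_s, t_e, num_samples=20, extend_ratio=0.25):
    """Extend the anchor by l/4 on each side and sample `num_samples` frames.

    f_pool: pooled video feature of shape (num_frames, C), one row per frame.
    Returns the stacked anchor feature of shape (num_samples, C).
    """
    length = t_e - t_s
    start = t_s - extend_ratio * length          # t_s^0 - l/4
    end = t_e + extend_ratio * length            # t_e^0 + l/4
    idx = np.linspace(start, end, num_samples).round().astype(int)
    idx = np.clip(idx, 0, f_pool.shape[0] - 1)   # stay inside the video
    return f_pool[idx]
```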
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below we take the starting position as an example. Given the refined starting position t_s^1 generated from the previous anchor-based branch, we define the starting duration as D_s = [t_s^1 - D/2 + 1, t_s^1 + D/2], where D is the length of the starting duration. We then construct the frame-level feature F_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F_pool. We then take the frame-level feature F_s and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t_s^1. Accordingly, we can produce the ending frame position t_e^1. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t_s^1 and t_e^1 are generated, they can be used as the input for the anchor-based branch at the next stage.
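The boundary selection step of the frame-based branch amounts to an argmax over per-frame probabilities (a hedged sketch; in the thesis the scores come from a ViLBERT module followed by a 1D convolution, which we replace here by a given score array, and the helper name is ours):

```python
import numpy as np

def select_boundary(frame_scores, t_refined, duration):
    """Pick the most probable frame inside the local starting/ending duration.

    frame_scores: probabilities for the frames in
        [t_refined - duration//2 + 1, t_refined + duration//2].
    """
    first_frame = t_refined - duration // 2 + 1
    return first_frame + int(np.argmax(frame_scores))
```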
Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

L_TBR = λ4 · L_anc(Δt, Δt*) + λ5 · L_s(p_s, p_s*) + λ6 · L_e(p_e, p_e*),   (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and the ending frames, respectively. In L_anc, Δt and Δt* are the predicted offsets and the ground-truth offsets, respectively. In L_s, p_s is the predicted probability of being the starting frame for the frames within the starting duration, while p_s* is the ground-truth label for starting frame position prediction. Similarly, we use p_e and p_e* for the loss L_e. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.
5.3 Experiment

5.3.1 Experiment Setup
Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set, and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960), and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.
- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. It is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p, and the temporal length of each clip K are set as 9, 0.3, and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented using PyTorch on a machine with a single GTX 1080Ti GPU.
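The step learning-rate schedule described above (an initial rate of 0.00001 for 50 epochs, then divided by 10 for another 10 epochs) can be written as a small helper (a sketch; the function name is ours):

```python
def learning_rate(epoch, base_lr=1e-5, decay_epoch=50, factor=10.0):
    # constant rate for the first `decay_epoch` epochs, then one step decay
    return base_lr if epoch < decay_epoch else base_lr / factor
```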
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and S_U is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
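The vIoU-based criteria can be sketched directly from their definitions (our own helper names; tubes are represented as dicts mapping frame indices to boxes, and `box_iou` is any per-frame IoU function):

```python
def viou(pred_tube, gt_tube, box_iou):
    """vIoU = (1/|S_U|) * sum over t in S_I of r_t, for one tube pair."""
    s_i = set(pred_tube) & set(gt_tube)   # frames present in both tubes
    s_u = set(pred_tube) | set(gt_tube)   # frames present in either tube
    if not s_u:
        return 0.0
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in s_i) / len(s_u)

def m_viou(viou_scores):
    # average vIoU over all testing samples
    return sum(viou_scores) / len(viou_scores)

def viou_at_r(viou_scores, r):
    # ratio of testing samples with vIoU > R
    return sum(s > r for s in viou_scores) / len(viou_scores)
```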
Baseline Methods
- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods
We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study
In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.
Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.
It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset ("*" indicates the results are quoted from its original work).

                           Declarative Sentence Grounding       Interrogative Sentence Grounding
Method                     m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)  m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*               19.75      25.77        14.60        18.32      21.10        12.83
Vanilla (ours)             21.49      27.69        15.77        20.22      23.17        13.82
SVG-net (ours)             22.87      28.31        16.31        21.57      24.09        14.33
SVG-net + TBR-net (ours)   25.21      30.99        18.21        23.67      26.26        16.01
Table 5.2: Results of different methods on the HC-STVG dataset ("*" indicates the results are quoted from the original work).

Method                     m_vIoU   vIoU@0.3   vIoU@0.5
STGVT [76]*                18.15    26.81      9.48
Vanilla (ours)             17.94    25.59      8.75
SVG-net (ours)             19.45    27.78      10.12
SVG-net + TBR-net (ours)   22.41    30.31      12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net, and SVG-net + TBR-net, respectively.
From the results, we have the following observations: (1) for the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages; (2) the performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes; (3) the results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3; (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                                Declarative Sentence Grounding   Interrogative Sentence Grounding
Methods             Stages      m_vIoU   vIoU@0.3   vIoU@0.5     m_vIoU   vIoU@0.3   vIoU@0.5
SVG-net             -           22.87    28.31      16.31        21.57    24.09      14.33
SVG-net + IFB-net   1           23.41    28.84      16.69        22.12    24.48      14.57
                    2           23.82    28.97      16.85        22.55    24.69      14.71
                    3           23.77    28.91      16.87        22.61    24.72      14.79
SVG-net + IAB-net   1           23.27    29.57      16.39        21.78    24.93      14.31
                    2           23.74    29.98      16.52        22.21    25.42      14.59
                    3           23.85    30.15      16.57        22.43    25.71      14.47
SVG-net + TBR-net   1           24.32    29.92      17.22        22.69    25.37      15.29
                    2           24.95    30.75      17.02        23.34    26.03      15.83
                    3           25.21    30.99      18.21        23.67    26.26      16.01
Our full model, SVG-net + TBR-net, outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary
In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our work and discuss potential future directions for video localization.
6.1 Conclusions
The contributions of this thesis are summarized as follows.
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3, and UCF-101-24 datasets show the effectiveness of our proposed framework.
• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work
A potential future direction for video localization is to unify spatial video localization and temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results with two separate networks, respectively, and it is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.
Bibliography
[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.
[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT '98. Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.
[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.
[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.
[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.
[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.
[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.
[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.
[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.
[13] J. Gao, K. Chen, and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.
[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.
[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[18] G. Gkioxari and J. Malik. "Finding action tubes". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In: The IEEE International Conference on Computer Vision (ICCV). 2017.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.
[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.
[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS Challenge: Action recognition with a large number of classes. 2014.
[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.
[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.
[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).
[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.
[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In: AAAI. 2020, pp. 11336–11344.
[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In: ArXiv. 2019.
[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.
[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.
[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.
[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.
[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. “Two-Stream Convolutional Networks for Action Recognition in Videos”. In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. “A multi-stream bi-directional recurrent neural network for fine-grained action detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. “Online real-time multiple spatiotemporal action localisation and prediction”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir, and M. Shah. “UCF101: A dataset of 101 human actions classes from videos in the wild”. In: arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. “Improving Action Localization by Progressive Cross-stream Cooperation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. “VL-BERT: Pre-training of Generic Visual-Linguistic Representations”. In: International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. “VideoBERT: A joint model for video and language representation learning”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. “Actor-centric Relation Network”. In: The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. “FishNet: A versatile backbone for image, region and pixel level prediction”. In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. “Going deeper with convolutions”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. “LXMERT: Learning Cross-Modality Encoder Representations from Transformers”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. “Human-centric Spatio-Temporal Video Grounding With Visual Transformers”. In: arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. “Learning spatiotemporal features with 3d convolutional networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. “Attention is all you need”. In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. “Actionness Estimation Using Hybrid Fully Convolutional Networks”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. “UntrimmedNets for weakly supervised action recognition and detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. “Temporal segment networks: Towards good practices for deep action recognition”. In: European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. “Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. “Learning to track for spatio-temporal action localization”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das, and K. Saenko. “R-C3D: Region convolutional 3d network for temporal activity detection”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. “Spatio-temporal person retrieval via natural language queries”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li, and Y. Yu. “Cross-modal relationship inference for grounding referring expressions”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li, and Y. Yu. “Dynamic graph attention for referring expression comprehension”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. “STEP: Spatio-Temporal Progressive Learning for Video Action Detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. “Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts”. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang, and J. Luo. “Improving One-stage Visual Grounding by Recursive Sub-query Construction”. In: arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. “A fast and accurate one-stage approach to visual grounding”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao, and R. Nevatia. “Spatio-temporal action detection with cascade proposal and location anticipation”. In: arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang, and Y. Tian. “Discovering spatio-temporal action tubes”. In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. “MAttNet: Modular attention network for referring expression comprehension”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. “Temporal action localization with pyramid of score distribution features”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. “Temporal action localization by structured maximal sums”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. “Graph Convolutional Networks for Temporal Action Localization”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. “Collaborative and Adversarial Network for Unsupervised Domain Adaptation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. “Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding”. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. “Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. “Temporal action detection with structured segment networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang, and P. Krähenbühl. “Objects as Points”. In: arXiv preprint arXiv:1904.07850 (2019).
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. “Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
COPYRIGHT © 2021 BY RUI SU
ALL RIGHTS RESERVED
Declaration
I, Rui SU, declare that this thesis titled “Deep Learning for Spatial and Temporal Video Localization”, which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.
Signed
Date June 17 2021
For Mama and PapaA sA s
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which help me to become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through the useful discussions with them, I worked out many difficult problems and am also encouraged to explore the unknown in the future.
Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.
Rui SU
University of Sydney
June 17, 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. “Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. “PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization”. IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. “Improving action localization by progressive cross-stream cooperation”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv
1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11
2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102
6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104
Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions at the spatial domain and the temporal domain, respectively. 43
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance “Tennis Swing”. In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect the bounding boxes with the class label “Tennis Swing” for (b) and “Soccer Juggling” for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect the action bounding boxes with the class label “Basketball”, and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the “Detection Head” to generate the updated segments. 66
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset. 80
4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class “Throw Discus”, with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the “Multi-head Attention” and “Add & Norm” blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5. 48
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here “AD” is for actionness detector and “ASD” is for action segment detector. 49
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5. 50
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds. 51
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset. 52
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here “AD” is for the actionness detector, while “ASD” is for the action segment detector. 52
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset. 52
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset. 54
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset. 55
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset. 55
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages. 55
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages. 55
4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds. 74
4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used. 75
4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied. 75
4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds. 76
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset. 77
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged for all videos in the test set of the THUMOS14 dataset. 78
5.1 Results of different methods on the VidSTG dataset. “*” indicates the results are quoted from its original work. 99
5.2 Results of different methods on the HC-STVG dataset. “*” indicates the results are quoted from the original work. 100
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset. 101
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the content of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
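To make the three task definitions concrete, their outputs can be sketched with the following illustrative data structures; all names and fields here are our own, chosen only to mirror the definitions above, and are not taken from the proposed frameworks:

```python
# Illustrative output structures for the three video localization tasks.
# These are hypothetical sketches, not part of the thesis's implementations.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class ActionTube:
    """Spatio-temporal action localization: one box per action frame."""
    label: str
    boxes: List[Tuple[int, Box]]  # (frame_index, bounding box) pairs

@dataclass
class TemporalSegment:
    """Temporal action localization: only the temporal extent matters."""
    label: str
    start: float  # starting time point (seconds)
    end: float    # ending time point (seconds)

@dataclass
class GroundedTube:
    """Spatio-temporal visual grounding: a tube for one textual query."""
    query: str
    boxes: List[Tuple[int, Box]]
```

Note how the first and third tasks share the same spatial output form (a sequence of per-frame boxes), while the second discards spatial coordinates entirely.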
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatial-temporal Action Localization
Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is the untrimmed videos while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.
For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such human actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segment them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely challenging to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
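The data association step described above can be illustrated with a deliberately simplified greedy linker. The rule below (pick, frame by frame, the detection maximising overlap with the previous box plus its own class score) is only a hypothetical stand-in for the linking procedures used in the cited works:

```python
# A toy sketch of overlap-plus-score data association, not the exact
# procedure of any cited method. Boxes are (x1, y1, x2, y2) tuples and
# each frame holds a list of (box, score) detections for one class.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_tube(frames):
    """Greedily link one action tube through per-frame (box, score) lists."""
    # Start from the highest-scoring detection in the first frame.
    tube = [max(frames[0], key=lambda d: d[1])[0]]
    for dets in frames[1:]:
        # Balance spatial overlap with the previous box (temporal
        # smoothness) against the detection's own action class score.
        box, _ = max(dets, key=lambda d: iou(tube[-1], d[0]) + d[1])
        tube.append(box)
    return tube
```

Such association assigns a box to every frame of the tube, which is precisely why it struggles at temporal boundaries: a separate mechanism is still needed to decide where the action starts and ends.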
Based on the first motivation in this paper we propose a progres-sive framework called Progressive Cross-stream Cooperation (PCSC) toiteratively use both region proposals and features from one stream toprogressively help learn better action detection models for another streamTo exploit the information from both streams at the region proposal levelwe collect a larger set of training samples by combining the latest region
4 Chapter 1 Introduction
proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations by capturing both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
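The two levels of cooperation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the proposal format, the `top_k` cut-off, and the weighted-sum "message" are all assumptions (the actual message passing module is a learned network):

```python
# Minimal sketch of the two cooperation mechanisms (illustrative only).

def combine_proposals(rgb_proposals, flow_proposals, top_k=4):
    """Proposal-level cooperation: merge the candidate boxes detected by
    the RGB and flow streams and keep the highest-scoring ones, so each
    stream is retrained with proposals contributed by the other stream."""
    merged = rgb_proposals + flow_proposals
    merged.sort(key=lambda p: p["score"], reverse=True)
    return merged[:top_k]

def pass_message(target_feat, source_feat, weight=0.5):
    """Feature-level cooperation: a toy message passing step that refines
    one stream's feature with information from the other stream.  Here the
    'message' is just a weighted sum; the real module is learned."""
    return [t + weight * s for t, s in zip(target_feat, source_feat)]

rgb = [{"box": (10, 10, 50, 80), "score": 0.9}, {"box": (60, 20, 90, 70), "score": 0.3}]
flow = [{"box": (12, 8, 52, 82), "score": 0.7}, {"box": (5, 5, 20, 20), "score": 0.2}]
proposals = combine_proposals(rgb, flow, top_k=3)
fused = pass_message([1.0, 2.0], [0.5, 0.5])
```

In the actual framework this exchange is repeated over several stages, with the roles of the RGB and flow streams alternating.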
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
Our contributions are briefly summarized as follows:
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method to learn
class-specific actionness detectors and an action segment detector, which applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.
[Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods. The original figure compares the ground-truth segments against the predictions of P-GCN and our method on three examples (Frisbee Catch, Golf Swing, and Throw Discus), and also illustrates anchor-based proposal scores and per-frame actionness starting/ending curves.]
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. However, we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes can be considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
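The frame-based proposal generation scheme mentioned above can be sketched by thresholding per-frame actionness scores and grouping contiguous frames; the scores and the threshold below are made up for illustration:

```python
# Minimal sketch: turn per-frame actionness scores into temporal segment
# proposals by grouping contiguous frames above a threshold.

def actionness_to_segments(scores, threshold=0.5):
    """Return (start, end) frame indices (inclusive) of contiguous runs
    whose actionness score exceeds the threshold."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i                        # a run begins
        elif s <= threshold and start is not None:
            segments.append((start, i - 1))  # a run ends
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments

scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.9, 0.2]
print(actionness_to_segments(scores))  # -> [(2, 4), (7, 8)]
```

This illustrates why such boundaries tend to be tight (they follow the score profile frame by frame), while a missed high-score frame splits or drops a whole proposal, hurting recall.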
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and
segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals that have higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with the cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module, which improves the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also updates the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals produced by the frame-based approach from the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based
segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of
bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made in this direction in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [1, 41], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate, because these methods regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the
STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved for various vision-language tasks [47, 70, 33].
Furthermore, both SVG-net and TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover,
we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Units (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters
(weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work has stimulated the research interest in using deep learning methods to solve computer vision problems.
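The gradient vanishing argument above can be made concrete: the derivative of the sigmoid never exceeds 0.25, so a product of many such per-layer factors shrinks towards zero, while ReLU passes a gradient of exactly 1 for active units. A small numeric illustration:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)       # maximum value is 0.25, attained at x = 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # gradient does not shrink for active units

# Multiplying per-layer gradient factors through a 10-layer chain (best
# case x = 0 for sigmoid, all-positive inputs for ReLU) shows why deep
# sigmoid networks suffer from vanishing gradients:
sig_chain = 0.25 ** 10   # ~9.5e-07
relu_chain = 1.0 ** 10   # 1.0
```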
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
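The parameter saving from this decomposition is easy to verify by counting weights (biases ignored; 64 input/output channels chosen arbitrarily for illustration):

```python
def conv_params(kernel, channels):
    """Weights in a conv layer with `channels` input and output channels
    (biases ignored for simplicity)."""
    return kernel * kernel * channels * channels

c = 64
stack_two_3x3 = 2 * conv_params(3, c)    # 18 * c^2 = 73728
single_5x5 = conv_params(5, c)           # 25 * c^2 = 102400
stack_three_3x3 = 3 * conv_params(3, c)  # 27 * c^2 = 110592
single_7x7 = conv_params(7, c)           # 49 * c^2 = 200704
```

Both stacks cover the same receptive field with roughly 28% (respectively 45%) fewer weights, while adding extra nonlinearities between the layers.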
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which feeds the output of the previous inception module into a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional
neural network is built by stacking such inception modules to achieve high image classification performance.
VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients can accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without such accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely high depth to improve the image classification performance.
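The effect of the shortcut connection can be illustrated with a toy residual block computing y = x + F(x): because of the identity path, the input (and, symmetrically, the gradient during back-propagation) passes through unchanged even when the learned branch contributes nothing. `F` below is an arbitrary stand-in for the block's convolutional layers:

```python
# Toy 1-D "residual block": output is x + F(x).  The identity path has
# gradient exactly 1, so gradients can reach shallow layers undiminished.

def residual_block(x, f):
    return [xi + fi for xi, fi in zip(x, f(x))]

# Even if the learned branch outputs all zeros (e.g. early in training),
# the block still passes its input through unchanged:
identity_out = residual_block([1.0, -2.0, 3.0], lambda x: [0.0] * len(x))
```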
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each layer within a dense block has shortcut connections to the other layers. Specifically, each layer takes the concatenation of the outputs from all the preceding layers as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
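The dense connectivity pattern is easiest to see in the channel bookkeeping: with growth rate k, the input width of each successive layer grows by k, since its output maps are appended to everything that came before. The numbers below are illustrative, not the configuration of [22]:

```python
# Channel bookkeeping for dense connectivity: each layer receives the
# concatenation of all earlier feature maps, so its input width grows by
# the growth rate k at every layer.

def dense_block_widths(in_channels, growth_rate, num_layers):
    widths = []
    c = in_channels
    for _ in range(num_layers):
        widths.append(c)      # input width seen by this layer
        c += growth_rate      # its k output maps are appended for later layers
    return widths

print(dense_block_widths(64, 32, 4))  # -> [64, 96, 128, 160]
```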
2.1.2 Object Detection
The object detection task aims to detect the locations and categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to a high computational cost and long training and inference time.
To make the convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed the entire image into the network only once to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that these can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference time are largely reduced.
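The idea of ROI pooling, converting an arbitrary-sized region of the feature map into a fixed-size grid by max-pooling over bins, can be sketched as follows. This toy version handles a single-channel feature map stored as nested lists; production implementations (e.g. `torchvision.ops.roi_pool`) operate on batched multi-channel tensors:

```python
# Minimal single-channel ROI max-pooling sketch (illustrative only).

def roi_max_pool(feat, roi, out_size=2):
    """feat: H x W list of lists; roi: (x1, y1, x2, y2) in feature-map
    coordinates (exclusive right/bottom); returns out_size x out_size."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    out = []
    for by in range(out_size):
        row = []
        ys = y1 + by * h // out_size
        ye = y1 + (by + 1) * h // out_size
        for bx in range(out_size):
            xs = x1 + bx * w // out_size
            xe = x1 + (bx + 1) * w // out_size
            # max over the bin (guard keeps each bin at least 1x1)
            row.append(max(feat[y][x] for y in range(ys, max(ye, ys + 1))
                                      for x in range(xs, max(xe, xs + 1))))
        out.append(row)
    return out

feat = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_max_pool(feat, (0, 0, 4, 4)))  # -> [[6, 8], [14, 16]]
```

Whatever the ROI size, the output is always `out_size × out_size`, which is what allows a fully connected detection head with fixed input dimension.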
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image is fed into a convolutional neural network to generate a feature map. A set of pre-defined
anchors is placed at each location of the generated feature map, from which a separate network generates region proposals. The final detection results are obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
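Pre-defined anchor generation can be sketched as follows; the stride, scales and aspect ratios below are illustrative values, not those of [58]:

```python
# Sketch of anchor generation as used by an RPN: at every feature-map
# location, place boxes of several scales and aspect ratios, centred on
# that location mapped back to image coordinates.

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * (r ** 0.5)   # width scaled by sqrt(aspect ratio)
                    h = s / (r ** 0.5)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(2, 3)
# 2 * 3 locations, 2 scales, 3 ratios -> 36 anchors in total
```

The RPN then scores each anchor as object/background and regresses offsets from it, replacing selective search with a learned, nearly cost-free proposal stage.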
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and a detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicted the class probabilities and the offsets to the anchor. YOLO is faster than Faster RCNN as it does not require the time-consuming region feature extraction based on the ROI pooling operation.
Similarly, Liu et al. [44] proposed the Single Shot MultiBox Detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
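The decoding step of this anchor-free scheme, keeping heat-map points that are local maxima above a threshold and reading off the predicted box size at those points, can be sketched as follows (toy hand-written maps; the real model also predicts sub-pixel center offsets):

```python
# Sketch of CenterNet-style decoding on toy maps (illustrative only).

def decode_centers(heat, wh, threshold=0.5):
    h, w = len(heat), len(heat[0])
    dets = []
    for y in range(h):
        for x in range(w):
            s = heat[y][x]
            if s < threshold:
                continue
            neighbours = [heat[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))]
            if s >= max(neighbours):          # local maximum of the heat map
                bw, bh = wh[y][x]             # box size predicted at the peak
                dets.append((x, y, bw, bh, s))
    return dets

heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.3, 0.2]]
wh = [[(0, 0)] * 3, [(0, 0), (4, 6), (0, 0)], [(0, 0)] * 3]
print(decode_centers(heat, wh))  # -> [(1, 1, 4, 6, 0.9)]
```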
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
In the work in [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through their experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
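The simplest fusion baseline from [64], averaging the class scores of the two streams, can be sketched as follows. The convolutional feature-level fusion found best in [12] operates on feature maps instead, which this toy example does not model; the scores and weight are made up:

```python
# Late fusion of two-stream predictions by weighted score averaging.

def fuse_scores(rgb_scores, flow_scores, w_rgb=0.5):
    return [w_rgb * r + (1 - w_rgb) * f for r, f in zip(rgb_scores, flow_scores)]

rgb = [0.7, 0.2, 0.1]    # e.g. softmax over 3 action classes (spatial net)
flow = [0.3, 0.6, 0.1]   # temporal net scores for the same classes
fused = fuse_scores(rgb, flow)
pred = fused.index(max(fused))   # final class decision after fusion
```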
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks (TSN) to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. Therefore, they divided a video into segments and
randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus function was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB differences, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
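The TSN sampling scheme can be sketched as follows: split the frame indices of a video into K equal segments and draw one random frame from each, so the input covers the whole video with only K frames (the segment count and video length below are arbitrary):

```python
import random

# Sketch of TSN-style segment sampling (illustrative only).

def sample_segments(num_frames, k, rng=random):
    """Return one randomly chosen frame index from each of k equal
    segments of a video with num_frames frames."""
    seg_len = num_frames // k
    return [seg * seg_len + rng.randrange(seg_len) for seg in range(k)]

random.seed(0)
idx = sample_segments(30, 3)   # one index from [0,10), [10,20), [20,30)
```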
2D convolutional neural networks have proven their effectiveness on the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics, and demonstrated that this can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
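The inflation trick of I3D can be illustrated directly: a pretrained 2D kernel is repeated T times along the new temporal axis and rescaled by 1/T, so that the inflated 3D network initially reproduces the 2D response on a "boring video" of identical repeated frames (the kernel values below are toy numbers):

```python
# Sketch of I3D kernel inflation: tile a 2-D kernel T times along the
# temporal axis and divide by T, preserving the response on repeated frames.

def inflate_kernel(kernel_2d, t):
    scale = 1.0 / t
    return [[[v * scale for v in row] for row in kernel_2d] for _ in range(t)]

k2d = [[1.0, 2.0], [3.0, 4.0]]
k3d = inflate_kernel(k2d, 4)          # shape: 4 (time) x 2 x 2

# Response on a "boring video" (the same frame repeated 4 times) matches
# the 2-D response on a single frame:
frame = [[1.0, 1.0], [1.0, 1.0]]
resp_2d = sum(k2d[i][j] * frame[i][j] for i in range(2) for j in range(2))
resp_3d = sum(k3d[t][i][j] * frame[i][j]
              for t in range(4) for i in range(2) for j in range(2))
```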
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be exploited. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these
two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.
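The linking step described above can be illustrated with a small greedy sketch; the function name, the (x1, y1, x2, y2, score) box format, and the scoring rule (class score plus spatial IoU with the tube's last box) are illustrative assumptions rather than the exact algorithm of [60].

```python
def link_detections(frames, iou_thresh=0.3):
    """Greedily link per-frame detections into one action tube.

    Each frame is a list of (x1, y1, x2, y2, score) detections; at every
    frame we extend the tube with the detection that maximises its class
    score plus its spatial IoU with the tube's last box."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)

    tube = [max(frames[0], key=lambda d: d[4])]  # start from the best box
    for dets in frames[1:]:
        best = max(dets, key=lambda d: d[4] + iou(d, tube[-1]))
        if iou(best, tube[-1]) >= iou_thresh:  # keep only continuous links
            tube.append(best)
    return tube
```

A real system would additionally handle multiple tubes, tube termination, and the temporal trimming step mentioned above.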
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for the two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al [26] proposed an ACT-detector to exploit the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes to form a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form a representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than the action tubes linked from frame-level detections, because they were produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the action for each proposal. Yuan et al [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. They then used the trained classification model to initialize another C3D model as a localization model, which took the proposals and their overlap scores as input, and proposed a new loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting and ending times can be selected, which were then used to determine the temporal boundaries of actions for proposal generation.
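The thresholding-and-grouping idea behind this first line of methods can be sketched as follows; the function name and the score values are hypothetical, and real methods add steps such as multi-threshold grouping and proposal scoring.

```python
def group_actionness(scores, threshold=0.5):
    """Group consecutive frames whose actionness score exceeds the
    threshold into temporal proposals (start, end), end exclusive."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                      # an action segment begins
        elif s < threshold and start is not None:
            proposals.append((start, t))   # the segment ends here
            start = None
    if start is not None:                  # segment runs to the last frame
        proposals.append((start, len(scores)))
    return proposals

# frame-wise actionness scores for a 10-frame clip (illustrative)
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6, 0.9, 0.8, 0.2]
print(group_actionness(scores, 0.5))  # → [(2, 5), (6, 9)]
```

As noted below, a single low score inside a true action splits the segment, which is exactly the low-recall failure mode of this paradigm.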
Shou et al [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce
the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al [103] introduced an effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict the binary actionness score for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the temporal action localization performance compared with other state-of-the-art methods.
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with more temporal detail. However, these methods can easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generated a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers were then used to refine the locations and predict action classes for the anchors.
Lin et al [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the
prediction anchor instances. Their method achieved results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.
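The anchor-based paradigm starts from a dense grid of pre-defined temporal segments. A minimal sketch of multi-scale temporal anchor generation is shown below; the scale set and the unit-based coordinates are illustrative assumptions, not the exact configuration of [36] or [15].

```python
def temporal_anchors(num_units, scales=(1, 2, 4, 8)):
    """Centre one anchor of each temporal scale (in clip units) at every
    temporal position; boundary regression later refines (start, end).
    Out-of-range anchors would be clipped or discarded in practice."""
    anchors = []
    for t in range(num_units):
        for s in scales:
            anchors.append((t - s / 2.0, t + s / 2.0))
    return anchors

anchors = temporal_anchors(100)
print(len(anchors))  # 100 positions × 4 scales → 400 anchors
```

Because the anchors densely tile the video at several durations, most ground-truth instances are covered by at least one anchor, which explains the high recall (and the coarser boundaries) of this paradigm noted below.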
Gao et al [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN achieved a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster RCNN [58] object detection framework, Chao et al [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that extreme variation in action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated by pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision-and-language task. Its goal is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used by a spatial localizer and a temporal localizer to generate spatio-temporal tubes.
In [101], Zhang et al proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage, and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage their complementary information are also discussed. Finally, we give a review of the new vision-and-language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train the other stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observation of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from frames in the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help the other stream (i.e., the RGB/Flow stream) iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message-passing approach to pass information from one stream to the other, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatial-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification, and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].
Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames around one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent the key frame. Other methods [66 558] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion, and pose clues for action classification and spatial-temporal action localization.
Third, temporal localization forms action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity, and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26], and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late-fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for the other stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late-fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of the other stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve the action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to the other stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help the other stream.
In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.
The second subset is from the detection boxes of the other stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from the other stream (e.g., flow) are used for the current stream (e.g., RGB).
Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
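The per-stage proposal update above can be sketched as follows; the helper names, the (x1, y1, x2, y2, score) box format, the greedy NMS, and the particular threshold values are illustrative assumptions, not the exact implementation.

```python
def nms(boxes, thresh):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score) boxes."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return kept

def cooperate_proposals(own_boxes_prev, other_boxes_curr,
                        own_thresh=0.7, other_thresh=0.3):
    """Sketch of P_t^(i) = B_{t-2}^(i) U B_{t-1}^(j): merge a stream's own
    earlier detection boxes with the other stream's latest ones; a lower
    NMS threshold is applied to the other stream's boxes so that more
    redundant boxes are removed before merging."""
    return nms(own_boxes_prev, own_thresh) + nms(other_boxes_curr, other_thresh)
```

The merged set then plays the role of the region proposals fed to the detection head at the next stage.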
With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from the other stream. This helps reduce the number of missing bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for the other stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled in the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as C^i_2, C^i_3, C^i_4 and C^i_5, where i ∈ {RGB, Flow} indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as U^i_2, U^i_3, U^i_4 and U^i_5.
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing the softmax scores or concatenating the features before the final classifiers, which are
3.2 Action Detection Model in Spatial Domain
insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they can help each other for feature refinement.
We pass the messages between the feature maps in the bottom-up pathways of the two streams. Denote l as the index for the set of feature maps {C^i_2, C^i_3, C^i_4, C^i_5}, i.e., l ∈ {2, 3, 4, 5}. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features C^RGB_l with the help of the features C^Flow_l as follows:

    C̃^RGB_l = f_θ(C^Flow_l) ⊕ C^RGB_l,    (3.1)

where C̃^RGB_l denotes the improved RGB features, ⊕ denotes the element-wise addition of the feature maps, and f_θ(·) is the mapping function (parameterized by θ) of our message-passing module. The function f_θ(C^Flow_l) nonlinearly extracts the message from the features C^Flow_l, and the extracted message is then used to improve the features C^RGB_l.

The output of f_θ(·) must have the same number of channels and the same resolution as C^RGB_l and C^Flow_l. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function. The first 1×1 convolutional layer reduces the channel dimension and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps C^RGB_l in the bottom-up pathway are refined, the corresponding feature maps U^RGB_l in the top-down pathway are generated accordingly.
The above process is for image-level message passing only. This image-level message passing is performed only once, from the Flow stream to the RGB stream, and it provides good features to initialize the subsequent message-passing stages.
Denote the image-level feature map sets for the RGB and flow streams by I^RGB and I^Flow, respectively. They are used to extract the features F^RGB_t and F^Flow_t by ROI pooling at each stage t, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI-feature F^Flow_t of the flow stream is used to help improve the ROI-feature F^RGB_t of the RGB stream, as illustrated in Fig. 3.1(b). Specifically, the improved RGB feature F̃^RGB_t is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI-feature F^RGB_t of the RGB stream is used to help improve the ROI-feature F^Flow_t of the flow stream, as illustrated in Fig. 3.1(c). This message passing between ROI-features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
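The alternating four-stage schedule described above can be sketched as follows, where `improve` and `detect` are placeholder callables standing in for the ROI-feature message passing of Eqn. (3.1) and the detection head; only the scheduling is illustrated, not the networks themselves:

```python
def pcsc_stages(rgb_feat, flow_feat, improve, detect, num_stages=4):
    """Flow helps RGB at Stages 1 and 3; RGB helps flow at Stages 2 and 4.
    `improve(target, helper)` stands in for the cross-stream message
    passing on ROI features, `detect(feature)` for the detection head."""
    feats = {"rgb": rgb_feat, "flow": flow_feat}
    outputs = []
    for t in range(1, num_stages + 1):
        target, helper = ("rgb", "flow") if t % 2 == 1 else ("flow", "rgb")
        feats[target] = improve(feats[target], feats[helper])  # refine target stream
        outputs.append(detect(feats[target]))                  # detect with refined feature
    return outputs

# dummy run with scalars just to show the alternating schedule
print(pcsc_stages(1, 10, improve=lambda a, b: a + b, detect=lambda x: x))
# -> [11, 21, 32, 53]
```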
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit the temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample used to train the RPN in our PCSC method is composed of k neighbouring frames with the key frame in the middle. The region proposals generated from the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, and negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region proposal-level cooperation process.
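The IoU-based label assignment can be sketched as below; the helper names are ours, boxes are given as (x1, y1, x2, y2), and proposals with IoU exactly 0.5 are treated as negatives for simplicity:

```python
def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_label(proposal, gt_boxes, thresh=0.5):
    # positive (1) if the proposal overlaps any ground-truth box with
    # IoU above the threshold, negative (0) otherwise
    return int(any(iou(proposal, g) > thresh for g in gt_boxes))

assert assign_label((0, 0, 10, 10), [(0, 0, 10, 8)]) == 1   # IoU = 0.8
assert assign_label((0, 0, 10, 10), [(20, 20, 30, 30)]) == 0
```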
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build the action tubes by linking them. Here, we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation Module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module. The Initial Segment Generation module combines the outputs from the Actionness Detector (AD) and the Action Segment Detector (ASD), taking the union of their segments to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging the complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2. Each stage comprises a cooperation module and a detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7×7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate the actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. It uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by the frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These samples are therefore the "confusing samples", and they are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of the actionness belonging to class i.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then, a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, and we can thereby refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
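A minimal sketch of this smoothing-and-thresholding step is given below (the default window size of 17 and threshold of 0.01 follow the implementation details reported later in Section 3.4.1; the toy call uses smaller values for compactness):

```python
import statistics

def refine_tube(scores, window=17, thresh=0.01):
    # Smooth per-frame actionness scores with a median filter, then keep
    # only the frames whose smoothed score reaches the threshold; the
    # kept indices give the refined temporal extent of the action tube.
    half = window // 2
    smoothed = [statistics.median(scores[max(0, i - half): i + half + 1])
                for i in range(len(scores))]
    return [i for i, s in enumerate(smoothed) if s >= thresh]

# short toy tube: the spurious boundary frames are filtered out
scores = [0.0, 0.0, 0.9, 0.8, 0.9, 0.7, 0.0, 0.0]
print(refine_tube(scores, window=3, thresh=0.5))  # -> [2, 3, 4, 5]
```

The median filter is what makes the trimming robust to an isolated low score inside an otherwise confident run of frames.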
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and the optical flow features from the action tubes as its input and follows the Faster-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, the feature extraction and the detection head are introduced below.
(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, ranging from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and the dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
[Figure 3.3 (diagram): (a) Action Detection Method — RGB or optical flow images → RPN → initial proposals → ROI pooling (region features) → detection head → detected bounding boxes for the RGB or flow stream; (b) Action Segment Detector — RGB or optical flow features from action tubes → SPN → initial segment proposals → segment feature extraction (segment features) → detection head → detected segments for the RGB or flow stream.]

Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but a different dilation rate is employed to generate features with different receptive fields for deciding the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context with a temporal span of s/2 before the start and after the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
(b) Segment feature extraction. After generating the segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length l_s, start point t_start and end point t_end, we extract three
types of features: F_center, F_start and F_end. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features F_center ∈ R^{D×l_t}, where l_t is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features F_start (resp. F_end) from the segments centered at t_start (resp. t_end) with a length of 2l_s/5 to represent the starting (resp. ending) information. We then concatenate F_start, F_center and F_end to construct the segment features F_seg.
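The construction of F_seg can be sketched as follows, with simple average pooling standing in for temporal ROI-Pooling; the array shapes, bin count and helper names are our illustrative assumptions:

```python
import numpy as np

def pool_window(features, t0, t1, out_len):
    # Stand-in for temporal ROI-Pooling: average the feature columns
    # falling in [t0, t1) into `out_len` equally sized bins.
    # `features` is a (D, T) array of per-frame action-tube features.
    t = features.shape[1]
    edges = np.linspace(max(t0, 0.0), min(t1, float(t)), out_len + 1)
    return np.stack(
        [features[:, int(a):max(int(a) + 1, int(b))].mean(axis=1)
         for a, b in zip(edges[:-1], edges[1:])], axis=1)

def segment_features(features, t_start, t_end, out_len=4):
    # F_seg = [F_start; F_center; F_end]; the start/end windows are
    # centred at the proposal boundaries with length 2 * l_s / 5.
    ls = t_end - t_start
    ctx = 2.0 * ls / 5.0
    f_center = pool_window(features, t_start, t_end, out_len)
    f_start = pool_window(features, t_start - ctx / 2, t_start + ctx / 2, out_len)
    f_end = pool_window(features, t_end - ctx / 2, t_end + ctx / 2, out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)

feats = np.ones((8, 100))                # D = 8 feature dims, T = 100 frames
f_seg = segment_features(feats, 20, 60)  # proposal covering frames [20, 60)
assert f_seg.shape == (8, 12)            # three pooled windows of 4 bins each
```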
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that predicting the actionness for each frame is beneficial for extracting discriminative features, which eventually helps refine the boundaries of the segments.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at both the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate the action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of a feature cooperation module and a proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detection method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2 (b), for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
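The temporal NMS used when merging the output segments of the two stages can be sketched as a standard 1-D greedy suppression; the threshold and scores below are illustrative:

```python
def temporal_iou(a, b):
    # IoU of two 1-D segments given as (start, end)
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, thresh=0.5):
    # Greedily keep the highest-scoring segments, dropping any segment
    # whose temporal IoU with an already-kept one exceeds the threshold.
    order = sorted(range(len(segments)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# combine the detected segments from both stages, then suppress duplicates
segs = [(10, 50), (12, 52), (60, 90)]
scores = [0.9, 0.8, 0.7]
print(temporal_nms(segs, scores))  # -> [0, 2]
```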
3.4 Experimental Results
We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on the three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and for each class we average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is then calculated by averaging these per-class averaged IoU scores over all classes.
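Our reading of the MABO computation can be sketched as follows; the per-class dictionaries, box format and class names are illustrative assumptions:

```python
def box_iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mabo(dets_by_class, gts_by_class):
    # For each ground-truth box, take the best IoU achieved by any
    # detection of that class; average per class, then over classes.
    abos = []
    for cls, gts in gts_by_class.items():
        dets = dets_by_class.get(cls, [])
        best = [max((box_iou(d, g) for d in dets), default=0.0) for g in gts]
        abos.append(sum(best) / len(best))
    return sum(abos) / len(abos)

gts = {"run": [(0, 0, 10, 10)], "jump": [(0, 0, 10, 10)]}
dets = {"run": [(0, 0, 10, 10)], "jump": [(0, 0, 10, 5)]}
print(mabo(dets, gts))  # -> 0.75 (ABO 1.0 for "run", 0.5 for "jump")
```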
To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is decreased by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of the action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞), i.e., we have M = 4, and the corresponding feature lengths l_t for the length ranges are set to 15, 30, 60 and 120, respectively. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.
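The grouping of segment proposals into the M = 4 length ranges (each with its own detection-head sub-network and feature length l_t) can be sketched as below; whether the range boundaries are inclusive is our assumption:

```python
import bisect

def length_group(num_frames):
    """Map a segment proposal's temporal length (in frames) to one of the
    M = 4 length-range groups [20, 80], [80, 160], [160, 320], [320, inf)
    and the corresponding feature length l_t."""
    boundaries = (80, 160, 320)            # splits between the M ranges
    feature_lengths = (15, 30, 60, 120)    # l_t per range
    g = bisect.bisect_right(boundaries, num_frames)
    return g, feature_lengths[g]

print(length_group(100))  # -> (1, 30): falls in the [80, 160] range
print(length_group(400))  # -> (3, 120): falls in the [320, inf) range
```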
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ = 0.5.
Method                                    Streams     mAP
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow    35.8
Peng and Schmid [54]                      RGB+Flow    65.7
Ye, Yang and Tian [94]                    RGB+Flow    67.0
Kalogeiton et al. [26]                    RGB+Flow    69.5
Gu et al. [19]                            RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]             RGB         73.2
Faster R-CNN + FPN (Flow) [38]            Flow        64.7
Faster R-CNN + FPN [38]                   RGB+Flow    75.5
PCSC (Ours)                               RGB+Flow    79.2
Table 3.1 shows the frame-level mAPs of different methods on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] with an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here, "AD" stands for the actionness detector and "ASD" for the action segment detector.
Method                             Streams          δ=0.2   δ=0.5   δ=0.75   0.5:0.95
Weinzaepfel et al. [83]            RGB+Flow         46.8    -       -        -
Zolfaghari et al. [105]            RGB+Flow+Pose    47.6    -       -        -
Peng and Schmid [54]               RGB+Flow         73.5    32.1    2.7      7.3
Saha et al. [60]                   RGB+Flow         66.6    36.4    0.8      14.4
Yang, Gao and Nevatia [93]         RGB+Flow         73.5    37.8    -        -
Singh et al. [67]                  RGB+Flow         73.5    46.3    15.0     20.4
Ye, Yang and Tian [94]             RGB+Flow         76.2    -       -        -
Kalogeiton et al. [26]             RGB+Flow         76.5    49.2    19.7     23.4
Li et al. [31]                     RGB+Flow         77.9    -       -        -
Gu et al. [19]                     RGB+Flow         -       59.9    -        -
Faster R-CNN + FPN (RGB) [38]      RGB              77.5    51.3    14.2     21.9
Faster R-CNN + FPN (Flow) [38]     Flow             72.7    44.2    7.4      17.2
Faster R-CNN + FPN [38]            RGB+Flow         80.1    53.2    15.9     23.7
PCSC + AD [69]                     RGB+Flow         84.3    61.0    23.0     27.8
PCSC + ASD + AD + T-PCSC (Ours)    RGB+Flow         84.7    63.1    23.6     28.9
per-frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metrics [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC module) is applied to refine the action tubes generated by our PCSC method. Our preliminary
work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with the other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to contain the actions only, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating the action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ = 0.5.
Method                            Streams     mAP
Peng and Schmid [54]              RGB+Flow    58.5
Ye, Yang and Tian [94]            RGB+Flow    63.2
Kalogeiton et al. [26]            RGB+Flow    65.7
Hou, Chen and Shah [21]           RGB         61.3
Gu et al. [19]                    RGB+Flow    73.3
Sun et al. [72]                   RGB+Flow    77.9
Faster R-CNN + FPN (RGB) [38]     RGB         64.7
Faster R-CNN + FPN (Flow) [38]    Flow        68.4
Faster R-CNN + FPN [38]           RGB+Flow    70.2
PCSC (Ours)                       RGB+Flow    74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.
Method                            Streams          δ=0.2   δ=0.5   δ=0.75   0.5:0.95
Gkioxari and Malik [18]           RGB+Flow         -       53.3    -        -
Wang et al. [79]                  RGB+Flow         -       56.4    -        -
Weinzaepfel et al. [83]           RGB+Flow         63.1    60.7    -        -
Saha et al. [60]                  RGB+Flow         72.6    71.5    43.3     40.0
Peng and Schmid [54]              RGB+Flow         74.1    73.1    -        -
Singh et al. [67]                 RGB+Flow         73.8    72.0    44.5     41.6
Kalogeiton et al. [26]            RGB+Flow         74.2    73.7    52.1     44.8
Ye, Yang and Tian [94]            RGB+Flow         75.8    73.8    -        -
Zolfaghari et al. [105]           RGB+Flow+Pose    78.2    73.4    -        -
Hou, Chen and Shah [21]           RGB              78.4    76.9    -        -
Gu et al. [19]                    RGB+Flow         -       78.6    -        -
Sun et al. [72]                   RGB+Flow         -       80.1    -        -
Faster R-CNN + FPN (RGB) [38]     RGB              74.5    73.9    54.1     44.2
Faster R-CNN + FPN (Flow) [38]    Flow             77.9    76.1    56.1     46.3
Faster R-CNN + FPN [38]           RGB+Flow         79.1    78.5    57.2     47.6
PCSC (Ours)                       RGB+Flow         82.6    82.2    63.1     52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best, worse only than the very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
Stage    PCSC w/o feature cooperation    PCSC
0        75.5                            78.1
1        76.1                            78.6
2        76.4                            78.8
3        76.7                            79.1
4        76.7                            79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here, "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.
Method                        Video mAP (%)
Faster R-CNN + FPN [38]       53.2
PCSC                          56.4
PCSC + AD                     61.0
PCSC + ASD (single branch)    58.4
PCSC + ASD                    60.7
PCSC + ASD + AD               61.9
PCSC + ASD + AD + T-PCSC      63.1
Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
Stage | Video mAP (%)
0     | 61.9
1     | 62.8
2     | 63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, this improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
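The stage-wise combination described above can be sketched as follows. This is a minimal NumPy sketch of greedy NMS applied to the union of detections from all stages; the function names and the IoU threshold of 0.5 are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-8)
        order = order[1:][iou <= iou_thresh]
    return keep

def fuse_stages(stage_boxes, stage_scores, iou_thresh=0.5):
    """Output at stage k: NMS over the union of detections from stages 0..k."""
    boxes = np.concatenate(stage_boxes, axis=0)
    scores = np.concatenate(stage_scores, axis=0)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```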
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, where video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage of our PCSC action detector on the UCF-101-24 dataset.
Stage | LoR (%)
1     | 61.1
2     | 71.9
3     | 95.2
4     | 95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAP. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about a 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages, to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage of our PCSC action detector on the UCF-101-24 dataset.
Stage | MABO (%)
1     | 84.1
2     | 84.3
3     | 84.5
4     | 84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
θ    | Stage 1 | Stage 2 | Stage 3 | Stage 4
0.5  | 55.3    | 55.7    | 56.3    | 57.0
0.6  | 62.1    | 63.2    | 64.3    | 65.5
0.7  | 68.0    | 68.5    | 69.4    | 69.5
0.8  | 73.9    | 74.5    | 74.8    | 75.1
0.9  | 78.4    | 78.4    | 79.5    | 80.1
Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
Stage                 | 0   | 1   | 2   | 3   | 4
Detection speed (FPS) | 3.1 | 2.9 | 2.8 | 2.6 | 2.4
Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
Stage                 | 0   | 1   | 2
Detection speed (VPS) | 2.1 | 1.7 | 1.5
ASD + AD" method; Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
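The LoR statistic defined above can be computed as follows. This is a small NumPy sketch operating on hypothetical per-frame proposal sets; the 0.7 threshold follows the definition in the text, while the function names are our own.

```python
import numpy as np

def box_iou_matrix(a, b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-8)

def large_overlap_ratio(rgb_props, flow_props, thresh=0.7):
    """Fraction of RGB-stream proposals whose max IoU (mIoU) with
    any flow-stream proposal exceeds the threshold (0.7 in the text)."""
    miou = box_iou_matrix(rgb_props, flow_props).max(axis=1)
    return float((miou > thresh).mean())
```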
To further evaluate the improvement of our PCSC method over the stages, we report MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, a clear improvement after each stage can be observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.
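MABO can be sketched as the mean, over action classes, of the average best overlap between each ground-truth box and the detected boxes. The following is a minimal sketch under that standard definition; the data layout (a dict of per-class ground-truth boxes) is an illustrative assumption.

```python
import numpy as np

def best_overlap(gt_box, proposals):
    """Best IoU between one ground-truth [x1, y1, x2, y2] box and a box set."""
    x1 = np.maximum(gt_box[0], proposals[:, 0])
    y1 = np.maximum(gt_box[1], proposals[:, 1])
    x2 = np.minimum(gt_box[2], proposals[:, 2])
    y2 = np.minimum(gt_box[3], proposals[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    return float((inter / (area_gt + area_p - inter + 1e-8)).max())

def mean_average_best_overlap(gt_by_class, proposals):
    """MABO: per class, average the best overlap of its ground-truth
    boxes with the detected boxes; then take the mean over classes."""
    abos = []
    for gt_boxes in gt_by_class.values():
        abos.append(np.mean([best_overlap(np.asarray(g, float), proposals)
                             for g in gt_boxes]))
    return float(np.mean(abos))
```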
Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b)-(d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c); however, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,
and our two-stream cooperation method. The detection method using only the RGB or the flow stream leads to either missed detections (see Fig. 3.4(a)) or false detections (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of these two results can solve the missed-detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).
Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module significantly decreases the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases where our method does not work (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, it cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections from the background frames is still an open issue,
which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4
PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve a high recall rate and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the video analysis community. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] have applied convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined these two streams to exchange complementary information between the two modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model the spatio-temporal information of videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, temporal action localization methods often employ the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action of each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on a given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level
interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine the complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at both the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial to fully exploit the complementarity between the appearance and motion clues.
4.2 Our Approach
In the following, we first briefly introduce our overall temporal action localization framework (in Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (in Sec. 4.2.2), as well as the newly proposed Anchor-Frame Cooperation (AFC) module (in Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.
The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.
Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $P^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $P^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$, $S_0^{Flow}$ and the predicted action categories.
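The segment-of-interest pooling referenced above can be sketched as a 1D analogue of RoI pooling over the temporal feature map. The bin count and feature shapes below are illustrative assumptions, not the exact configuration of TAL-Net [5].

```python
import numpy as np

def soi_pooling(features, segment, num_bins=7):
    """1D segment-of-interest pooling: max-pool a C x T feature map
    over num_bins equal sub-intervals of [t_start, t_end)."""
    C, T = features.shape
    t_start, t_end = segment
    # clip the segment to the valid temporal range
    t_start = max(0, int(np.floor(t_start)))
    t_end = min(T, int(np.ceil(t_end)))
    edges = np.linspace(t_start, t_end, num_bins + 1)
    pooled = np.zeros((C, num_bins), dtype=features.dtype)
    for b in range(num_bins):
        lo = int(np.floor(edges[b]))
        hi = max(lo + 1, int(np.ceil(edges[b + 1])))
        pooled[:, b] = features[:, lo:hi].max(axis=1)  # max over the sub-interval
    return pooled
```

The pooled fixed-size feature can then be fed to the detection head regardless of the proposal length.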
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature of each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals from the preceding two proposal generation
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at stage $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $S_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $S_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims to encourage rich feature-level and proposal-level collaboration simultaneously between the two granularities (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).
To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as RGB-stream AFC and flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage will output the RGB-stream action segments $S_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, as well as update the flow-stream frame-based features as $Q_n^{Flow}$. These outputs will be used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
Chapter 4. PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure will seamlessly encourage collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since rich cooperation can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages will be faithfully propagated to the subsequent AFC stages. Thus, the proposed progressive localization strategy can gradually improve the localization performance.
4.2.3 Anchor-Frame Cooperation Module
The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, named according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features P^RGB_0, P^Flow_0 and Q^RGB_0, Q^Flow_0, as well as the proposals S^RGB_0, S^Flow_0, as the input, as described in Sec. 4.2.1 and Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; its network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features P^Flow_{n-1} to the frame-based features Q^Flow_{n-2}, both from the flow stream. The message passing operation is defined as

Q^Flow_n = M_{anchor→frame}(P^Flow_{n-1}) ⊕ Q^Flow_{n-2},    (4.1)

where ⊕ is the element-wise addition operation and M_{anchor→frame}(·) is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked 1×1 convolutional layers with ReLU non-linear activation functions. The improved features Q^Flow_n can generate more accurate action starting and ending probability distributions (a.k.a. more accurate action durations), as they absorb the temporal duration-aware anchor-based features, and can therefore generate better frame-based proposals Q^Flow_n, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as Q^RGB_n = M_{anchor→frame}(P^RGB_{n-1}) ⊕ Q^RGB_{n-2}, which flips the superscripts from Flow to RGB. Q^RGB_n also leads to better frame-based RGB-stream proposals Q^RGB_n.
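As a concrete illustration of the operation in Equation (4.1), the sketch below implements the anchor-to-frame message passing as two stacked per-frame linear layers with a ReLU in between (a 1×1 temporal convolution is equivalent to a per-frame linear layer), followed by element-wise addition with the frame-based features. This is a minimal pure-Python sketch under our own naming (`fuse`, `W1`, `b1`, etc.), not the thesis implementation, which uses convolutional layers in a deep learning framework.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    # Per-frame linear layer (equivalent to a 1x1 conv over the temporal axis).
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi for row, bi in zip(W, b)]

def message_passing(P_anchor, W1, b1, W2, b2):
    """M_{anchor->frame}: two stacked 1x1 convs with a ReLU in between,
    applied independently to each frame's anchor-based feature vector."""
    return [linear(relu(linear(f, W1, b1)), W2, b2) for f in P_anchor]

def fuse(P_anchor, Q_frame, W1, b1, W2, b2):
    """Eq. (4.1): Q_n = M(P_{n-1}) (+) Q_{n-2}, with (+) an element-wise addition."""
    M = message_passing(P_anchor, W1, b1, W2, b2)
    return [[m + q for m, q in zip(mf, qf)] for mf, qf in zip(M, Q_frame)]
```

With identity weights, the fused feature is simply relu(P) added to Q frame by frame, which makes the residual structure of the module easy to see.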
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals Q^Flow_n by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1, based on the improved frame-based features Q^Flow_n. We then combine Q^Flow_n together with the two-stream anchor-based proposals S^Flow_{n-1} and S^RGB_{n-2} to form an updated set of anchor-based proposals at stage n, i.e., P^RGB_n = S^RGB_{n-2} ∪ S^Flow_{n-1} ∪ Q^Flow_n. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as P^Flow_n = S^Flow_{n-2} ∪ S^RGB_{n-1} ∪ Q^RGB_n.
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated by a message passing operation similar to Equation (4.1), i.e., P^RGB_n = M_{flow→rgb}(P^Flow_{n-1}) ⊕ P^RGB_{n-2} for the RGB stream, and P^Flow_n = M_{rgb→flow}(P^RGB_{n-1}) ⊕ P^Flow_{n-2} for the flow stream.
Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals P^RGB_n, P^Flow_n, and the updated anchor-based features P^RGB_n, P^Flow_n. To increase the perceptual aperture around action boundaries, we add two more SOI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
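The boundary-extended pooling described above can be sketched as follows, with simple mean pooling standing in for the SOI pooling of [5]; the helper names and the mean-pooling simplification are ours, and the `interval` argument corresponds to the fixed interval of 10 mentioned in the text.

```python
def mean_pool(feats, lo, hi):
    """Average the per-frame feature vectors over [lo, hi), clamped to the sequence."""
    lo, hi = max(0, lo), min(len(feats), hi)
    if lo >= hi:
        return [0.0] * len(feats[0])
    window = feats[lo:hi]
    return [sum(col) / len(window) for col in zip(*window)]

def proposal_feature(feats, start, end, interval=10):
    """Concatenate the segment pooling with two extra poolings around the
    starting and ending indices, as in the detection head description."""
    center = mean_pool(feats, start, end)
    around_start = mean_pool(feats, start - interval, start + interval)
    around_end = mean_pool(feats, end - interval, end + interval)
    return center + around_start + around_end
```

The concatenated vector is what would be fed to the segment regressor and action classifier.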
4.2.4 Training Details
At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: the cross-entropy loss for the action classification task and the smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
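The loss components above can be sketched in pure Python as below. The `stage_loss` composition with unit weights is an assumption for illustration (the text only states that all losses are summed), and the helper names are ours.

```python
import math

def smooth_l1(pred, target):
    """Smooth-L1 (Huber) loss used for the boundary regression task."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def cross_entropy(probs, label):
    """Cross-entropy given a predicted probability vector and the true class index,
    used for action classification and frame-wise start/end supervision."""
    return -math.log(probs[label])

def stage_loss(cls_probs, cls_label, boundary_preds, boundary_targets,
               frame_probs, frame_labels):
    """Hypothetical total loss at one AFC stage: classification + regression
    + frame-wise terms, summed with unit weights (an assumption)."""
    reg = sum(smooth_l1(p, t) for p, t in zip(boundary_preds, boundary_targets))
    frame = sum(cross_entropy(p, l) for p, l in zip(frame_probs, frame_labels))
    return cross_entropy(cls_probs, cls_label) + reg + frame
```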
4.2.5 Discussions
Variants of the AFC Module. Besides using the aforementioned AFC module as illustrated in Fig. 4.1(b-c), one can also simplify the structure by either removing the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] operates in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, and thus significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. However, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.
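The greedy temporal NMS referred to here can be sketched as follows; this is a generic NMS over (start, end, score) triplets with an assumed IoU threshold, not the exact configuration used in the detection heads.

```python
def t_iou(a, b):
    # Temporal IoU between two segments given as (start, end, ...) tuples.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, iou_thresh=0.5):
    """Greedy NMS: keep the most confident segment, drop others that overlap
    it beyond iou_thresh, and repeat on the remainder."""
    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(t_iou(seg, k) <= iou_thresh for k in kept):
            kept.append(seg)
    return kept
```

Redundant near-duplicate segments are removed while confident non-overlapping ones survive, which is why the proposal count stays roughly stable across stages.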
4.3 Experiment
4.3.1 Datasets and Setup
Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3], and UCF-101-24 [68].
- THUMOS14 contains 1,010 and 1,574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation, and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.
- UCF-101-24 consists of 3,207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
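The evaluation criterion above can be sketched as follows; `temporal_iou` and `is_correct` are illustrative helper names, and for UCF-101-24 the st-IoU would replace the one-dimensional overlap with a spatio-temporal one.

```python
def temporal_iou(seg_a, seg_b):
    """tIoU between two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(det_seg, det_label, gt_seg, gt_label, delta=0.5):
    """A detection counts as correct if its tIoU with a ground-truth instance
    exceeds delta and its predicted action label matches."""
    return det_label == gt_label and temporal_iou(det_seg, gt_seg) > delta
```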
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3, and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5, our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot fully capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method can still achieve an mAP of 48.37% without the GCN module, which is comparable to the performance of P-GCN.

Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ   0.1    0.2    0.3    0.4    0.5
S-CNN [63]         47.7   43.5   36.3   28.7   19.0
REINFORCE [95]     48.9   44.0   36.0   26.4   17.1
CDC [62]           -      -      40.1   29.4   23.3
SMS [98]           51.0   45.2   36.5   27.8   17.8
TURN [15]          54.0   50.9   44.1   34.9   25.6
R-C3D [85]         54.5   51.5   44.8   35.6   28.9
SSN [103]          66.0   59.4   51.9   41.0   29.8
BSN [37]           -      -      53.5   45.0   36.9
MGG [46]           -      -      53.9   46.8   37.4
BMN [35]           -      -      56.0   47.4   38.8
TAL-Net [5]        59.8   57.1   53.2   48.5   42.8
P-GCN [99]         69.5   67.8   63.6   57.8   49.1
Ours               71.2   68.9   65.1   59.5   51.2
ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = 0.5, 0.75, 0.95 on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103], and TAL-Net [5] by 1.86%, 4.87%, and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using extra external labels as in [35] and proposals from BMN, and its average mAP (34.91%) is better than those of BSN and BMN.

Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ   0.5     0.75    0.95    Average
CDC [62]           45.30   26.00   0.20    23.80
TCN [9]            36.44   21.15   3.90    -
R-C3D [85]         26.80   -       -       -
SSN [103]          39.12   23.48   5.49    23.98
TAL-Net [5]        38.23   18.30   1.30    20.22
P-GCN [99]         42.90   28.14   2.47    26.99
Ours               44.31   29.85   5.47    28.85

Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ   0.5     0.75    0.95    Average
BSN [37]           46.45   29.96   8.02    30.03
P-GCN [99]         48.26   33.16   3.27    31.11
BMN [35]           50.07   34.78   8.29    33.85
Ours               52.04   35.92   7.97    34.91
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5, 0.75, as well as the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.

Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ   0.2    0.5    0.75   0.5:0.95
LTT [83]             46.8   -      -      -
MRTSR [54]           73.5   32.1   2.7    7.3
MSAT [60]            66.6   36.4   7.9    14.4
ROAD [67]            73.5   46.3   15.0   20.4
ACT [26]             76.5   49.2   19.7   23.4
TAD [19]             -      59.9   -      -
PCSC [69]            84.3   61.0   23.0   27.8
Ours                 84.9   63.5   23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which Q^Flow_n and Q^RGB_n are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, in which case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P^Flow_{n-1} and P^RGB_{n-2} (for the RGB-stream AFC module) and P^RGB_{n-1} and P^Flow_{n-2} (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for fair comparison.

Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset (mAP %, tIoU threshold δ = 0.5).

Stage        0       1       2       3       4
AFC          44.75   46.92   48.14   48.37   48.22
w/o CS       44.21   44.96   45.47   45.48   45.37
w/o CG       43.55   45.95   46.71   46.65   46.67
w/o CS-MP    44.75   45.34   45.51   45.53   45.49
w/o CG-MP    44.75   46.32   46.97   46.85   46.79
w/o PC       42.14   43.21   45.14   45.74   45.77
In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P^RGB_0 ∪ P^Flow_0 for "w/o CG" and Q^RGB_0 ∪ Q^Flow_0 for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP", and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods generally improve when the number of stages increases, and the results become stable after 2 or 3 stages.

Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison               LoR (%)
1       P^RGB_1 vs. S^Flow_0     67.1
2       P^Flow_2 vs. P^RGB_1     79.2
3       P^RGB_3 vs. P^Flow_2     89.5
4       P^Flow_4 vs. P^RGB_3     97.1
Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets P^RGB_1 at stage 1 and P^Flow_2 at stage 2 as examples. The LoR score at stage 2 is the ratio of the similar proposals in P^Flow_2 among all proposals in P^Flow_2. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P^RGB_1 is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say, P^Flow_2) is calculated by comparing it with all proposals in the other proposal set (say, P^RGB_1) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
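The LoR computation can be sketched as follows, with proposals represented as (start, end) pairs; the helper names are ours.

```python
def t_iou(a, b):
    # Temporal IoU between two (start, end) proposals.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(curr_props, prev_props, thresh=0.7):
    """LoR (%): fraction of proposals in curr_props whose max-tIoU against
    prev_props exceeds thresh, i.e., the 'similar proposals'."""
    if not curr_props:
        return 0.0
    similar = sum(
        1 for p in curr_props
        if max(t_iou(p, q) for q in prev_props) > thresh
    )
    return 100.0 * similar / len(curr_props)
```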
Table 4.6 shows the LoR scores at different stages, where the LoR scores at stages 1 and 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at stages 2 and 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at stage 1 to 97.1% at stage 4. The large LoR score at stage 4 indicates that the proposals in P^Flow_4 and those in P^RGB_3 are very similar, even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvement by using our temporal action localization method after stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] and calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
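The AR@AN metric can be sketched for a single video as follows (on THUMOS14 the thesis averages eleven recall rates over tIoU thresholds from 0.5 to 1.0 with a step of 0.05); the assumption that the proposal list is pre-sorted by confidence is ours.

```python
def t_iou(a, b):
    # Temporal IoU between two (start, end) segments.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(gts, proposals, an, thresholds):
    """AR@AN for one video: at each tIoU threshold, the fraction of ground-truth
    segments matched by any of the top-AN proposals, then averaged over thresholds."""
    top = proposals[:an]  # proposals assumed sorted by descending confidence
    recalls = []
    for th in thresholds:
        hit = sum(1 for g in gts if any(t_iou(g, p) >= th for p in top))
        recalls.append(hit / len(gts))
    return sum(recalls) / len(recalls)
```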
In Fig. 4.2, we compare the AR@AN results of the action segments generated from the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB", and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated from the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments as in the "SC" method leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.
Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], as well as our PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3, and UCF-101-24 datasets demonstrate the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning, and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but the spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in imagesvideos aims to localize the object of interestin an imagevideo based on a query sentence In most existing meth-ods [42 96 45 82 87 88 41 86 7 102] a pre-trained object detector isoften required to pre-generate object proposals The proposal that bestmatches the given input description is then selected as the final resultFor the visual grounding tasks in images some recent works [92 34 4891] proposed new one-stage grounding frameworks without using thepre-trained object detector For example Liao et al [34] used the anchor-free object detection method [104] to localize the target objects based onthe cross-modal representations For the video grounding task Zhang et
52 Methodology 85
al. [102] proposed a new method (referred to as STGRN) that does not rely on the pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. In the concurrent work STGVT [76], similar to our proposed framework, a visual-linguistic transformer is also adopted to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], the pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology

5.2.1 Overview
We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely, we have $V = \{V^{clip}_k\}_{k=1}^{K}$, where $V^{clip}_k$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.
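The clip-based formulation above can be sketched as follows (a minimal numpy sketch with a hypothetical helper `split_into_clips`; $T = 4$ matches the clip length reported later in the implementation details):

```python
import numpy as np

def split_into_clips(video, T=4):
    """Split a (K*T, H, W, 3) frame array into K non-overlapping clips of
    T frames each, matching the formulation V = {V_k^clip}_{k=1..K}."""
    num_frames = video.shape[0]
    K = num_frames // T
    # drop trailing frames that do not fill a whole clip
    return video[:K * T].reshape(K, T, *video.shape[1:])

video = np.zeros((12, 8, 8, 3))      # a 12-frame toy video
clips = split_into_clips(video, T=4) # shape (3, 4, 8, 8, 3): K=3 clips
```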
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes $b_t$ for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$, and $C$ indicating the height, width, and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F^{clip}_k \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens, ['CLS'] and ['SEP'], before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.
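The construction of the complete textual input tokens can be sketched as follows (a toy sketch with a hypothetical helper `build_textual_tokens`; the thesis feeds learned word embeddings rather than raw strings, and splitting on whitespace stands in for a real tokenizer):

```python
def build_textual_tokens(sentence):
    """Wrap the N word tokens s_1..s_N with the special [CLS]/[SEP] tokens,
    producing E = {e_n}_{n=1..N+2} as described above."""
    words = sentence.split()
    return ["[CLS]"] + words + ["[SEP]"]

tokens = build_textual_tokens("a child in a hat hugs a toy")
# 8 words plus the two special tokens give N + 2 = 10 input tokens
```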
Multi-modal Feature Learning. Given the visual input feature $F^{clip}_k$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
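The key-value exchange in one co-attention layer can be sketched as follows (a minimal single-head numpy sketch; the real ViLBERT module uses learned projection matrices, multiple heads, residual connections, and feed-forward layers, all omitted here, and the function names are hypothetical):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head)."""
    d = Q.shape[-1]
    w = Q @ K.T / np.sqrt(d)
    w = np.exp(w - w.max(axis=-1, keepdims=True))   # numerically stable softmax
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def co_attention(visual, textual):
    """One co-attention exchange: the visual branch queries the textual
    keys/values (text-guided visual feature) and vice versa."""
    text_guided_visual = attention(visual, textual, textual)
    visual_guided_text = attention(textual, visual, visual)
    return text_guided_visual, visual_guided_text

vis = np.random.randn(5, 16)   # 5 visual tokens, dim 16
txt = np.random.randn(7, 16)   # 7 textual tokens, dim 16
tgv, vgt = co_attention(vis, txt)
```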
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space will be lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using the pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules for the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$. We then pass the combined visual feature to the textual branch, where it is used as the key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining part (including the textual branch) is the same as that in ViLBERT [47]. In our ST-ViLBERT, we take the output from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F_{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
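The combination and decomposition steps of the STCD module can be sketched as follows (a minimal numpy sketch, assuming the shapes stated above; the multi-head attention that sits between the two steps is omitted, a plain layer normalization stands in for the Add & Norm block, and the function names are hypothetical):

```python
import numpy as np

def stcd_combine(F):
    """Combination step. F: clip feature of shape (T, HW, C).
    Spatial pooling gives the T x C temporal feature, temporal pooling
    gives the HW x C spatial feature; concatenation yields the
    (T + HW) x C combined visual feature."""
    temporal_feat = F.mean(axis=1)   # (T, C)
    spatial_feat = F.mean(axis=0)    # (HW, C)
    return np.concatenate([temporal_feat, spatial_feat], axis=0)

def stcd_decompose(combined, F, T):
    """Decomposition step. Split the (text-guided) combined feature back
    into temporal (T x C) and spatial (HW x C) parts, replicate each to
    (T, HW, C) via broadcasting, add to the input feature, and normalize."""
    temporal, spatial = combined[:T], combined[T:]
    rep = temporal[:, None, :] + spatial[None, :, :]   # (T, HW, C)
    out = F + rep
    mu = out.mean(axis=-1, keepdims=True)
    sigma = out.std(axis=-1, keepdims=True)
    return (out - mu) / (sigma + 1e-6)

F = np.random.randn(4, 64, 32)        # T=4, HW=8*8, C=32
combined = stcd_combine(F)            # (4 + 64, 32)
out = stcd_decompose(combined, F, 4)  # (4, 64, 32)
```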
Spatial Localization. Given the text-guided visual feature $F_{tv}$, our SVG-net predicts the bounding box $b_t$ at each frame. We first reshape $F_{tv}$ to the size of $T \times H \times W \times C$. Taking the feature from each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center, and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
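The decoding step described above can be sketched as follows (a simplified numpy sketch with a hypothetical `decode_bbox` helper; it omits the center-offset refinement that CenterNet-style detectors typically use):

```python
import numpy as np

def decode_bbox(heatmap, size_map):
    """Pick the highest-probability center from the 8H x 8W heatmap and
    use the regressed (w, h) at that location to form the box corners."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = size_map[y, x]
    # top-left (x1, y1) and bottom-right (x2, y2) corners
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

hm = np.zeros((64, 64)); hm[10, 20] = 0.9          # peak at (x=20, y=10)
sz = np.full((64, 64, 2), (8.0, 6.0))              # every cell regresses w=8, h=6
box = decode_bbox(hm, sz)                          # (16.0, 7.0, 24.0, 13.0)
```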
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending
frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F_{tv}$ to produce the global text-guided visual feature $F_{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from $F_{gtv}$ and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $\hat{o}_{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t^0_s, t^0_e)$. The initial temporal boundaries $(t^0_s, t^0_e)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B_{init} = \{b_t\}_{t=t^0_s}^{t^0_e}$.
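The temporal boundary selection described above can be sketched as follows (a numpy sketch with a hypothetical `initial_boundaries` helper; the default window of 9 and threshold of 0.3 follow the implementation details given later, and for simplicity the longest run of above-threshold frames stands in for the longest starting/ending pair):

```python
import numpy as np

def initial_boundaries(scores, threshold=0.3, win=9):
    """Median-filter the per-frame relevance scores, threshold them, and
    return the longest run of above-threshold frames as (t_s, t_e)."""
    pad = win // 2
    padded = np.pad(scores, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + win])
                         for i in range(len(scores))])
    keep = smoothed >= threshold
    best, cur, start = None, None, None
    for t, k in enumerate(keep):
        if k:
            if cur is None:
                cur, start = 0, t
            cur += 1
            if best is None or cur > best[1] - best[0] + 1:
                best = (start, t)
        else:
            cur = None
    return best  # (t_s, t_e), or None if no frame passes the threshold

scores = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
t_s, t_e = initial_boundaries(scores, threshold=0.5, win=3)
```

Note that the isolated high score at frame 8 is suppressed by the median filter, so only the long run survives.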
Loss Functions. We use a combination of a focal loss, an L1 loss, and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width, and height, as well as the relevance label, of the ground-truth bounding box as $(\bar{x}, \bar{y})$, $\bar{w}$, $\bar{h}$, and $\bar{o}$, respectively. When the frame contains a ground-truth bounding box, we have $\bar{o} = 1$; otherwise, we have $\bar{o} = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $\bar{A}$ for each frame by using the Gaussian kernel
$\bar{a}_{xy} = \exp\left(-\frac{(x-\bar{x})^2+(y-\bar{y})^2}{2\sigma^2}\right)$, where $\bar{a}_{xy}$ is the value of $\bar{A} \in \mathbb{R}^{8H \times 8W}$ at the spatial location $(x, y)$, and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$$L_{SVG} = \lambda_1 L_c(\hat{A}, \bar{A}) + \lambda_2 L_{rel}(\hat{o}_{obj}, \bar{o}) + \lambda_3 L_{size}(\hat{w}_{xy}, \hat{h}_{xy}, \bar{w}, \bar{h}) \qquad (5.1)$$

where $\hat{A} \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $\hat{o}_{obj}$ is the predicted relevance score, and $\hat{w}_{xy}$ and $\hat{h}_{xy}$ are the predicted width and height of the bounding box centered at the position $(\bar{x}, \bar{y})$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
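The ground-truth center heatmap used by the focal loss $L_c$ can be sketched as follows (a numpy sketch; here $\sigma$ is passed in directly for illustration, whereas the thesis sets it adaptively from the object size following [28]):

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """Ground-truth heatmap a_xy = exp(-((x - x_c)^2 + (y - y_c)^2)
    / (2 * sigma^2)), peaking at the ground-truth box center."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

A_bar = gaussian_heatmap((32, 32), center=(20, 10), sigma=2.0)
# value 1.0 exactly at the center, decaying smoothly around it
```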
5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. On the other hand, while the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect the temporal segments, because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B_{init}$) is used as an anchor and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
$\bar{t}^1_s$ and $\bar{t}^1_e$. The frame-based branch then continues to refine the boundaries $\bar{t}^1_s$ and $\bar{t}^1_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., $t^1_s$ and $t^1_e$) can be used as a better anchor for the anchor-based branch at the next stage. The iterative process will repeat until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch, and the details in other stages are similar to those in the first stage.
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.
Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before and after the starting and the ending frames by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4, t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features along the temporal dimension from the pooled video feature $F^{pool}$ at the 20 sampled frames, where the pooled video feature $F^{pool}$ is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature $F$, together with the textual input tokens $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $\bar{t}^1_s$ and $\bar{t}^1_e$.
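The anchor extension and uniform sampling step can be sketched as follows (a numpy sketch with a hypothetical `anchor_sample_positions` helper; in practice the fractional positions would be rounded to valid frame indexes before indexing into the pooled video feature):

```python
import numpy as np

def anchor_sample_positions(t_s, t_e, num_samples=20):
    """Extend the anchor (t_s, t_e) by l/4 on both sides and uniformly
    sample num_samples frame positions inside the extended duration."""
    l = t_e - t_s
    lo, hi = t_s - l / 4.0, t_e + l / 4.0
    return np.linspace(lo, hi, num_samples)

pos = anchor_sample_positions(t_s=10, t_e=30)
# anchor length l=20, extended duration [5, 35], 20 evenly spaced samples
```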
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position $\bar{t}^1_s$ generated from the previous anchor-based branch, we define the starting duration as $D_s = [\bar{t}^1_s - D/2 + 1, \bar{t}^1_s + D/2]$, where $D$ is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F^{pool}$. We then take the frame-level feature $F_s$ and the textual input tokens $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position $t^1_s$. Accordingly, we can produce the ending frame position $t^1_e$. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $t^1_s$ and $t^1_e$ are generated, they can be used as the input for the anchor-based branch at the next stage.
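The boundary selection inside the starting duration can be sketched as follows (a numpy sketch with a hypothetical `refine_start` helper; `start_probs` stands for the per-frame probabilities produced by the 1D convolution over the text-guided frame-level feature):

```python
import numpy as np

def refine_start(start_probs, t_center, D):
    """Select the starting frame within D_s = [t_center - D/2 + 1,
    t_center + D/2] as the position with the highest probability.
    start_probs holds one probability per frame in D_s (length D)."""
    offsets = np.arange(-D // 2 + 1, D // 2 + 1)
    return t_center + offsets[int(np.argmax(start_probs))]

probs = np.zeros(8); probs[2] = 1.0     # peak at the 3rd frame of D_s
t1_s = refine_start(probs, t_center=20, D=8)
# offsets run from -3 to 4, so the peak maps to frame 20 - 1 = 19
```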
Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$$L_{TBR} = \lambda_4 L_{anc}(\Delta\hat{t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e) \qquad (5.2)$$

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and the ending frames, respectively. In $L_{anc}$, $\Delta\hat{t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.
5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
-VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set, and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960), and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with the spatio-temporal tubes.
-HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold, and the temporal length $T$ of each clip are set as 9, 0.3, and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $vIoU = \frac{1}{|S_U|}\sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
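The vIoU metric can be sketched as follows (a self-contained Python sketch; tubes are represented as hypothetical dictionaries mapping frame indexes to boxes in (x1, y1, x2, y2) form):

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def video_iou(pred, gt):
    """vIoU = (1 / |S_U|) * sum_{t in S_I} r_t, where S_I is the set of
    frames covered by both tubes and S_U the frames covered by either."""
    s_i = set(pred) & set(gt)
    s_u = set(pred) | set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in s_i) / max(len(s_u), 1)

pred = {0: (0, 0, 2, 2), 1: (0, 0, 2, 2)}   # predicted tube over frames 0-1
gt = {1: (0, 0, 2, 2), 2: (0, 0, 2, 2)}     # ground-truth tube over frames 1-2
v = video_iou(pred, gt)  # one perfectly matched frame out of three in S_U
```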
Baseline Methods
-STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires the pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
-STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
-Vanilla (ours): In our proposed SVG-net, the key part is the ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
-SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube, so here we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed temporal boundary refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

                              Declarative Sentence Grounding          Interrogative Sentence Grounding
Method                        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)     m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*                  19.75      25.77        14.60           18.32      21.10        12.83
Vanilla (ours)                21.49      27.69        15.77           20.22      23.17        13.82
SVG-net (ours)                22.87      28.31        16.31           21.57      24.09        14.33
SVG-net + TBR-net (ours)      25.21      30.99        18.21           23.67      26.26        16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGVT [76]*                   18.15      26.81        9.48
Vanilla (ours)                17.94      25.59        8.75
SVG-net (ours)                19.45      27.78        10.12
SVG-net + TBR-net (ours)      22.41      30.31        12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net, and SVG-net + TBR-net, respectively.

From the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at all stages is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) Also, the results show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3. (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                             Declarative Sentence Grounding     Interrogative Sentence Grounding
Method               Stages  m_vIoU  vIoU@0.3  vIoU@0.5         m_vIoU  vIoU@0.3  vIoU@0.5
SVG-net              -       22.87   28.31     16.31            21.57   24.09     14.33
SVG-net + IFB-net    1       23.41   28.84     16.69            22.12   24.48     14.57
                     2       23.82   28.97     16.85            22.55   24.69     14.71
                     3       23.77   28.91     16.87            22.61   24.72     14.79
SVG-net + IAB-net    1       23.27   29.57     16.39            21.78   24.93     14.31
                     2       23.74   29.98     16.52            22.21   25.42     14.59
                     3       23.85   30.15     16.57            22.43   25.71     14.47
SVG-net + TBR-net    1       24.32   29.92     17.22            22.69   25.37     15.29
                     2       24.95   30.75     17.02            23.34   26.03     15.83
                     3       25.21   30.99     18.21            23.67   26.26     16.01
Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net, and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
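For reference, the grounding metrics reported in Table 5.3 can be sketched in code. This is a hedged illustration of the commonly used definitions for this task (vIoU as the per-frame box IoU averaged over the temporal intersection of the predicted and ground-truth tubes, normalized by their temporal union; m_vIoU as its mean over samples; vIoU@R as the fraction of samples whose vIoU exceeds R), not the official evaluation script of this thesis, and all function names are our own.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def viou(pred_tube, gt_tube):
    """pred_tube/gt_tube: dicts mapping frame index -> box.
    Average per-frame IoU over the temporal intersection,
    normalized by the size of the temporal union."""
    t_inter = pred_tube.keys() & gt_tube.keys()
    t_union = pred_tube.keys() | gt_tube.keys()
    if not t_union:
        return 0.0
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in t_inter) / len(t_union)

def summarize(tube_pairs, thresholds=(0.3, 0.5)):
    """m_vIoU and vIoU@R over a list of (pred_tube, gt_tube) pairs."""
    scores = [viou(p, g) for p, g in tube_pairs]
    m_viou = sum(scores) / len(scores)
    viou_at = {r: sum(s > r for s in scores) / len(scores) for r in thresholds}
    return m_viou, viou_at
```

For example, a prediction whose temporal span half-overlaps the ground truth with perfect boxes on the shared frames obtains vIoU = 1/3, which is why tightening temporal boundaries (as the TBR-net does) directly raises these scores.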
5.4 Summary
In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet, and it can be foreseen that the number of videos will continue to grow in the future. Learning techniques for video localization will therefore attract increasing attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our works and discuss potential future directions for video localization.
6.1 Conclusions
The contributions of this thesis are summarized as follows:
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
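The cooperation loop summarized above can be sketched structurally. The code below is only an illustration of the idea under stated assumptions: `detect` and `fuse` are placeholder callables standing in for the per-stream R-CNN detector and the feature-fusion operation, and none of this is the actual thesis implementation.

```python
# Structural sketch of progressive cross-stream cooperation (PCSC).
# At each stage, one stream is refined with the other stream's region
# proposals (proposal-level cooperation) and features (feature-level
# cooperation), alternating between the RGB and flow streams.

def pcsc(rgb_feat, flow_feat, detect, fuse, num_stages=4):
    """detect(feature, proposals) -> list of boxes; fuse(a, b) -> feature.
    Both callables are illustrative stand-ins, not the thesis API."""
    feats = {"rgb": rgb_feat, "flow": flow_feat}
    boxes = {s: detect(feats[s], None) for s in ("rgb", "flow")}
    for stage in range(num_stages):
        cur, other = ("rgb", "flow") if stage % 2 == 0 else ("flow", "rgb")
        proposals = boxes[cur] + boxes[other]    # share region proposals
        fused = fuse(feats[cur], feats[other])   # share features
        boxes[cur] = detect(fused, proposals)    # refined detections
    return boxes["rgb"] + boxes["flow"]
```

The key design choice mirrored here is that the two streams never merge into one branch; each keeps its own detector while repeatedly borrowing the other's evidence, so the results improve progressively over stages.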
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module, which takes advantage of both the frame-based and the anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
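The stacking of AFC modules can be illustrated with a short sketch. This is a hedged structural illustration only: `merge` is a placeholder for the learned cooperation operation between the two granularities, and the function names are our own, not the thesis code.

```python
# Illustrative sketch of stacking anchor-frame cooperation (AFC) modules:
# anchor-based segment proposals absorb complementary frame-based evidence
# and vice versa, and the exchange is repeated over several stages.

def afc_stage(anchor_props, frame_props, merge):
    """One cooperation step between the two proposal granularities."""
    anchor_props = merge(anchor_props, frame_props)  # anchors use frame clues
    frame_props = merge(frame_props, anchor_props)   # frames use anchor clues
    return anchor_props, frame_props

def pcg_tal(anchor_props, frame_props, merge, num_stages=3):
    """Stack several AFC modules, then pool both granularities."""
    for _ in range(num_stages):
        anchor_props, frame_props = afc_stage(anchor_props, frame_props, merge)
    return merge(anchor_props, frame_props)
```

A minimal `merge` could simply pool and deduplicate segments; in the actual framework the cooperation is learned end-to-end at both the feature and the segment proposal level.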
• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
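The iterative boundary refinement summarized in the last bullet can be sketched as follows. This is a hedged illustration of the general idea, not the TBR-net implementation: `regress` stands in for an anchor-based boundary regressor, and `frame_update` for a frame-based update driven by per-frame relevance scores; both names are our own.

```python
# Hedged sketch of multi-stage temporal boundary refinement: each stage
# applies a frame-based update (per-frame relevance scores) followed by an
# anchor-based regression update, so the two complementary refinement
# paradigms cooperate across stages.

def frame_update(start, end, frame_scores, thr=0.5):
    """Extend the segment to cover all frames scored above `thr`."""
    keep = [t for t, s in enumerate(frame_scores) if s > thr]
    if not keep:
        return start, end
    return min(keep + [start]), max(keep + [end])

def tbr(start, end, frame_scores, regress, num_stages=3):
    """regress(start, end) -> (d_start, d_end) boundary offsets."""
    for _ in range(num_stages):
        start, end = frame_update(start, end, frame_scores)
        d_start, d_end = regress(start, end)   # anchor-based refinement
        start, end = start + d_start, end + d_end
    return start, end
```

In the sketch, the frame-based step recovers coarse coverage from per-frame evidence while the regression step can make finer signed adjustments, which is the complementarity the bullet refers to.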
6.2 Future Work
A promising future direction for video localization is to unify spatial video localization and temporal video localization into
one framework. As described in Chapter 3, most of the existing works generate the spatial and the temporal action localization results with two separate networks. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches and generate the spatial and the temporal visual grounding results simultaneously.
Declaration
I, Rui SU, declare that this thesis titled "Deep Learning for Spatial and Temporal Video Localization", which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work, except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.
Signed
Date June 17 2021
For Mama and Papa
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which helped me become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborated works. Through the useful discussions with them, I worked out many difficult problems and was encouraged to explore the unknown in the future.
Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.
Rui SU
University of SydneyJune 17 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou and Wanli Ouyang, "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng and Wanli Ouyang, "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization", IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou and Dong Xu, "Improving action localization by progressive cross-stream cooperation", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv
1 Introduction 111 Spatial-temporal Action Localization 212 Temporal Action Localization 513 Spatio-temporal Visual Grounding 814 Thesis Outline 11
2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102
6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104
Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c) 33
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively 43
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC (best viewed on screen) 57
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case (best viewed on screen) 58
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments 66
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset 80
4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment 81
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes 86
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)) 89
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At Stage 1, the input object tube is generated by our SVG-net 93
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5 48
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for actionness detector and "ASD" for action segment detector 49
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5 50
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds 51
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset 52
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector while "ASD" stands for the action segment detector 52
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset 52
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset 54
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset 55
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset 55
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages 55
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages 55
4.1 Comparison (mAPs %) on the THUMOS14 dataset using different tIoU thresholds 74
4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used 75
4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied 75
4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds 76
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset 77
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset 78
5.1 Results of different methods on the VidSTG dataset (marked results are quoted from the original work) 99
5.2 Results of different methods on the HC-STVG dataset (marked results are quoted from the original work) 100
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset 101
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects in the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatial-temporal Action Localization
Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.
For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture human actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on the spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
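To make the data association idea concrete, the following is a minimal pure-Python sketch of greedy frame-by-frame tube linking based on spatial overlap and class scores. The IoU threshold and the scoring rule (IoU plus detection score) are our own simplified assumptions for illustration, not the exact procedures used in [18, 54, 60, 67].

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tube(detections, iou_thr=0.3):
    """Greedily link per-frame detections into one action tube.

    detections: list over frames; each frame is a list of (box, score).
    Returns a list of boxes (None where no link is found).
    """
    tube, prev = [], None
    for frame_dets in detections:
        if not frame_dets:
            tube.append(None)
            continue
        if prev is None:
            # no previous box yet: start from the highest-scoring detection
            best = max(frame_dets, key=lambda d: d[1])
        else:
            # among boxes overlapping the previous one, prefer high IoU + score
            scored = [(iou(prev, b) + s, (b, s)) for b, s in frame_dets
                      if iou(prev, b) >= iou_thr]
            if not scored:
                tube.append(None)
                continue
            best = max(scored, key=lambda t: t[0])[1]
        prev = best[0]
        tube.append(prev)
    return tube
```

Note how the second frame links the lower-scoring box because it spatially overlaps the previous one; this is exactly the behaviour that makes temporal boundaries hard to recover from association alone.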
Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
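At the proposal level, the cooperation amounts to taking the union of the two streams' latest region proposals and removing near-duplicates. A minimal sketch follows; the greedy NMS procedure and the 0.7 suppression threshold are illustrative assumptions, not the exact settings of PCSC.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(proposals, thr=0.7):
    """Greedy non-maximum suppression over (box, score) proposals."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: -p[1]):
        if all(iou(box, k[0]) < thr for k in kept):
            kept.append((box, score))
    return kept

def combine_streams(rgb_proposals, flow_proposals, thr=0.7):
    """Merge region proposals from the RGB and flow streams.

    Each stream may find boxes the other misses; the union enlarges the
    candidate set, and NMS removes near-duplicate boxes found by both.
    """
    return nms(rgb_proposals + flow_proposals, thr)
```

A box detected only by the flow stream survives the merge, which is the mechanism by which one stream supplies missing proposals to the other.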
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and therefore can learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
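The role an actionness detector plays can be illustrated by the standard step that follows it: thresholding per-frame actionness scores and grouping consecutive action frames into temporal segments. The sketch below is a generic simplification (the 0.5 threshold is our assumption), not the actual refinement procedure of this chapter.

```python
def actionness_to_segments(scores, thr=0.5):
    """Group consecutive frames whose actionness score exceeds thr
    into (start, end) frame-index segments (end is inclusive)."""
    segments, start = [], None
    for t, s in enumerate(scores):
        if s >= thr and start is None:
            start = t                        # a segment opens here
        elif s < thr and start is not None:
            segments.append((start, t - 1))  # the segment closes
            start = None
    if start is not None:                    # segment runs to the last frame
        segments.append((start, len(scores) - 1))
    return segments
```

Because the segment boundaries come directly from where the scores cross the threshold, a detector trained to separate the subtle changes near boundaries yields noticeably sharper segments than a generic one.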
Our contributions are briefly summarized as follows:
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to object detection methods like [58], prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.
Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy will improve the frame-based features and filter out false starting/ending point pairs with invalid durations, and thus enhance the frame-based methods to generate better proposals having higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy will collect complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
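The idea of filtering out starting/ending point pairs with invalid durations can be sketched as follows. The pairing rule, the multiplicative score, and the duration bounds are our own illustrative assumptions rather than the actual mechanism of the framework.

```python
def pair_boundaries(starts, ends, min_dur=2, max_dur=100):
    """Form candidate segments from predicted starting/ending positions.

    starts, ends: lists of (frame_index, score) from a frame-based model.
    Pairs whose duration is non-positive or implausibly long are discarded;
    the remaining candidates are ranked by the product of the two scores.
    """
    candidates = []
    for s, s_score in starts:
        for e, e_score in ends:
            dur = e - s + 1
            if min_dur <= dur <= max_dur:  # keep plausible durations only
                candidates.append((s, e, s_score * e_score))
    return sorted(candidates, key=lambda c: -c[2])
```

Reversed pairs (an ending frame before a starting frame) are rejected by the duration check, which is the kind of false starting/ending pair the feature-level cooperation aims to suppress.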
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and it also updates the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in helping humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons often perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance relies heavily on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair consisting of a video clip and a textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].
Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes of reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage, two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
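The alternating two-branch refinement can be sketched with toy stand-ins for the two branches; the snapping and offset rules below are invented purely for illustration and are not the actual TBR-net heads.

```python
def frame_branch(seg, frame_scores, thr=0.5):
    """Toy frame-based branch: snap boundaries to high-score frames."""
    start, end = seg
    while start > 0 and frame_scores[start - 1] >= thr:
        start -= 1                       # expand left over action frames
    while start < end and frame_scores[start] < thr:
        start += 1                       # shrink past non-action frames
    while end < len(frame_scores) - 1 and frame_scores[end + 1] >= thr:
        end += 1                         # expand right over action frames
    while end > start and frame_scores[end] < thr:
        end -= 1                         # shrink past non-action frames
    return (start, end)

def anchor_branch(seg, offsets):
    """Toy anchor-based branch: apply regressed boundary offsets."""
    start, end = seg
    return (start + offsets[0], end + offsets[1])

def refine(seg, frame_scores, stage_offsets):
    """Alternate the branches, feeding each stage's output to the next."""
    for offsets in stage_offsets:        # one anchor + frame round per stage
        seg = anchor_branch(seg, offsets)
        seg = frame_branch(seg, frame_scores)
    return seg
```

The point of the alternation is that the anchor-style step moves a poor segment into roughly the right place, after which the frame-style step sharpens its boundaries, and each stage repeats this hand-off.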
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers, with nonlinear activation functions in between each layer, to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Units (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters
(weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work has stimulated the research interest in using deep learning methods to solve computer vision problems.
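The gradient vanishing issue mentioned above can be illustrated with a minimal sketch (pure Python; all function names here are illustrative): the sigmoid derivative is at most 0.25 and decays for large inputs, while the ReLU derivative is exactly 1 for any positive input, so gradients survive deep back-propagation.

```python
# Minimal sketch contrasting the sigmoid and ReLU activations
# discussed above; illustrative, not part of the AlexNet code.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # at most 0.25, vanishes for large |x|

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant 1 on the active side

# For a strongly activated unit, the sigmoid gradient is tiny while
# the ReLU gradient stays at 1, which is why ReLU alleviates the
# gradient vanishing issue in deep networks.
print(sigmoid_grad(10.0))  # ~4.5e-05
print(relu_grad(10.0))     # 1.0
```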
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
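The receptive-field equivalence above can be checked with a short sketch (pure Python, illustrative): for a chain of stride-1 convolutions, each k×k layer grows the receptive field by k−1, while the parameter count scales with k².

```python
def receptive_field(kernel_sizes):
    """Effective receptive field of a chain of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_params(kernel_sizes, channels):
    """Weights (ignoring biases) when every layer maps C channels to C channels."""
    return sum(k * k * channels * channels for k in kernel_sizes)

assert receptive_field([3, 3]) == receptive_field([5])      # two 3x3 == one 5x5
assert receptive_field([3, 3, 3]) == receptive_field([7])   # three 3x3 == one 7x7
# Fewer parameters for the same receptive field (C = 64 channels):
print(conv_params([7], 64), ">", conv_params([3, 3, 3], 64))
```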
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input to a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional
neural network is built by stacking such inception modules to achieve high image classification performance.
VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
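The shortcut (residual) connection can be sketched as follows; the toy transform standing in for the block's convolutional layers is illustrative, not the actual architecture from [20]. Because the output is F(x) + x, the gradient flows through the addition unchanged, bypassing the transform.

```python
def residual_block(x, transform):
    """y = F(x) + x: the identity shortcut lets information (and, in
    training, gradients) pass around the transform unchanged."""
    return [f + xi for f, xi in zip(transform(x), x)]

# Illustrative residual transform: a toy elementwise function standing
# in for the block's convolutional layers.
toy_transform = lambda x: [0.5 * xi for xi in x]

x = [1.0, 2.0, 3.0]
y = residual_block(x, toy_transform)
print(y)  # [1.5, 3.0, 4.5]

# Even if the transform contributes nothing, the input passes through:
assert residual_block(x, lambda v: [0.0] * len(v)) == x
```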
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections with the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.
2.1.2 Object Detection
The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were the candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head, to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.
To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed the entire image to the network only once to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time are largely reduced.
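The ROI pooling operation described above can be sketched in a few lines (pure Python, illustrative names): a region on the feature map is divided into a fixed grid of bins and each bin is max-pooled, so regions of different sizes all yield same-size outputs.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (y0, x0, y1, x1) of a 2D feature map
    (list of lists) into a fixed out_h x out_w grid."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Bin boundaries inside the ROI (at least one cell per bin).
            ys = y0 + i * h // out_h
            ye = max(ys + 1, y0 + (i + 1) * h // out_h)
            xs = x0 + j * w // out_w
            xe = max(xs + 1, x0 + (j + 1) * w // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

fm = [[r * 10 + c for c in range(6)] for r in range(6)]
# A 4x4 region and a 6x6 region both pool to 2x2 outputs.
print(roi_max_pool(fm, (0, 0, 4, 4), 2, 2))  # [[11, 13], [31, 33]]
print(roi_max_pool(fm, (0, 0, 6, 6), 2, 2))  # [[22, 25], [52, 55]]
```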
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined
anchors were used at each location of the generated feature map to generate region proposals through a separate network. The final detection results were obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN, as it does not require region feature extraction through the time-consuming ROI pooling operation.
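The grid assignment used by such one-stage detectors can be sketched as follows (pure Python, illustrative): the object whose center falls in a grid cell is predicted by that cell.

```python
def assign_cell(box, image_size, S):
    """Map a box (x0, y0, x1, y1) to the (row, col) cell of an S x S
    grid over a square image that contains the box's center point."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    row, col = int(S * cy / image_size), int(S * cx / image_size)
    # Clamp boxes whose center lies exactly on the bottom/right border.
    return min(row, S - 1), min(col, S - 1)

# A 448x448 image with a 7x7 grid (the setting used in Yolo v1).
print(assign_cell((100, 50, 200, 150), 448, 7))  # (1, 2)
```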
Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios are predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
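Decoding detections from such a center heat map can be sketched as follows (pure Python, illustrative simplification of the actual decoding): keep the local maxima above a threshold and read off the predicted width and height at those points.

```python
def decode_centers(heatmap, sizes, thresh):
    """Return (score, cx, cy, w, h) for every local maximum of a 2D
    heatmap (list of lists) whose score exceeds thresh; sizes[y][x]
    holds the predicted (w, h) at each location."""
    h, w = len(heatmap), len(heatmap[0])
    dets = []
    for y in range(h):
        for x in range(w):
            s = heatmap[y][x]
            if s < thresh:
                continue
            neighbours = [heatmap[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w]
            if all(s >= n for n in neighbours):  # local maximum
                bw, bh = sizes[y][x]
                dets.append((s, x, y, bw, bh))
    return dets

heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.2],
        [0.1, 0.2, 0.1]]
sizes = [[(4, 6)] * 3 for _ in range(3)]
print(decode_centers(heat, sizes, 0.5))  # [(0.9, 1, 1, 4, 6)]
```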
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, the research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict the action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance on the action recognition task, which demonstrates that the appearance and motion information are complementary to each other.
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the action recognition performance. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and
randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to consider the complementary information between the two modalities. Additionally, they also took advantage of the features extracted from the warped flow and the RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model for initialization and applying partial batch normalization can help keep the framework from overfitting the training samples.
2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared with the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics, and demonstrated that this can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from the human body parts can be exploited. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these
two modalities. Finally, action tubes were constructed by linking the detections generated in the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.
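The linking step can be sketched with a simple greedy version (illustrative; such works typically formulate linking as an optimal-path problem): consecutive detections are chained when they overlap sufficiently, scoring each link by the class score plus the spatial overlap.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tube(frames, iou_thresh=0.3):
    """Greedily link per-frame detections [(box, score), ...] into one
    tube, picking at each frame the detection that maximizes
    score + IoU with the previous box."""
    tube = [max(frames[0], key=lambda d: d[1])]
    for dets in frames[1:]:
        prev_box = tube[-1][0]
        cands = [d for d in dets if iou(prev_box, d[0]) >= iou_thresh]
        if not cands:
            break  # the tube ends when no detection overlaps enough
        tube.append(max(cands, key=lambda d: d[1] + iou(prev_box, d[0])))
    return tube

frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 1, 11, 11), 0.7), ((52, 50, 62, 60), 0.95)],
]
print([d[0] for d in link_tube(frames)])  # [(0, 0, 10, 10), (1, 1, 11, 11)]
```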
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to avoid the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes to form a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form the representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than the action tubes linked from frame-level detections, because they were produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for the actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of the untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a C3D-based classification network to predict labels. They then used the trained classification model to initialize a C3D model as a localization model, which took the proposals and their overlap scores as inputs, and proposed a new loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which were then used to determine the temporal boundaries of the actions for proposal generation.
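This frame-based scheme can be sketched as follows (pure Python, illustrative): frames whose actionness score exceeds a threshold are grouped into maximal consecutive runs, and each run becomes a segment proposal.

```python
def actionness_to_segments(scores, thresh):
    """Group consecutive frames with score >= thresh into
    (start, end) proposals, with end exclusive."""
    segments, start = [], None
    for t, s in enumerate(scores):
        if s >= thresh and start is None:
            start = t                      # a segment begins
        elif s < thresh and start is not None:
            segments.append((start, t))    # the segment ends
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments

scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.8, 0.2]
print(actionness_to_segments(scores, 0.5))  # [(2, 5), (6, 8)]
```

Note that a single mistakenly low score inside an action splits it into two proposals, which is exactly the low-recall failure mode discussed below.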
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce
the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced a new effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared with other state-of-the-art methods.
For the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as the boundaries are determined based on frame-wise or snippet-wise labels at a finer granularity with more temporal detail. However, these methods may wrongly predict low actionness scores for frames within the action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations and predict the action classes of the anchors.
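The anchor-based scheme can be sketched as follows (pure Python, illustrative): multi-scale temporal anchors are laid over the video, and each anchor is matched to a ground-truth segment by temporal IoU before its boundaries are regressed.

```python
def temporal_anchors(num_units, scales):
    """Center an anchor of each scale (in temporal units) at every unit."""
    return [(t - s / 2.0, t + s / 2.0)
            for t in range(num_units) for s in scales]

def tiou(a, b):
    """Temporal IoU between two segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

gt = (12.0, 20.0)
anchors = temporal_anchors(num_units=32, scales=[4, 8, 16])
# The anchor best matching the ground truth is the one a regressor
# would be trained to refine.
best = max(anchors, key=lambda a: tiou(a, gt))
print(best, round(tiou(best, gt), 3))  # (12.0, 20.0) 1.0
```

Because the pre-defined scales densely cover the video, most ground-truth instances are overlapped by some anchor, but an instance whose extent falls between scales is matched only coarsely, which is the boundary-accuracy limitation discussed below.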
Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build one base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the
prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict the offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that the extreme variation of action durations can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
For the second category, as the proposals are generated by the pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of the anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision and language task. Its goal is to localize the target objects in the given videos by using the textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes through a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatio-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link the proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model the temporal information. The cross-modality representations were then used to select the target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduced the existing methods for the image classification problem. Then we reviewed the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarized the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also reviewed the existing spatio-temporal action localization methods. Additionally, we reviewed the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information were also discussed. Finally, we gave a review of the new vision and language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train the other stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from the frames of the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help the other stream (i.e., the RGB/Flow stream) iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples, which helps learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to the other, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatio-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification, and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, including the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].
Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion, and pose clues for action classification and spatio-temporal action localization.
Third, temporal localization aims to form action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity, and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26], and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding box with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late fusion strategy: they are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for the other stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
An overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of the other stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
[Figure 3.1 here]

Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundary is further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
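The linking step can be illustrated with a minimal greedy sketch. This is a simplification of the linking strategy of [67], not its exact algorithm; the box format, scoring rule, and function names are assumptions for illustration:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link_tube(frames):
    """Greedily link per-frame detections into one action tube.

    `frames` is a list of per-frame detection lists [(box, score), ...].
    At each frame we pick the detection maximizing its class score plus
    its overlap with the previously linked box.
    """
    tube, prev = [], None
    for dets in frames:
        if not dets:
            continue  # robust to frames with missing detections
        best = max(dets, key=lambda d: d[1] + (iou(d[0], prev) if prev is not None else 0.0))
        tube.append(best)
        prev = best[0]
    return tube
```

In the full method one such tube is built per action class, and the temporal boundary refinement of Section 3.3 is then applied to each tube.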
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help the other stream.
In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.
The second subset is from the detection boxes of the other stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from the other stream (e.g., flow) are used for the current stream (e.g., RGB).
Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposals $P^{(i)}_t$ are updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection boxes $B^{(j)}_t$ are updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
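As a concrete illustration, the proposal update can be sketched in plain Python. The history layout, tuple-based box representation, and deduplication step are simplifying assumptions; the actual models operate on RPN and detection-head outputs with NMS:

```python
def update_proposals(history, stream, other, t):
    """Region-proposal cooperation: P_t(i) = B_{t-2}(i) U B_{t-1}(j).

    `history[s][k]` holds the detection boxes of stream `s` at stage `k`
    (boxes are plain tuples here). A real implementation would also apply
    NMS with a lower threshold to the cross-stream boxes.
    """
    own = history[stream].get(t - 2, [])    # B_{t-2} of this stream
    cross = history[other].get(t - 1, [])   # B_{t-1} of the other stream
    seen, merged = set(), []
    for box in own + cross:                 # union, keeping order, dropping duplicates
        if box not in seen:
            seen.add(box)
            merged.append(box)
    return merged
```

Each call produces the refined proposal set that the detection head $G(\cdot)$ of the current stage consumes.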
With our approach, the diversity of the region proposals for one stream is enhanced by using the complementary boxes from the other stream, which helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for the other stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data at the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples at the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f, and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2, C^i_3, C^i_4, C^i_5$, where $i \in \{\text{RGB}, \text{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2, U^i_3, U^i_4, U^i_5$.
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which are
insufficient for the two streams to exchange information and benefit from such an exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other through feature refinement.
We pass messages between the feature maps in the bottom-up pathways of the two streams. Let $l \in \{2, 3, 4, 5\}$ index the set of feature maps $\{C^i_2, C^i_3, C^i_4, C^i_5\}$. Let us take the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_l$ with the help of the features $C^{Flow}_l$ as follows:

$$\hat{C}^{RGB}_l = f_\theta(C^{Flow}_l) \oplus C^{RGB}_l, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{Flow}_l)$ nonlinearly extracts the message from the feature $C^{Flow}_l$, and the extracted message is then used to improve the features $C^{RGB}_l$.

The output of $f_\theta(\cdot)$ must have the same number of channels and the same resolution as $C^{RGB}_l$ and $C^{Flow}_l$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension to its original number. This design reduces the number of parameters to be learned in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C^{RGB}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{RGB}_l$ in the top-down pathway are generated accordingly.
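A minimal NumPy sketch of the message-passing module of Eqn. (3.1), with the two $1 \times 1$ convolutions expressed as per-location linear maps; the weight shapes and the absence of bias terms are simplifying assumptions:

```python
import numpy as np

def message_passing(c_flow, c_rgb, w1, w2):
    """Compute C_hat = f_theta(C_flow) (+) C_rgb for (C, H, W) feature maps.

    f_theta stacks two 1x1 convolutions with a ReLU in between:
    w1 of shape (C_mid, C) reduces the channel dimension and
    w2 of shape (C, C_mid) restores it.
    """
    msg = np.einsum('mc,chw->mhw', w1, c_flow)  # first 1x1 conv (reduce channels)
    msg = np.maximum(msg, 0.0)                  # ReLU
    msg = np.einsum('cm,mhw->chw', w2, msg)     # second 1x1 conv (restore channels)
    return msg + c_rgb                          # element-wise addition (the ⊕ in Eqn. 3.1)
```

The residual form means the module can only add information on top of the RGB features, so an uninformative flow message leaves the RGB pathway unchanged.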
The above process performs image-level message passing only. This image-level message passing is performed only once, from the flow stream to the RGB stream, and provides good features to initialize the subsequent message-passing stages.
Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the features $F^{RGB}_t$ and $F^{Flow}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{Flow}_t$ of the flow stream is used to help improve the ROI feature $F^{RGB}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{RGB}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{RGB}_t$ of the RGB stream is used to help improve the ROI feature $F^{Flow}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
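The IoU-based label assignment can be sketched as follows; the box format and helper names are illustrative:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def assign_labels(proposals, gt_boxes, thresh=0.5):
    """Label a proposal 1 if its best IoU with any ground-truth box exceeds
    `thresh`, else 0. The same rule is applied to the extra proposals from
    the assistant stream during region-proposal-level cooperation."""
    return [int(bool(gt_boxes) and max(iou(p, g) for g in gt_boxes) > thresh)
            for p in proposals]
```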
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build the action tubes by linking them. Here, we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
[Figure 3.2 here]

Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector (AD) and the Action Segment Detector (ASD) to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging the complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^i_t$, where the superscript $i \in \{\text{RGB}, \text{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7 × 7 feature maps for the detected region on each frame by ROI pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector that evaluates actionness (i.e., whether an action is happening) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework that detects action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization: it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These samples are therefore the "confusing" samples, and they are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of actionness belonging to class i.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from the action tube, so that we can refine the action tubes to have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across the temporal boundary to obtain better action tubes at the
testing stage.
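The test-time refinement described above can be sketched as follows; the window size and threshold here are illustrative defaults, not the tuned values used in the experiments:

```python
from statistics import median

def refine_tube(scores, window=3, thresh=0.5):
    """Median-filter the per-frame actionness scores of a tube, then keep
    only the frame indices whose smoothed score reaches the threshold."""
    half = window // 2
    smoothed = [median(scores[max(0, i - half): i + half + 1])
                for i in range(len(scores))]
    return [i for i, s in enumerate(smoothed) if s >= thresh]
```

The median filter suppresses isolated spikes and dips in the scores, so an action tube is trimmed at its boundaries rather than broken apart by a single noisy frame.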
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster R-CNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the Faster R-CNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction, and the detection head are introduced below.
(a) SPN. In Faster R-CNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction, based on anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
[Figure 3.3 here]
Figure 3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but different dilation rates is employed to generate features with different receptive fields and decide the temporal boundaries. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with the scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with the kernel size of 1. The output segment proposals are ranked according to their actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
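As a concrete illustration, the receptive field of the stacked temporal convolutions in each anchor-specific branch can be computed, and a dilation rate chosen so that an anchor of scale s is covered together with s/2 context on each side (a receptive field of about 2s). The helper functions below are a minimal sketch of this bookkeeping, not part of the actual implementation; the kernel size of 3 and the two-layer depth are illustrative assumptions.

```python
def receptive_field(layers):
    """Receptive field (in frames) of stacked stride-1 temporal convolutions.

    `layers` is a list of (kernel_size, dilation) pairs; each layer adds
    (kernel_size - 1) * dilation frames to the receptive field.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def dilation_for_anchor(scale, kernel_size=3, num_layers=2):
    """Pick the smallest dilation rate so that `num_layers` temporal
    convolutions cover the anchor span `scale` plus scale/2 of context on
    each side, i.e. a receptive field of at least 2 * scale."""
    target = 2 * scale
    d = 1
    while receptive_field([(kernel_size, d)] * num_layers) < target:
        d += 1
    return d
```

Under these assumptions, a larger anchor scale simply maps to a proportionally larger dilation rate, which is why each of the K sub-networks can share the same two-layer architecture.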
(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with the length ls, the start point tstart and the end point tend, we extract three
Chapter 3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
types of features: Fcenter, Fstart and Fend. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features Fcenter ∈ R^(D×lt), where lt is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features Fstart (resp. Fend) from the segments centered at tstart (resp. tend) with the length of 2ls/5 to represent the starting (resp. ending) information. We then concatenate Fstart, Fcenter and Fend to construct the segment features Fseg.
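The segment feature construction above can be sketched as follows. This is a simplified NumPy illustration: max-pooling into equal temporal bins stands in for temporal ROI-Pooling, and all function names are our own.

```python
import numpy as np


def temporal_roi_pool(features, t_start, t_end, out_len):
    """Simplified temporal ROI-Pooling: max-pool a (D x T) feature map over
    `out_len` equal temporal bins spanning [t_start, t_end), clamped to the
    valid range."""
    D, T = features.shape
    t_start, t_end = max(0, t_start), min(T, max(t_end, t_start + 1))
    bins = np.linspace(t_start, t_end, out_len + 1)
    pooled = np.empty((D, out_len))
    for i in range(out_len):
        lo = int(bins[i])
        hi = max(lo + 1, int(np.ceil(bins[i + 1])))
        pooled[:, i] = features[:, lo:min(hi, T)].max(axis=1)
    return pooled


def segment_features(tube_features, t_start, t_end, out_len):
    """Build F_seg = [F_start; F_center; F_end] for one segment proposal,
    pooling 2*ls/5 of context around each boundary as described above."""
    ls = t_end - t_start
    ctx = ls * 2 // 5  # context length 2*ls/5 around each boundary
    f_center = temporal_roi_pool(tube_features, t_start, t_end, out_len)
    f_start = temporal_roi_pool(tube_features, t_start - ctx // 2,
                                t_start + ctx // 2, out_len)
    f_end = temporal_roi_pool(tube_features, t_end - ctx // 2,
                              t_end + ctx // 2, out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)  # D x (3*out_len)
```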
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with the kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting the actionness for each frame, which eventually helps refine the boundaries of segments.
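For illustration, assigning a segment proposal to one of the M = 4 detection-head branches by its temporal length could look like the following sketch, using for concreteness the length ranges later specified in Section 3.4.1; the half-open treatment of the range boundaries is our own assumption.

```python
# Hypothetical grouping of segment proposals into the M = 4 detection-head
# branches by temporal length (in frames).
LENGTH_RANGES = [(20, 80), (80, 160), (160, 320), (320, float("inf"))]


def branch_index(proposal_length):
    """Return the index of the detection-head sub-network whose length range
    contains the proposal, or None if the proposal is shorter than the
    smallest range."""
    for m, (lo, hi) in enumerate(LENGTH_RANGES):
        if lo <= proposal_length < hi:
            return m
    return None
```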
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve the action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
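The two-stage control flow described above can be summarized in a short sketch, where refine_features, detect and nms are hypothetical stand-ins for the feature cooperation (message passing) block, the detection head and non-maximum suppression, respectively.

```python
# A minimal control-flow sketch of the two-stage T-PCSC described above.
# The three callables are hypothetical stand-ins, not the real modules.
def t_pcsc(initial_rgb, initial_flow, refine_features, detect, nms):
    outputs = []
    # Stage 1: flow -> RGB cooperation.
    proposals = initial_rgb + initial_flow                 # proposal cooperation
    rgb_feats = refine_features("rgb", "flow", proposals)  # feature cooperation
    stage1 = detect("rgb", rgb_feats, proposals)
    outputs.extend(stage1)
    # Stage 2: RGB -> flow cooperation, reusing the stage-1 output segments.
    proposals = initial_flow + stage1
    flow_feats = refine_features("flow", "rgb", proposals)
    outputs.extend(detect("flow", flow_feats, proposals))
    # Combine the output segments of both stages and remove redundancy.
    return nms(outputs)
```

With toy stand-ins (e.g. an identity feature refiner and a detector that returns its proposals), the function simply unions the two streams' segments and deduplicates them, which mirrors the data flow, not the learned behavior.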
3.4 Experimental results
We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2 and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes, which is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered as a correct one when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with the interval of 0.05.
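For concreteness, the frame-level IoU and the COCO-style averaging over IoU thresholds can be sketched as follows; map_at is a hypothetical callable that returns the mAP at a given threshold.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def coco_style_map(map_at):
    """Average a per-threshold mAP function over the 10 IoU thresholds
    0.5, 0.55, ..., 0.95, as in the COCO evaluation metric [40]."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(map_at(t) for t in thresholds) / len(thresholds)
```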
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging the averaged IoU scores over all classes.
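The MABO computation described above can be sketched as follows; the input layout (per class, one list of detection IoUs per ground-truth box) is our own illustrative choice.

```python
def mabo(per_class_results):
    """Mean Average Best Overlap.

    `per_class_results` maps each class to a list with one entry per
    ground-truth box, each entry being the IoUs of the detections against
    that box. For every ground-truth box we keep the best IoU, average
    those per class (ABO), then average the ABOs over all classes.
    """
    abo_per_class = []
    for cls, gt_iou_lists in per_class_results.items():
        best = [max(ious) for ious in gt_iou_lists if ious]
        abo_per_class.append(sum(best) / len(best))
    return sum(abo_per_class) / len(abo_per_class)
```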
To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is decreased by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180 and 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320] and [320, ∞), i.e., we have M = 4, and the corresponding feature lengths lt for these length ranges are set to 15, 30, 60 and 120, respectively. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, in which we report the results obtained by only using the single-stream (i.e., RGB or flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules, while the work in [38] does not have them. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.
Method                                   Streams    mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]   RGB+Flow   35.8
Peng and Schmid [54]                     RGB+Flow   65.7
Ye, Yang and Tian [94]                   RGB+Flow   67.0
Kalogeiton et al. [26]                   RGB+Flow   69.5
Gu et al. [19]                           RGB+Flow   76.3
Faster R-CNN + FPN (RGB) [38]            RGB        73.2
Faster R-CNN + FPN (Flow) [38]           Flow       64.7
Faster R-CNN + FPN [38]                  RGB+Flow   75.5
PCSC (Ours)                              RGB+Flow   79.2
Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.
Method                            Streams         δ=0.2   δ=0.5   δ=0.75   0.5:0.95
Weinzaepfel et al. [83]           RGB+Flow        46.8    -       -        -
Zolfaghari et al. [105]           RGB+Flow+Pose   47.6    -       -        -
Peng and Schmid [54]              RGB+Flow        73.5    32.1    2.7      7.3
Saha et al. [60]                  RGB+Flow        66.6    36.4    0.8      14.4
Yang, Gao and Nevatia [93]        RGB+Flow        73.5    37.8    -        -
Singh et al. [67]                 RGB+Flow        73.5    46.3    15.0     20.4
Ye, Yang and Tian [94]            RGB+Flow        76.2    -       -        -
Kalogeiton et al. [26]            RGB+Flow        76.5    49.2    19.7     23.4
Li et al. [31]                    RGB+Flow        77.9    -       -        -
Gu et al. [19]                    RGB+Flow        -       59.9    -        -
Faster R-CNN + FPN (RGB) [38]     RGB             77.5    51.3    14.2     21.9
Faster R-CNN + FPN (Flow) [38]    Flow            72.7    44.2    7.4      17.2
Faster R-CNN + FPN [38]           RGB+Flow        80.1    53.2    15.9     23.7
PCSC + AD [69]                    RGB+Flow        84.3    61.0    23.0     27.8
PCSC + ASD + AD + T-PCSC (Ours)   RGB+Flow        84.7    63.1    23.6     28.9
frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary
work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve the mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.
Method                           Streams    mAP (%)
Peng and Schmid [54]             RGB+Flow   58.5
Ye, Yang and Tian [94]           RGB+Flow   63.2
Kalogeiton et al. [26]           RGB+Flow   65.7
Hou, Chen and Shah [21]          RGB        61.3
Gu et al. [19]                   RGB+Flow   73.3
Sun et al. [72]                  RGB+Flow   77.9
Faster R-CNN + FPN (RGB) [38]    RGB        64.7
Faster R-CNN + FPN (Flow) [38]   Flow       68.4
Faster R-CNN + FPN [38]          RGB+Flow   70.2
PCSC (Ours)                      RGB+Flow   74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.
Method                            Streams         δ=0.2   δ=0.5   δ=0.75   0.5:0.95
Gkioxari and Malik [18]           RGB+Flow        -       53.3    -        -
Wang et al. [79]                  RGB+Flow        -       56.4    -        -
Weinzaepfel et al. [83]           RGB+Flow        63.1    60.7    -        -
Saha et al. [60]                  RGB+Flow        72.6    71.5    43.3     40.0
Peng and Schmid [54]              RGB+Flow        74.1    73.1    -        -
Singh et al. [67]                 RGB+Flow        73.8    72.0    44.5     41.6
Kalogeiton et al. [26]            RGB+Flow        74.2    73.7    52.1     44.8
Ye, Yang and Tian [94]            RGB+Flow        75.8    73.8    -        -
Zolfaghari et al. [105]           RGB+Flow+Pose   78.2    73.4    -        -
Hou, Chen and Shah [21]           RGB             78.4    76.9    -        -
Gu et al. [19]                    RGB+Flow        -       78.6    -        -
Sun et al. [72]                   RGB+Flow        -       80.1    -        -
Faster R-CNN + FPN (RGB) [38]     RGB             74.5    73.9    54.1     44.2
Faster R-CNN + FPN (Flow) [38]    Flow            77.9    76.1    56.1     46.3
Faster R-CNN + FPN [38]           RGB+Flow        79.1    78.5    57.2     47.6
PCSC (Ours)                       RGB+Flow        82.6    82.2    63.1     52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
Stage   PCSC w/o feature cooperation   PCSC
0       75.5                           78.1
1       76.1                           78.6
2       76.4                           78.8
3       76.7                           79.1
4       76.7                           79.2
Table 3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.
Method                       Video mAP (%)
Faster R-CNN + FPN [38]      53.2
PCSC                         56.4
PCSC + AD                    61.0
PCSC + ASD (single branch)   58.4
PCSC + ASD                   60.7
PCSC + ASD + AD              61.9
PCSC + ASD + AD + T-PCSC     63.1
Table 3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
Stage   Video mAP (%)
0       61.9
1       62.8
2       63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach of our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from both the RGB and the flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, such improvement seems to become saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both the region-proposal-level and the feature-level cooperation contribute to the performance improvement.
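The stage-wise combination described above (the union of the boxes from Stages 0..n followed by NMS) can be sketched as follows; the greedy NMS and the (score, x1, y1, x2, y2) box layout are standard but simplified assumptions.

```python
def nms(boxes, iou_threshold=0.5):
    """Greedy non-maximum suppression over (score, x1, y1, x2, y2) tuples:
    keep boxes in descending score order, dropping any box whose IoU with
    an already-kept box exceeds the threshold."""
    def iou(a, b):
        ix = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        iy = max(0.0, min(a[4], b[4]) - max(a[2], b[2]))
        inter = ix * iy
        area = lambda r: (r[3] - r[1]) * (r[4] - r[2])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    keep = []
    for box in sorted(boxes, key=lambda b: b[0], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in keep):
            keep.append(box)
    return keep


def combine_stages(per_stage_boxes, iou_threshold=0.5):
    """Output at stage n: NMS over the union of the detected boxes from
    stages 0..n."""
    union = [b for stage in per_stage_boxes for b in stage]
    return nms(union, iou_threshold)
```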
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector on the UCF-101-24 dataset.
Stage   LoR (%)
1       61.1
2       71.9
3       95.2
4       95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC has already been able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results could be complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 by using our temporal PCSC method at different stages to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector on the UCF-101-24 dataset.
Stage   MABO (%)
1       84.1
2       84.3
3       84.5
4       84.6
Table 3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
θ      Stage 1   Stage 2   Stage 3   Stage 4
0.5    55.3      55.7      56.3      57.0
0.6    62.1      63.2      64.3      65.5
0.7    68.0      68.5      69.4      69.5
0.8    73.9      74.5      74.8      75.1
0.9    78.4      78.4      79.5      80.1
Table 3.11 Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
Stage                   0     1     2     3     4
Detection speed (FPS)   3.1   2.9   2.8   2.6   2.4
Table 3.12 Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
Stage                   0     1     2
Detection speed (VPS)   2.1   1.7   1.5
ASD + AD" method, while Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using the one-stage and the two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate the similarity change of the proposals generated from the two streams over multiple stages, for each proposal from the RGB stream, we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
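The LoR computation can be sketched as follows; the (x1, y1, x2, y2) box layout and the function names are our own illustrative choices.

```python
def large_overlap_ratio(rgb_proposals, flow_proposals, threshold=0.7):
    """Fraction of RGB-stream region proposals whose maximum IoU (mIoU)
    over all flow-stream proposals is larger than `threshold`.
    Boxes are (x1, y1, x2, y2) tuples."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    hits = sum(1 for r in rgb_proposals
               if max((iou(r, f) for f in flow_proposals), default=0.0) > threshold)
    return hits / len(rgb_proposals) if rgb_proposals else 0.0
```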
To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be observed clearly, which demonstrates the effectiveness of our PCSC method for both localization and classification.
Figure 3.4 An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes when using different numbers of stages, and the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4(a)) or false detections (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).
Moreover, the effectiveness of our temporal refinement method is also demonstrated in Fig. 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).
While our method can remove some falsely detected bounding boxes as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases where our method does not work (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections from the background frames is still an open issue, which will be studied in our future work.
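The removal behavior described above amounts to a simple score threshold: once TBR has rescored a detection's actionness, low-scoring boxes are dropped. The following is an illustrative sketch rather than the thesis implementation; the function name, the (label, score) tuple layout and the 0.5 threshold are assumptions.

```python
# Illustrative sketch (not the thesis implementation) of the removal step:
# after TBR rescores each detection's actionness, any detection whose
# refined score falls below a threshold is discarded.

def filter_by_actionness(detections, threshold=0.5):
    """Keep detections whose refined actionness score passes the threshold."""
    return [d for d in detections if d[1] >= threshold]

# The falsely detected box in Fig. 3.5(b) drops from 0.74 to 0.04 after TBR
# and is removed, while a box that keeps a relatively high score (as in the
# failure case of Fig. 3.5(d)) survives the filter.
dets = [("Tennis Swing", 0.04), ("Basketball", 0.71)]
print(filter_by_actionness(dets))  # only the high-scoring detection remains
```

This also makes the failure case in Fig. 3.5(d) concrete: a false detection whose rescored actionness remains high cannot be separated from true positives by any fixed threshold.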
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited RGB frames and optical flow maps to extract appearance and motion features, respectively, and then combined the two streams to exchange complementary information from both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, our temporal action localization method employs the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories, with both their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance along the time axis of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. For the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. For the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually achieve lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end to end and also allows rich feature-level and segment-level interactions between action semantics extracted from these two complementary lines of works, which offers a significant performance gain to our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at both the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach
In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.
The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.
Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.
Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in the later text), i.e., $\mathcal{P}^{RGB} = \{p_i^{RGB} \,|\, p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p_i^{Flow} \,|\, p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by applying two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$; these features can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$, $S_0^{Flow}$ and the predicted action categories.
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based features in each stream are then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
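The pairing-plus-NMS procedure just described can be sketched as follows. This is a minimal illustration rather than the thesis code: the confidence threshold, the product scoring rule and the NMS threshold are assumed values.

```python
# A minimal sketch (not the thesis code) of frame-based proposal generation:
# pair confident starting/ending positions into candidate segments, score
# each pair, and remove redundant pairs with greedy temporal NMS.

def tiou(a, b):
    """Temporal IoU between two segments a = (start, end) and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def generate_proposals(p_start, p_end, conf=0.5, nms_thresh=0.7):
    T = len(p_start)
    starts = [t for t in range(T) if p_start[t] >= conf]  # confident start positions
    ends = [t for t in range(T) if p_end[t] >= conf]      # confident end positions
    # pair each confident start with every later confident end (a valid duration)
    cands = [((s, e), p_start[s] * p_end[e]) for s in starts for e in ends if e > s]
    cands.sort(key=lambda c: c[1], reverse=True)
    kept = []  # greedy temporal NMS: keep a segment unless it overlaps a kept one
    for seg, _ in cands:
        if all(tiou(seg, k) < nms_thresh for k in kept):
            kept.append(seg)
    return kept

p_start = [0.9, 0.1, 0.1, 0.1, 0.1]
p_end = [0.1, 0.1, 0.1, 0.8, 0.9]
print(generate_proposals(p_start, p_end))  # the two overlapping pairings collapse to one segment
```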
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation
[Figure 4.1: (a) overview of the Progressive Cross-granularity Cooperation framework, with RGB-stream AFC modules at Stages 1 and 3 and flow-stream AFC modules at Stages 2 and 4; (b) the RGB-stream AFC module and (c) the flow-stream AFC module, each containing cross-granularity and cross-stream message passing, frame-based segment proposal generation, proposal combination, and a detection head.]
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$ the initial proposals $S_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $S_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-Granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).
To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as RGB-stream AFC and flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based features $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $S_n^{RGB}$ and anchor-based features $P_n^{RGB}$, and also updates the flow-stream frame-based features as $Q_n^{Flow}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively pass the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since such rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages; thus, the proposed progressive localization strategy can gradually improve the localization performance.
4.2.3 Anchor-Frame Cooperation Module
The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, named according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $S_0^{RGB}$, $S_0^{Flow}$, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as
$$Q_n^{Flow} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$
where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved features $Q_n^{Flow}$ can generate more accurate action starting and ending probability distributions (i.e., more accurate action durations), as they absorb the temporal duration-aware anchor-based features; they are therefore used to generate better frame-based proposals $\mathcal{Q}_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ similarly leads to better frame-based RGB-stream proposals $\mathcal{Q}_n^{RGB}$.
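The message function and the element-wise fusion of Eq. (4.1) can be sketched in a few lines of numpy. This is a hedged illustration: the random weights stand in for the learned parameters of the two stacked $1 \times 1$ convolutions, and the feature sizes are arbitrary.

```python
# A hedged numpy sketch of Eq. (4.1): the message function M is two stacked
# 1x1 convolutions (per-time-step channel-mixing matrices) with ReLU
# activations, and the message is fused into the frame-based features by
# element-wise addition. Random weights stand in for learned parameters.
import numpy as np

def conv1x1(x, w):
    # a 1x1 temporal convolution mixes channels independently at each
    # temporal position: (C_out, C_in) @ (C_in, T) -> (C_out, T)
    return w @ x

def message_passing(p_anchor, q_frame, w1, w2):
    """Q_n = M(P_{n-1}) (+) Q_{n-2}, with M = two stacked 1x1 convs + ReLU."""
    m = np.maximum(conv1x1(p_anchor, w1), 0.0)  # first 1x1 conv + ReLU
    m = np.maximum(conv1x1(m, w2), 0.0)         # second 1x1 conv + ReLU
    return m + q_frame                          # element-wise addition

C, T = 8, 16
rng = np.random.default_rng(0)
p_flow = rng.standard_normal((C, T))  # anchor-based features P_{n-1}
q_flow = rng.standard_normal((C, T))  # frame-based features Q_{n-2}
w1, w2 = rng.standard_normal((C, C)), rng.standard_normal((C, C))
q_new = message_passing(p_flow, q_flow, w1, w2)
assert q_new.shape == (C, T)  # the improved frame-based features Q_n
```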
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). Again take the RGB-stream AFC module shown in Fig. 4.1(b) as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1, based on the improved frame-based features $Q_n^{Flow}$. We then combine $\mathcal{Q}_n^{Flow}$ with the two-stream anchor-based proposals $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $\mathcal{P}_n^{RGB} = S_{n-2}^{RGB} \cup S_{n-1}^{Flow} \cup \mathcal{Q}_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}_n^{Flow} = S_{n-2}^{Flow} \cup S_{n-1}^{RGB} \cup \mathcal{Q}_n^{RGB}$.
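The proposal combination step amounts to a set union over the three sources. A minimal sketch, where segments are (start, end) tuples; exact-duplicate removal via a set is an illustrative simplification (redundant but non-identical segments are handled later by NMS in the detection head).

```python
# An illustrative sketch of "Proposal Combination": the updated proposal set
# at stage n is the union of same-stream segments from stage n-2, cross-stream
# segments from stage n-1, and the newly generated frame-based proposals.

def combine_proposals(same_stream_prev, cross_stream_prev, frame_based):
    combined = set(same_stream_prev) | set(cross_stream_prev) | set(frame_based)
    return sorted(combined)

s_rgb = [(2.0, 6.5), (10.0, 14.0)]   # S_{n-2} from the RGB stream
s_flow = [(2.0, 6.5), (21.0, 25.5)]  # S_{n-1} from the flow stream
q_flow = [(9.8, 14.2)]               # frame-based proposals from the flow stream
print(combine_proposals(s_rgb, s_flow, q_flow))  # the shared (2.0, 6.5) segment appears once
```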
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to equation (4.1) by a message passing operation, i.e., $P_n^{RGB} = \mathcal{M}_{flow \rightarrow rgb}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream and $P_n^{Flow} = \mathcal{M}_{rgb \rightarrow flow}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.
Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}_n^{RGB}$, $\mathcal{P}_n^{Flow}$ and the updated anchor-based features $P_n^{RGB}$, $P_n^{Flow}$. To increase the perceptual aperture at the action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (in this work it is fixed to 10), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
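The extended pooling described above can be sketched as follows. Mean pooling as the SOI operator and the function names are simplifying assumptions for illustration, not the actual SOI pooling of [5].

```python
# An illustrative numpy sketch of the extended pooling in the detection head:
# besides pooling over the proposal span itself, two extra pools around the
# starting and ending positions (with a fixed interval) are concatenated to
# widen the perceptual aperture at the boundaries.
import numpy as np

def pool(feat, lo, hi):
    lo, hi = max(0, lo), min(feat.shape[1], hi)  # clip to the feature extent
    return feat[:, lo:hi].mean(axis=1)

def proposal_feature(feat, start, end, interval=10):
    center = pool(feat, start, end)                            # span pooling
    around_start = pool(feat, start - interval, start + interval)
    around_end = pool(feat, end - interval, end + interval)
    # the concatenation is fed to the segment regressor and action classifier
    return np.concatenate([center, around_start, around_end])

feat = np.arange(2 * 100, dtype=float).reshape(2, 100)  # (C=2, T=100) features
f = proposal_feature(feat, 30, 60)
assert f.shape == (6,)  # three pooled (C,) vectors concatenated
```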
4.2.4 Training Details
At each AFC stage, the training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: the cross-entropy loss for the action classification task and the smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all the losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
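The two loss terms of the anchor-based branch can be illustrated as follows. This is a hedged numpy sketch: the unit weighting between the terms and the smooth-L1 transition point of 1.0 are assumptions, not the thesis's settings.

```python
# A hedged numpy illustration of the per-branch training objective: a
# cross-entropy term for action classification plus a smooth-L1 term for
# boundary regression, summed into one joint objective.
import numpy as np

def cross_entropy(logits, label):
    z = logits - logits.max()               # numerically stabilized softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def smooth_l1(pred, target):
    d = np.abs(pred - target)
    # quadratic near zero, linear for large residuals
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def anchor_branch_loss(cls_logits, label, boundary_pred, boundary_gt):
    return cross_entropy(cls_logits, label) + smooth_l1(boundary_pred, boundary_gt)

loss = anchor_branch_loss(
    np.array([2.0, 0.1, -1.0]), 0,                   # classification logits / label
    np.array([10.2, 20.4]), np.array([10.0, 20.0]),  # predicted / ground-truth boundaries
)
assert loss > 0.0
```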
4.2.5 Discussions
Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] operates in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals become similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, which significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase as the number of stages grows, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. Nor does the number of proposals decrease, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.
4.3 Experiment
4.3.1 Datasets and Setup
Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1,010 and 1,574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.
- UCF-101-24 consists of 3,207 untrimmed videos from 24 action classes, with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We treat these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pre-trained on Kinetics as the feature extractor to extract the base features. For training, we set the minibatch size to 256 for the anchor-based proposal generation network and to 64 for the detection head. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold $\delta$ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
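The correctness criterion can be sketched directly: a detection counts only if both the label matches and the temporal IoU clears the threshold $\delta$. A minimal illustration with assumed names:

```python
# A minimal sketch of the evaluation criterion: a detection is correct only
# if its temporal IoU with a ground-truth instance exceeds the threshold
# delta AND its predicted action label matches.

def tiou(a, b):
    """Temporal IoU between segments a = (start, end) and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_seg, pred_label, gt_seg, gt_label, delta=0.5):
    return pred_label == gt_label and tiou(pred_seg, gt_seg) > delta

# tIoU = 6 / 8 = 0.75 with a matching label, so the detection counts
print(is_correct((2.0, 8.0), "CliffDiving", (2.0, 10.0), "CliffDiving"))  # True
```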
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL method outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues. When using the tIoU threshold $\delta = 0.5$,
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ    0.1   0.2   0.3   0.4   0.5
SCNN [63]           47.7  43.5  36.3  28.7  19.0
REINFORCE [95]      48.9  44.0  36.0  26.4  17.1
CDC [62]            -     -     40.1  29.4  23.3
SMS [98]            51.0  45.2  36.5  27.8  17.8
TURN [15]           54.0  50.9  44.1  34.9  25.6
R-C3D [85]          54.5  51.5  44.8  35.6  28.9
SSN [103]           66.0  59.4  51.9  41.0  29.8
BSN [37]            -     -     53.5  45.0  36.9
MGG [46]            -     -     53.9  46.8  37.4
BMN [35]            -     -     56.0  47.4  38.8
TAL-Net [5]         59.8  57.1  53.2  48.5  42.8
P-GCN [99]          69.5  67.8  63.6  57.8  49.1
Ours                71.2  68.9  65.1  59.5  51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot fully capture the complementary information between these two paradigms (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.
ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ ∈ {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging all mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 6.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and
Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ    0.5     0.75    0.95    Average
CDC [62]            45.30   26.00   0.20    23.80
TCN [9]             36.44   21.15   3.90    -
R-C3D [85]          26.80   -       -       -
SSN [103]           39.12   23.48   5.49    23.98
TAL-Net [5]         38.23   18.30   1.30    20.22
P-GCN [99]          42.90   28.14   2.47    26.99
Ours                44.31   29.85   5.47    28.85
Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video-level labels from UntrimmedNet [80] are applied.

tIoU threshold δ    0.5     0.75    0.95    Average
BSN [37]            46.45   29.96   8.02    30.03
P-GCN [99]          48.26   33.16   3.27    31.11
BMN [35]            50.07   34.78   8.29    33.85
Ours                52.04   35.92   7.97    34.91
BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and the resulting average mAP (34.91%) is better than those of BSN and BMN.
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ ∈ {0.2, 0.5, 0.75}, as well as the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ    0.2    0.5    0.75   0.5:0.95
LTT [83]              46.8   -      -      -
MRTSR [54]            73.5   32.1   2.7    7.3
MSAT [60]             66.6   36.4   7.9    14.4
ROAD [67]             73.5   46.3   15.0   20.4
ACT [26]              76.5   49.2   19.7   23.4
TAD [19]              -      59.9   -      -
PCSC [69]             84.3   61.0   23.0   27.8
Ours                  84.9   63.5   23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the “Cross-stream Message Passing” block from our AFC module; (2) w/o CG-MP: we remove the “Cross-granularity Message Passing” block from our AFC module; (3) w/o PC: we remove the “Proposal Combination” block from our AFC module, in which Q_n^Flow and Q_n^RGB are directly used as the input to the “Detection Head” in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the “Cross-granularity Message Passing” block and the “Detection Head” in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, in which case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the “Proposal Combination” block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods “w/o CS” and “w/o CS-MP” have the same network structure in each AFC module, except that the input anchor-based features to the “Cross-granularity Message Passing” block and the “Detection Head” are respectively P_{n-1}^Flow and P_{n-2}^RGB (for the RGB-stream AFC module) and
Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage        0       1       2       3       4
AFC          44.75   46.92   48.14   48.37   48.22
w/o CS       44.21   44.96   45.47   45.48   45.37
w/o CG       43.55   45.95   46.71   46.65   46.67
w/o CS-MP    44.75   45.34   45.51   45.53   45.49
w/o CG-MP    44.75   46.32   46.97   46.85   46.79
w/o PC       42.14   43.21   45.14   45.74   45.77
P_{n-1}^RGB and P_{n-2}^Flow (for the flow-stream AFC module) in “w/o CS-MP”, while the inputs are the same (i.e., the aggregated features) in “w/o CS”. The discussions of the alternative methods “w/o CS” and “w/o CG” are also provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.
In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC modules. The initial proposals for “w/o CS-MP” and “w/o CG-MP” are the same as those for our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P_0^RGB ∪ P_0^Flow for “w/o CG” and Q_0^RGB ∪ Q_0^Flow for “w/o PC”. We have the following observations: (1) the two alternative methods “w/o CS” and “w/o CS-MP” achieve comparable results as the number of stages increases, since these two methods share the same network structure; (2) our complete method outperforms “w/o CS-MP”, “w/o CG-MP” and “w/o PC”, which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method “w/o CG”, which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the “Detection Head” of the anchor-based approach; (4) in the first few stages, the results of all methods are generally improved
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage    Comparison                  LoR (%)
1        P_1^RGB vs P_0^Flow         67.1
2        P_2^Flow vs P_1^RGB         79.2
3        P_3^RGB vs P_2^Flow         89.5
4        P_4^Flow vs P_3^RGB         97.1
as the number of stages increases, and the results become stable after two or three stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-overlap Ratio (LoR) score. Take the proposal sets P_1^RGB at Stage 1 and P_2^Flow at Stage 2 as an example: the LoR score at Stage 2 is the ratio of the similar proposals in P_2^Flow among all proposals in P_2^Flow. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposal set P_1^RGB is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P_2^Flow) is calculated by comparing it with all proposals in the other proposal set (say P_1^RGB) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
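Under this definition, the LoR score can be sketched as follows (a minimal sketch; function and variable names are ours):

```python
def tiou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(proposals, preceding_proposals, threshold=0.7):
    """Ratio of proposals whose max-tIoU against the preceding proposal set
    exceeds the threshold (i.e., the fraction of 'similar' proposals)."""
    if not proposals:
        return 0.0
    similar = sum(
        1 for p in proposals
        if max(tiou(p, q) for q in preceding_proposals) > threshold
    )
    return similar / len(proposals)
```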
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in P_4^Flow and those in P_3^RGB are very similar, even though they are generated from different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal
action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) at different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
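The AR@AN computation described above can be sketched as follows (a simplified sketch; names are ours, and proposals are assumed to be sorted by confidence):

```python
def tiou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(proposals, ground_truths, delta):
    """Fraction of ground-truth segments matched by at least one proposal at tIoU > delta."""
    if not ground_truths:
        return 0.0
    matched = sum(1 for g in ground_truths if any(tiou(p, g) > delta for p in proposals))
    return matched / len(ground_truths)

def average_recall(proposals, ground_truths, an=100):
    """AR@AN: recall averaged over the eleven tIoU thresholds 0.5, 0.55, ..., 1.0,
    using only the top-AN proposals."""
    top = proposals[:an]
    thresholds = [0.5 + 0.05 * i for i in range(11)]
    return sum(recall_at_tiou(top, ground_truths, d) for d in thresholds) / len(thresholds)
```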
In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the “FB” method and the “AB” method to generate action segments. Note that the baseline methods “FB”, “AB” and “SC” are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown to be the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the “SC” method. We observe that the action segments generated by the “AB” method are better than those from the “FB” method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the “SC” method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.
Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. The true action instance of the action class “Throw Discus”, with the starting and ending frames at the 97.7th and the 104.2th second of the video clip, is marked in the red box. More frames during the period from the 95.4th second to the 97.7th second are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The frames during the period from the 104.2th second to the 104.6th second are shown in the green boxes, where the actor has just finished the action and starts to straighten up. In this example, the anchor-based method detects the action segment (95.4s-104.6s) with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method progressively improves the temporal boundaries of the action segment (96.4s-104.6s at Stage 1, 97.0s-104.5s at Stage 2, and 97.3s-104.3s at Stage 3).
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by the anchor-based method [5], the frame-based method [37], and our
PCG-TAL method at different numbers of stages. Although the anchor-based method successfully captures the action instance, it generates an action segment with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method takes advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets demonstrate the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. Most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102] require a pre-trained object detector to pre-generate object proposals, and the proposal that best matches the given input description is then selected as the final result. For the visual grounding task in images, some recent works [92, 34, 48, 91] proposed one-stage grounding frameworks that do not use a pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on cross-modal representations. For the video grounding task, Zhang et
al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopts a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology
5.2.1 Overview
We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely V = {V_k^clip}_{k=1}^K, where V_k^clip indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^N, where s_n indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (i.e., the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.
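The notation above maps onto a simple data structure; a minimal sketch (class and field names are ours, for illustration only):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ObjectTube:
    """Spatio-temporal tube B = {b_t} for t in [t_s, t_e]."""
    t_s: int  # starting frame index
    t_e: int  # ending frame index
    # t -> b_t = (x1, y1, x2, y2): top-left and bottom-right box coordinates
    boxes: Dict[int, Tuple[float, float, float, float]]

    def __post_init__(self):
        # every frame between t_s and t_e must carry exactly one bounding box
        assert set(self.boxes) == set(range(self.t_s, self.t_e + 1))
```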
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework consisting of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes b_t for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual features. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and number of channels of the feature map, respectively, and is then employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from all frames in this video clip to construct the clip feature F_k^clip ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
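The reshape-and-stack step can be sketched as follows (shapes only; the function name is ours):

```python
import numpy as np

def clip_feature(frame_features):
    """Stack per-frame image-encoder features of shape (H, W, C) into the
    clip feature of shape (T, HW, C)."""
    H, W, C = frame_features[0].shape
    return np.stack([f.reshape(H * W, C) for f in frame_features], axis=0)
```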
For the textual description, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens, [‘CLS’] and [‘SEP’], before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
Multi-modal Feature Learning. Given the visual input feature F_k^clip and the textual input tokens E, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT to the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules in the visual branch of ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the “Multi-head Attention” and “Add & Norm” blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed into the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature F_tv ∈ R^{T×HW×C} and the visual-guided textual feature, respectively.
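The shape bookkeeping of the STCD module can be sketched with plain arrays (the attention itself is omitted; function names are ours, and the final normalization is a simplified stand-in for the Norm block):

```python
import numpy as np

def stcd_combine(visual):
    """visual: (T, HW, C) clip feature. Spatial and temporal average pooling
    produce a (T, C) temporal feature and an (HW, C) spatial feature, which are
    concatenated into the (T + HW, C) combined visual feature."""
    temporal = visual.mean(axis=1)   # pool over spatial positions -> (T, C)
    spatial = visual.mean(axis=0)    # pool over frames -> (HW, C)
    return np.concatenate([temporal, spatial], axis=0)

def stcd_decompose(text_guided, visual):
    """Split the (T + HW, C) text-guided feature back into temporal and spatial
    parts, replicate them to (T, HW, C), add to the input feature, and normalize."""
    T, HW, C = visual.shape
    temporal, spatial = text_guided[:T], text_guided[T:]
    out = visual + temporal[:, None, :] + spatial[None, :, :]
    mean = out.mean(axis=-1, keepdims=True)
    std = out.std(axis=-1, keepdims=True) + 1e-6
    return (out - mean) / std
```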
Spatial Localization. Given the text-guided visual feature F_tv, our SVG-net predicts the bounding box b_t for each frame. We first reshape F_tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap Â ∈ R^{8H×8W}, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and bottom-right coordinates of the predicted bounding box.
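The CenterNet-style decoding step described above can be sketched as follows (a simplified, single-box sketch in heatmap coordinates; names are ours):

```python
import numpy as np

def decode_box(heatmap, sizes):
    """Pick the peak of the center heatmap and read off the regressed size there.

    heatmap: (8H, 8W) center probabilities; sizes: (8H, 8W, 2) predicted (w, h).
    Returns the (x1, y1, x2, y2) corners of the predicted bounding box.
    """
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = sizes[cy, cx]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```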
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending
frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F_tv to produce the global text-guided visual feature F_gtv ∈ R^{T×C} for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from F_gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score ô_obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t⁰_s, t⁰_e). The initial temporal boundaries (t⁰_s, t⁰_e) and the predicted bounding boxes b_t form the initial tube prediction result B_init = {b_t}, t = t⁰_s, ..., t⁰_e.
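The post-processing that turns the per-frame relevance scores into initial temporal boundaries can be sketched as follows. The helper is our own illustration; the window size 9 and threshold 0.3 are the values reported later in the implementation details, and the longest surviving segment is kept as in the text:

```python
import numpy as np

def initial_boundaries(scores, win=9, thresh=0.3):
    """Median-smooth the per-frame relevance scores, threshold them, and
    return the (start, end) frame indices (inclusive) of the longest
    surviving segment, or None if no frame survives."""
    s = np.asarray(scores, dtype=float)
    pad = win // 2
    padded = np.pad(s, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + win]) for i in range(len(s))])
    keep = smoothed >= thresh
    best, start = None, None
    for t in range(len(keep) + 1):                 # extra step closes a tail run
        inside = t < len(keep) and keep[t]
        if inside and start is None:
            start = t                              # a kept segment begins
        elif not inside and start is not None:
            if best is None or (t - 1) - start > best[1] - best[0]:
                best = (start, t - 1)              # keep the longest segment
            start = None
    return best
```

On a toy score sequence, a short spurious peak is suppressed by the median filter, while the longest sustained run of high scores yields the initial boundaries.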
Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as (x, y), w, h and o, respectively. When the frame contains a ground-truth bounding box, we have o = 1; otherwise, we have o = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap A for each frame by using the Gaussian kernel
a_{x′y′} = exp(−((x′ − x)² + (y′ − y)²)/(2σ²)), where a_{x′y′} is the value of A ∈ R^{8H×8W} at the spatial location (x′, y′), and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:
L_SVG = λ1 L_c(Â, A) + λ2 L_rel(ô_obj, o) + λ3 L_size(ŵ_xy, ĥ_xy, w, h),    (5.1)

where Â ∈ R^{8H×8W} is the predicted heatmap, ô_obj is the predicted relevance score, and ŵ_xy and ĥ_xy are the predicted width and height of the bounding box centered at the position (x, y). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ1 = λ2 = 1 and λ3 = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
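Among these terms, the focal loss L_c follows [104], which in turn uses the penalty-reduced pixel-wise focal loss of CornerNet [28]. The numpy sketch below is our own; the exponents α = 2 and β = 4 are the common choices from those works rather than values stated in this chapter:

```python
import numpy as np

def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over the center heatmap, in
    the style of CornerNet/CenterNet.  pred and gt are (8H, 8W) arrays;
    gt is the Gaussian-smoothed ground-truth heatmap, with gt == 1 only
    at the annotated box center."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = gt == 1.0
    # Positive locations: down-weight already-confident predictions.
    pos_loss = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    # Negative locations: (1 - gt)^beta reduces the penalty near centers.
    neg_loss = ((1.0 - gt) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos].sum()
    n_pos = max(int(pos.sum()), 1)
    return -(pos_loss + neg_loss) / n_pos
```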
5.2.3 Temporal Boundary Refinement
The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.
As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., B_init) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
t_s and t_e. The frame-based branch then continues to refine the boundaries t_s and t_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in other stages are similar to those in the first stage.
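The resulting alternation between the two branches can be summarized with a short control-loop sketch; the two branch functions below are stand-ins for the learned ViLBERT-based branches, not the actual models:

```python
def refine_boundaries(t_s, t_e, anchor_branch, frame_branch, num_stages=3):
    """Multi-stage refinement in the spirit of the TBR-net: each stage runs
    a segment-level (anchor-based) update followed by a frame-level
    (frame-based) update, and feeds the result to the next stage."""
    for _ in range(num_stages):
        t_s, t_e = anchor_branch(t_s, t_e)  # coarse, whole-segment update
        t_s, t_e = frame_branch(t_s, t_e)   # fine, local boundary update
    return t_s, t_e
```

With toy branch functions, three stages move an initial guess progressively toward the target boundaries.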
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to predict the offsets and regress the target boundaries. Given an anchor with the temporal boundaries (t⁰_s, t⁰_e) and the length l = t⁰_e − t⁰_s, we temporally extend the anchor before and after the starting and the ending frames by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t⁰_s − l/4, t⁰_e + l/4] and generate the anchor feature F by stacking the corresponding features at the 20 sampled frames along the temporal dimension from the pooled video feature F_pool, where the pooled video feature F_pool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.
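The anchor extension and uniform sampling can be sketched as follows (our own helper; in practice the sampled indices would additionally be clipped to the valid frame range):

```python
import numpy as np

def anchor_sample_indices(t_s, t_e, num_samples=20):
    """Extend the anchor (t_s, t_e) by a quarter of its length on each
    side, then uniformly sample `num_samples` frame indices from the
    extended window [t_s - l/4, t_e + l/4]."""
    l = t_e - t_s
    lo, hi = t_s - l / 4.0, t_e + l / 4.0
    return np.linspace(lo, hi, num_samples).round().astype(int)

# The anchor feature F is then the stack of pooled per-frame features at
# these indices, e.g. F = F_pool[anchor_sample_indices(t_s, t_e)].
```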
The anchor feature F, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t⁰_s and t⁰_e and generate the new starting and ending frame positions t¹_s and t¹_e.
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the possibility of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t¹_s generated from the previous anchor-based branch, we define the starting duration as D_s = [t¹_s − D/2 + 1, t¹_s + D/2], where D is the length
of the starting duration. We then construct the frame-level feature F_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F_pool. We then take the frame-level feature F_s and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t¹_s. Accordingly, we can produce the ending frame position t¹_e. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t¹_s and t¹_e are generated, they can be used as the input for the anchor-based branch at the next stage.
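The local boundary selection performed by this branch can be sketched with a small helper (our own illustration; `frame_probs` stands for the 1D-convolution outputs over the window):

```python
import numpy as np

def select_boundary(frame_probs, t_ref, D):
    """Pick the most likely boundary frame inside the local window
    D_s = [t_ref - D/2 + 1, t_ref + D/2] around a refined position t_ref.
    frame_probs holds the per-frame boundary probabilities for that
    window, ordered from its left edge."""
    window_start = t_ref - D // 2 + 1
    return window_start + int(np.argmax(frame_probs))
```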
Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

L_TBR = λ4 L_anc(Δt̂, Δt) + λ5 L_s(p̂_s, p_s) + λ6 L_e(p̂_e, p_e),    (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and the ending frames, respectively. In L_anc, Δt̂ and Δt are the predicted offsets and the ground-truth offsets, respectively. In L_s, p̂_s is the predicted probability of being the starting frame for the frames within the starting duration, while p_s is the ground-truth label for starting frame position prediction. Similarly, we use p̂_e and p_e for the loss L_e. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.
5.3 Experiment
5.3.1 Experiment Setup
Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with the spatio-temporal tubes.
- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p, and the temporal length of each clip K are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames
between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and S_U is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
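The vIoU criterion can be computed directly from its definition; a small self-contained sketch follows (representing each tube as a dict mapping frame index to box is our own choice for the illustration):

```python
def viou(det, gt):
    """vIoU between a detected tube and a ground-truth tube, following
    vIoU = (1/|S_U|) * sum over t in S_I of r_t.  Each tube maps a frame
    index to a box (x1, y1, x2, y2)."""
    def box_iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0
    s_i = det.keys() & gt.keys()   # frames present in both tubes
    s_u = det.keys() | gt.keys()   # frames present in either tube
    return sum(box_iou(det[t], gt[t]) for t in s_i) / len(s_u)
```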
Baseline Methods
- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods
We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed Temporal Boundary Refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of most evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study
In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.
Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.
It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

Method                   | Declarative Sentence Grounding    | Interrogative Sentence Grounding
                         | m_vIoU(%) vIoU@0.3(%) vIoU@0.5(%) | m_vIoU(%) vIoU@0.3(%) vIoU@0.5(%)
STGRN* [102]             | 19.75     25.77       14.60       | 18.32     21.10       12.83
Vanilla (ours)           | 21.49     27.69       15.77       | 20.22     23.17       13.82
SVG-net (ours)           | 22.87     28.31       16.31       | 21.57     24.09       14.33
SVG-net + TBR-net (ours) | 25.21     30.99       18.21       | 23.67     26.26       16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                   | m_vIoU | vIoU@0.3 | vIoU@0.5
STGVT* [76]              | 18.15  | 26.81    | 9.48
Vanilla (ours)           | 17.94  | 25.59    | 8.75
SVG-net (ours)           | 19.45  | 27.78    | 10.12
SVG-net + TBR-net (ours) | 22.41  | 30.31    | 12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.
Through the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) Also, the results show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3. (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

Methods           | Stages | Declarative Sentence Grounding | Interrogative Sentence Grounding
                  |        | m_vIoU vIoU@0.3 vIoU@0.5       | m_vIoU vIoU@0.3 vIoU@0.5
SVG-net           | -      | 22.87  28.31    16.31          | 21.57  24.09    14.33
SVG-net + IFB-net | 1      | 23.41  28.84    16.69          | 22.12  24.48    14.57
                  | 2      | 23.82  28.97    16.85          | 22.55  24.69    14.71
                  | 3      | 23.77  28.91    16.87          | 22.61  24.72    14.79
SVG-net + IAB-net | 1      | 23.27  29.57    16.39          | 21.78  24.93    14.31
                  | 2      | 23.74  29.98    16.52          | 22.21  25.42    14.59
                  | 3      | 23.85  30.15    16.57          | 22.43  25.71    14.47
SVG-net + TBR-net | 1      | 24.32  29.92    17.22          | 22.69  25.37    15.29
                  | 2      | 24.95  30.75    17.02          | 23.34  26.03    15.83
                  | 3      | 25.21  30.99    18.21          | 23.67  26.26    16.01
Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net, and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary
In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continually grow in the future. The learning techniques for video localization will thus attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have also developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our works and discuss the potential directions of video localization in the future.
6.1 Conclusions
The contributions of this thesis are summarized as follows:
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB or flow) by leveraging the information from the other stream (i.e., flow or RGB) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work
A potential direction of video localization in the future could be unifying spatial video localization and temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more efforts are still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.
Bibliography
[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.
[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT' 98. Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.
[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.
[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.
[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.
[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.
[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.
[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.
[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.
[13] J. Gao, K. Chen, and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.
[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.
[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[18] G. Gkioxari and J. Malik. "Finding action tubes". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In: The IEEE International Conference on Computer Vision (ICCV). 2017.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.
[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.
[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.
[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.
[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.
[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).
[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.
[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In: AAAI. 2020, pp. 11336–11344.
[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In: arXiv. 2019.
[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.
[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.
[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.
[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference (BMVC), York, UK, September 19–22, 2016.
[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In: International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In: The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "FishNet: A versatile backbone for image, region, and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
For Mama and Papa
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research skills, including how to write high-quality research papers, which have helped me become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through useful discussions with them, I worked out many difficult problems and was encouraged to explore the unknown in the future.
Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.
Rui SU
University of Sydney
June 17, 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv
1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11
2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26
3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60
4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82
5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102
6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104
Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S_{n-2}^RGB and features P_{n-2}^RGB, Q_{n-2}^Flow are denoted as S_0^RGB, P_0^RGB, and Q_0^Flow. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to "Detection Head" to generate the updated segments. 66
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset 80
4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5 48
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for actionness detector and "ASD" stands for action segment detector 49
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5 50
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds 51
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset 52
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector 52
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset 52
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset 54
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset 55
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset 55
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages 55
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages 55
4.1 Comparison (mAPs, %) on the THUMOS14 dataset using different tIoU thresholds 74
4.2 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used 75
4.3 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied 75
4.4 Comparison (mAPs, %) on the UCF-101-24 dataset using different st-IoU thresholds 76
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset 77
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset 78
5.1 Results of different methods on the VidSTG dataset ("" indicates the results are quoted from its original work) 99
5.2 Results of different methods on the HC-STVG dataset ("" indicates the results are quoted from the original work) 100
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset 101
Chapter 1
Introduction
With the massive use of digital cameras, smartphones, and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of the action instances. The temporal action localization task aims to capture only the starting and ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects in the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms using deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
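To make the output formats of these tasks concrete, the sketch below shows how an action tube (spatio-temporal localization) and a temporal segment (temporal localization) might be represented as simple data structures; a grounded object tube has the same shape as an action tube. All class and field names here are hypothetical, for illustration only, and are not part of the proposed frameworks.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]

@dataclass
class ActionTube:
    """Spatio-temporal localization output: one box per action frame,
    linked over time into a tube."""
    label: str
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box

    def temporal_extent(self) -> Tuple[int, int]:
        # The tube spans from its first to its last detected frame.
        frames = sorted(self.boxes)
        return frames[0], frames[-1]

@dataclass
class TemporalSegment:
    """Temporal localization output: start/end times only, no boxes."""
    label: str
    start: float  # seconds
    end: float

tube = ActionTube(label="TennisSwing",
                  boxes={10: (5, 5, 50, 90), 11: (6, 5, 51, 90)})
assert tube.temporal_extent() == (10, 11)
```

A spatio-temporal visual grounding result would reuse the tube structure, with the label replaced by the textual query it answers.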
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatio-temporal Action Localization
Recently, deep neural networks have significantly improved performance in various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task the input is an untrimmed video, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, this task has attracted increasing research interest. Cluttered backgrounds, occlusion and large intra-class variance make spatio-temporal action localization very challenging.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level for recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either the appearance or the motion clue alone may succeed or fail to detect region proposals in different scenarios. However, the two streams can help each other by providing region proposals to each other.
For example, region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information alone, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for improving the existing temporal refinement methods, which is our second motivation.
Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
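To make the proposal-level cooperation concrete, the following is a minimal, hypothetical sketch of how region proposals from the RGB and flow streams could be pooled into one candidate set, with heavily overlapping duplicates suppressed by standard non-maximum suppression. The function names and the IoU threshold are illustrative; this is not the exact combination strategy of PCSC.

```python
# Hypothetical sketch: merge region proposals from two streams, then
# suppress near-duplicates with standard greedy NMS.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def merge_stream_proposals(rgb_props, flow_props, iou_thresh=0.7):
    """Union the two proposal sets (each a list of (box, score) pairs),
    keeping only the highest-scoring box among heavy overlaps."""
    combined = sorted(rgb_props + flow_props, key=lambda p: -p[1])
    kept = []
    for box, score in combined:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

For instance, a box found by both streams survives once with its higher score, while a box found only by the flow stream is added to the set, which is how one stream can recover proposals the other misses.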
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across action boundaries. Our approach works better than an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues, detect accurate temporal boundaries, and further improve spatio-temporal localization results at the video level.
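As a rough illustration of how per-frame actionness scores translate into temporal extents, the sketch below thresholds a score sequence and groups consecutive above-threshold frames into segments. This is a deliberately simplified stand-in: the thesis' actionness detectors are learned class-specific classifiers, not a fixed threshold, and the function name is hypothetical.

```python
def actionness_to_segments(scores, thresh=0.5):
    """Group consecutive frames whose actionness score exceeds the
    threshold into (start_frame, end_frame) segments, end inclusive."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i                      # a segment begins
        elif s < thresh and start is not None:
            segments.append((start, i - 1))  # a segment ends
            start = None
    if start is not None:                  # segment runs to the last frame
        segments.append((start, len(scores) - 1))
    return segments
```

On a score sequence such as [0.1, 0.8, 0.9, 0.2, 0.7, 0.6], this yields the two segments (1, 2) and (4, 5), showing how sharp actionness transitions define the temporal boundaries.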
Our contributions are briefly summarized as follows:
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to object detection methods like [58], prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
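Recall rates and boundary accuracy in this two-stage pipeline are both measured with the standard temporal intersection-over-union between a proposal and a ground-truth segment, which can be written in a few lines:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two 1D segments (start, end), the standard
    overlap measure for matching segment proposals to ground truth."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A proposal spanning 0-10 s against a ground truth spanning 5-15 s overlaps for 5 s out of a 15 s union, giving a tIoU of 1/3; a proposal set has high recall when most ground-truth segments are matched above a chosen tIoU threshold.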
For the first issue, as shown in Figure 1.2, we observe that proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.
Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often have accurate boundaries but suffer from low recall rates. Prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. However, we argue that the features from the anchor-based methods and the frame-based methods represent different information about the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
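To illustrate the anchor-based side, the sketch below decodes a pre-defined 1D anchor into a segment proposal from two predicted offsets: a shift of the segment center and a log-scale change of its length. This center/length parameterization is a common illustrative choice, not the exact regression target of any particular method discussed here.

```python
import math

def decode_anchor(anchor, d_center, d_length):
    """Decode a 1D anchor (start, end) with predicted offsets:
    d_center shifts the center (in units of anchor length) and
    d_length rescales the length on a log scale."""
    center = 0.5 * (anchor[0] + anchor[1])
    length = anchor[1] - anchor[0]
    new_center = center + d_center * length
    new_length = length * math.exp(d_length)
    return (new_center - 0.5 * new_length, new_center + 0.5 * new_length)
```

With zero offsets the anchor is returned unchanged; a predicted center shift of 0.5 on a (10, 20) anchor moves the proposal to (15, 25), which is why anchor-based proposals have high recall (some anchor is always near the action) but coarse boundaries (the regression is only as fine as the features pooled from the anchor).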
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also updates the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.
Visual grounding in images/videos is a related task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although recent works [92, 34, 48, 91] have attempted to remove the dependency on the proposal pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair consisting of a video clip and a textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].
Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47] named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input features. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes of reasonable quality without requiring any pre-trained object detector. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage, two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domains for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modal representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this thesis and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Units (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition (ILSVRC-2012) at that time, which was a huge improvement over the second best (26.2%). This work sparked wide research interest in using deep learning methods to solve computer vision problems.
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases without shrinking the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
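The receptive-field equivalences above can be checked with a short calculation: with stride 1, each layer with kernel size k adds (k - 1) to the effective receptive field.

```python
def stacked_rf(kernels, strides=None):
    """Effective receptive field of a stack of conv layers: the field
    grows by (k - 1) * jump at each layer, where jump is the product
    of the strides of all earlier layers."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two 3x3 layers cover the same field as one 5x5; three cover a 7x7.
# With C input and output channels, three 3x3 layers use 3 * 9 * C * C
# weights versus 49 * C * C for a single 7x7 layer -- fewer parameters
# for the same receptive field, with extra nonlinearities in between.
```

So stacked_rf([3, 3]) is 5 and stacked_rf([3, 3, 3]) is 7, matching the 5×5 and 7×7 equivalences stated in the text.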
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module which takes the output of the previous inception module as the input to a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
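The concatenation step of an inception module can be sketched in a few lines. Here a feature map is simply a list of channel planes; each parallel branch must preserve the spatial size (via padding) so only the channel counts differ. The representation and function name are illustrative, not the original implementation.

```python
# Sketch of an inception module's channel-wise concatenation.
def inception_concat(branch_outputs):
    """Concatenate branch outputs along the channel dimension; a
    feature map is represented as a list of channel planes."""
    return [channel for branch in branch_outputs for channel in branch]

# Parallel branches producing 16, 32 and 8 channels of the same 8x8 size
plane = [[0.0] * 8 for _ in range(8)]
branches = [[plane] * c for c in (16, 32, 8)]
out = inception_concat(branches)   # 16 + 32 + 8 = 56 channels
```

The module's output width is simply the sum of the branch widths, which is why channel counts grow quickly when inception modules are stacked.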
VGG-Net and Inception have shown that performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply stacking more convolutional layers to build the network, the image classification performance easily saturates and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients can accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
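The shortcut connection itself is just an element-wise addition of the block's input onto its output, y = x + F(x), which is why the identity path lets gradients and features flow through unchanged. A minimal sketch, with `transform` standing in for the block's stacked convolutional layers:

```python
def residual_block(x, transform):
    """y = x + F(x): add the block's transformation back onto its
    input. If F collapses to zero, the block reduces to the identity,
    so extra depth cannot make the mapping worse."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, 2.0, 3.0]
y = residual_block(x, lambda v: [0.1 * vi for vi in v])  # small residual
```

The zero-residual case, residual_block(x, lambda v: [0.0] * len(v)), returns x itself, illustrating why very deep stacks of such blocks remain trainable.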
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections to other blocks. Specifically, each dense block takes the concatenation of the outputs from all the previous dense blocks as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
2.1.2 Object Detection
The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one drawback of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.
To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed the entire image into the network only once to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference times are largely reduced.
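The essence of ROI pooling is to map an arbitrarily sized region of the shared feature map to a fixed grid by max-pooling within bins. The sketch below illustrates the idea on a single-channel map represented as nested lists; the bin quantization is simplified relative to the original Fast RCNN operation.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Pool an ROI (x1, y1, x2, y2, in feature-map coordinates) into
    a fixed out_size x out_size grid by taking the max in each bin."""
    x1, y1, x2, y2 = roi
    region = [row[x1:x2] for row in feature_map[y1:y2]]
    h, w = len(region), len(region[0])
    pooled = []
    for i in range(out_size):
        # bin boundaries; max(...) guards against empty bins
        r0 = i * h // out_size
        r1 = max((i + 1) * h // out_size, r0 + 1)
        row_out = []
        for j in range(out_size):
            c0 = j * w // out_size
            c1 = max((j + 1) * w // out_size, c0 + 1)
            row_out.append(max(max(r[c0:c1]) for r in region[r0:r1]))
        pooled.append(row_out)
    return pooled
```

Because every ROI comes out at the same fixed size regardless of its original shape, all proposals can share one backbone pass and one fully connected detection head, which is exactly the speedup Fast RCNN obtains over RCNN.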
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image is fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors at each location of the generated feature map is used by a separate network to generate region proposals. The final detection results are obtained from a detection head by using the pooled features produced by the ROI pooling operation on the generated feature map.
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. YOLO is faster than Faster RCNN as it does not require region feature extraction using the time-consuming ROI pooling operation.
Similarly, Liu et al. [44] proposed the single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
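Turning such a heat map into detections amounts to finding local maxima above a score threshold; each surviving peak, together with the size predicted at that point, yields one box. The sketch below is a simplified stand-in for the max-pooling based peak extraction used in practice, operating on a nested-list heat map.

```python
def heatmap_peaks(heat, thresh=0.5):
    """Return (row, col) positions that are local maxima of the
    center-point heat map (within a 3x3 neighbourhood) and exceed
    the score threshold."""
    rows, cols = len(heat), len(heat[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            if heat[r][c] < thresh:
                continue
            neighbours = [heat[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))]
            if heat[r][c] == max(neighbours):
                peaks.append((r, c))
    return peaks
```

Because duplicate suppression happens directly on the heat map, no anchor enumeration or NMS over box sets is needed, which is the main simplification anchor-free detectors offer.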
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames of the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict the action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and temporal networks achieves the best action recognition performance, which provided insights into how to effectively combine information from different complementary modalities.
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks (TSN) to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. They therefore divided a video into segments and
randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus function was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped optical flow and RGB differences, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helped prevent the framework from overfitting the training samples.
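The sparse sampling scheme above can be sketched as follows (a simplified version under our own naming; TSN also handles remainder frames and test-time sampling differently):

```python
import random

def sample_tsn_frames(num_frames, num_segments, rng=random):
    """TSN-style sparse sampling sketch: split the frame indices into equal
    segments and draw one random frame from each, so distant parts of the
    video are all represented without processing redundant consecutive
    frames. Assumes num_frames is divisible-enough (remainder is ignored)."""
    seg_len = num_frames // num_segments
    return [s * seg_len + rng.randrange(seg_len) for s in range(num_segments)]
```

At training time the randomness acts as a form of temporal data augmentation, since each epoch sees a different frame from each segment.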
2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3x3x3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers named C3D. Compared with the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
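The following toy example illustrates how a 3x3x3 kernel consumes a clip, mixing information across time as well as space (a naive single-channel "valid" convolution written for clarity, not the actual C3D implementation):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution over a single-channel clip (T, H, W)
    with a (kT, kH, kW) kernel. Each output value summarises a small
    spatio-temporal cube of the input."""
    T, H, W = clip.shape
    kT, kH, kW = kernel.shape
    out = np.empty((T - kT + 1, H - kH + 1, W - kW + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = (clip[t:t + kT, y:y + kH, x:x + kW] * kernel).sum()
    return out

clip = np.arange(5 * 6 * 6, dtype=float).reshape(5, 6, 6)
out = conv3d_valid(clip, np.ones((3, 3, 3)) / 27.0)  # 3x3x3 averaging kernel
# output shape: (3, 4, 4) -- shrunk by kernel_size - 1 along every axis
```

A 2D convolution applied frame by frame would produce one output per input frame with no mixing across `t`; the shrunken temporal axis here is exactly the footprint of the 3x3x3 kernel in time.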
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream R-CNN framework for spatio-temporal action localization. In addition to an R-CNN framework that takes RGB images as input to generate region proposals, they also developed an optical-flow-based R-CNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme into the R-CNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be incorporated. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these
two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by enforcing label consistency.
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the R-CNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
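A greedy online linking step of this kind can be sketched as follows (a simplified illustration under our own names, not the exact algorithm of [67]): each existing tube is extended with the best-overlapping, highest-scoring detection of the current frame, and unmatched detections start new tubes.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_detections(frames, iou_thresh=0.3):
    """Greedy online linking sketch. `frames` is a list of per-frame
    detection lists; each detection is ((x1, y1, x2, y2), score).
    Returns a list of tubes (lists of detections)."""
    tubes = []
    for dets in frames:
        used = set()
        for tube in tubes:
            best, best_s = None, -1.0
            for k, (box, score) in enumerate(dets):
                if k not in used and iou(tube[-1][0], box) >= iou_thresh and score > best_s:
                    best, best_s = k, score
            if best is not None:
                tube.append(dets[best])
                used.add(best)
        for k, det in enumerate(dets):
            if k not in used:
                tubes.append([det])  # unmatched detection starts a new tube
    return tubes
```

Because tubes are extended one frame at a time, this style of linking needs no access to future frames, which is what makes online operation possible.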
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed the ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes forming a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form a representation for the whole segment. This representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than the action tubes linked from frame-level detections, because they were produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed into a classification network based on C3D to predict labels. They then used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as inputs of this localization model, and proposed a new loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49 10 29 98 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which were then used to determine the temporal boundaries of actions for proposal generation.
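The thresholding step shared by this line of work can be sketched as follows (a simplified illustration under our own naming; the individual methods use more elaborate grouping and scoring criteria):

```python
def group_actionness(scores, thresh=0.5):
    """Group consecutive snippets whose actionness score exceeds `thresh`
    into (start, end) index pairs, one pair per candidate proposal."""
    proposals, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i                       # a run of action snippets begins
        elif s < thresh and start is not None:
            proposals.append((start, i - 1))  # the run ends at the previous snippet
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1))
    return proposals
```

A single low score in the middle of a true action splits the run in two, which is precisely the failure mode (low recall) discussed below for this family of methods.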
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce
the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced a new effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared with other state-of-the-art methods.
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer granularity of temporal detail. However, these methods can easily predict wrongly low actionness scores for frames within the action instances, which results in low recall rates.
The other set of works [9 15 85 5] applied anchor-based object detection frameworks, which first generated a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers were then used to refine the locations and predict the action classes of the anchors.
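The boundary regression step of this paradigm can be sketched as follows. The (center, width) offset parameterization below is one common choice borrowed from anchor-based object detection, and not necessarily the exact parameterization used in [9 15 85 5]:

```python
import numpy as np

def refine_anchors(anchors, offsets):
    """Apply predicted offsets to temporal anchors (start, end).
    Offsets are (dc, dw): a shift of the center measured in anchor widths,
    and a log-scale change of the width. Sketch of anchor-based temporal
    boundary regression."""
    starts, ends = anchors[:, 0], anchors[:, 1]
    centers = (starts + ends) / 2
    widths = ends - starts
    new_c = centers + offsets[:, 0] * widths     # shift the segment center
    new_w = widths * np.exp(offsets[:, 1])       # rescale the segment width
    return np.stack([new_c - new_w / 2, new_c + new_w / 2], axis=1)

anchors = np.array([[0.0, 10.0], [20.0, 40.0]])
offsets = np.array([[0.1, 0.0], [0.0, np.log(0.5)]])
refined = refine_anchors(anchors, offsets)
# first anchor shifts right by one unit; second halves its width around its center
```

Because the network only predicts small corrections to a dense set of pre-placed anchors, most ground-truth segments are covered by at least one anchor, which explains the high recall of this paradigm.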
Lin et al. [36] introduced a single-shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot MultiBox Detector [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the
prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster R-CNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster R-CNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variation of action durations can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated by pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision-and-language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes via a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were then fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods; the recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision-and-language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both the spatial regions (resp. temporal segment proposals) and the features from one stream (i.e. the Flow/RGB stream) to help another stream (i.e. the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatial-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8 60] as well as the Single Shot MultiBox Detector (SSD) in [67 26].
Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26 8] stack neighbouring frames around one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66 55 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4 77] to exploit temporal information. Several works [31 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatial-temporal action localization.
Third, temporal localization forms action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g. the traditional temporal sliding windows [54], dynamic programming [67 60], tubelet linking [26] and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60 26 54 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67 26] that fuse two-stream information only at the feature level in a late fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
An overview of our PCSC method (i.e. the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g. the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g. the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
[Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve the action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).]
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.
In this paper, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g. RGB) at the previous stage.
3.2 Action Detection Model in Spatial Domain
The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).
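The asymmetric NMS thresholds above can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation; the function names and the threshold values (0.7 for a stream's own boxes, 0.3 for the cross-stream boxes) are assumptions for illustration only.

```python
import numpy as np

def nms(boxes, scores, thresh):
    """Standard NMS. boxes: (N, 4) arrays of [x1, y1, x2, y2]; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # process boxes in descending score order
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box against all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]  # suppress heavily-overlapping boxes
    return keep

def refine_proposals(own_boxes, own_scores, other_boxes, other_scores,
                     own_thresh=0.7, cross_thresh=0.3):
    """Combine a stream's own detection boxes with the other stream's boxes;
    a lower NMS threshold on the cross-stream boxes removes more redundancy."""
    own_keep = nms(own_boxes, own_scores, own_thresh)
    other_keep = nms(other_boxes, other_scores, cross_thresh)
    return np.vstack([own_boxes[own_keep], other_boxes[other_keep]])
```

In practice the kept boxes from both streams form the refined proposal set that is fed to the next-stage detection head.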
Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function from the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the region proposal set from the RPN of the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from another stream, which helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-stream feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
Chapter 3. Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled in the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2$, $C^i_3$, $C^i_4$, $C^i_5$, where $i \in \{\text{RGB}, \text{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2$, $U^i_3$, $U^i_4$, $U^i_5$.
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which are
insufficient for the features from the two streams to exchange information from one stream to another and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge these two streams so that they help each other for feature refinement.
We pass the messages between the feature maps in the bottom-up pathway of the two streams. Denote by $l \in \{2, 3, 4, 5\}$ the index for the set of feature maps $\{C^i_2, C^i_3, C^i_4, C^i_5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_l$ with the help of the features $C^{Flow}_l$ as follows:

$$\hat{C}^{RGB}_l = f_\theta(C^{Flow}_l) \oplus C^{RGB}_l, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{Flow}_l)$ nonlinearly extracts the message from the feature $C^{Flow}_l$ and then uses the extracted message to improve the features $C^{RGB}_l$.

The output of $f_\theta(\cdot)$ must have the same number of channels and the same resolution as $C^{RGB}_l$ and $C^{Flow}_l$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages by using a two-layer non-linear transform. Once the feature maps $C^{RGB}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{RGB}_l$ in the top-down pathway are generated accordingly.
The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the subsequent message-passing stages.
Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the features $F^{RGB}_t$ and $F^{Flow}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{Flow}_t$ of the flow stream is used to help improve the ROI feature $F^{RGB}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{RGB}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{RGB}_t$ of the RGB stream is also used to help improve the ROI feature $F^{Flow}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] and stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample used to train the RPN in our PCSC method is composed of $k$ neighbouring frames with the key frame in the middle. The region proposals generated from the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region proposal-level cooperation process.
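The IoU-based label assignment above can be sketched as follows (plain Python; boxes are `[x1, y1, x2, y2]` lists, and the function names are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_labels(proposals, gt_boxes, thresh=0.5):
    """Label a proposal positive (1) if its IoU with any ground-truth box
    exceeds the threshold, negative (0) otherwise; applied both to RPN
    proposals and to the additional boxes from the assistant stream."""
    return [1 if any(iou(p, g) > thresh for g in gt_boxes) else 0
            for p in proposals]
```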
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we then build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
[Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector (AD) and the Action Segment Detector (ASD) to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2. Each stage comprises the cooperation modules and the detection head. The segment proposal-level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment feature-level cooperation module improves the features $F^i_t$, where the superscript $i \in \{\text{RGB}, \text{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.]
linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from action tubes to refine the temporal boundaries of the action tubes. It is built upon the two-stream framework and consists of four modules: the Pre-processing module, the Initial Segment Generation module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and the Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the $7 \times 7$ feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization; it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundary of the action tubes. These "confusing samples" are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of actionness belonging to class i.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, and we can thereby refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
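The smoothing-and-thresholding step can be sketched as below; the default window size (17) and threshold (0.01) follow the implementation details in Section 3.4.1, while the edge-padding choice for the median filter is an assumption.

```python
import numpy as np

def refine_tube(actionness, window=17, thresh=0.01):
    """Smooth per-frame actionness scores of a linked tube with a median
    filter, then keep only the frames whose smoothed score reaches the
    threshold. Returns the indices of the kept frames."""
    n = len(actionness)
    half = window // 2
    padded = np.pad(actionness, half, mode="edge")  # replicate boundary scores
    smoothed = np.array([np.median(padded[i:i + window]) for i in range(n)])
    return np.flatnonzero(smoothed >= thresh)
```

Frames filtered out at the two ends of a tube effectively move its start and end points, yielding the refined temporal boundaries.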
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster R-CNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the framework in [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction, and the detection head are introduced below.
(a) SPN. In Faster R-CNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] to predict each anchor based on anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
[Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.]
but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale $s$, we require the receptive field to cover the context with temporal span $s/2$ before the start and after the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three
types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented in our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with a length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
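The construction of $F_{seg}$ can be sketched as follows. Using max-pooling inside the temporal ROI-Pooling is an assumption (the thesis does not specify the pooling operator), and the function names are illustrative.

```python
import numpy as np

def temporal_roi_pool(features, start, end, out_len):
    """Pool tube features of shape (D, T) over a temporal interval into a
    fixed length out_len, analogous to spatial ROI-Pooling (max-pooling
    assumed per bin)."""
    D, T = features.shape
    edges = np.linspace(max(0.0, start), min(float(T), end), out_len + 1)
    bins = []
    for k in range(out_len):
        lo = int(edges[k])
        hi = max(lo + 1, int(edges[k + 1]))  # guarantee a non-empty bin
        bins.append(features[:, lo:hi].max(axis=1))
    return np.stack(bins, axis=1)  # (D, out_len)

def segment_features(features, t_start, t_end, out_len):
    """Concatenate F_start, F_center and F_end; the boundary features are
    pooled from intervals of length 2*l_s/5 centred at the segment's start
    and end points, as described above."""
    ls = t_end - t_start
    f_center = temporal_roi_pool(features, t_start, t_end, out_len)
    f_start = temporal_roi_pool(features, t_start - ls / 5, t_start + ls / 5, out_len)
    f_end = temporal_roi_pool(features, t_end - ls / 5, t_end + ls / 5, out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)  # (D, 3 * out_len)
```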
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting the actionness for each frame, which eventually helps refine the boundaries of the segments.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two $1 \times 1$ convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detector due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes that have the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these per-class averaged IoU scores over all classes.
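The MABO computation described above can be sketched as follows, assuming the best-overlap IoU per ground-truth box has already been collected per class (the input format is illustrative):

```python
import numpy as np

def mabo(per_class_best_ious):
    """Mean Average Best Overlap: for each class, average the best IoU
    achieved against each ground-truth box (the Average Best Overlap),
    then average these per-class values over all classes.

    per_class_best_ious: dict mapping class name -> list of best-overlap
    IoU scores, one per ground-truth box of that class."""
    abo = [float(np.mean(ious)) for ious in per_class_best_ious.values()]
    return float(np.mean(abo))
```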
To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC-based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set to 0.01 and is decreased by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size to 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set to 0.01. For segment generation, we apply anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as [20, 80], [80, 160], [160, 320] and [320, ∞], i.e., we have M = 4, and the corresponding feature lengths $l_t$ for these length ranges are set to 15, 30, 60 and 120, respectively. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set to 128 and 32, respectively. Our PCSC-based temporal refinement method is trained for 5 epochs, and the initial learning rate is set to 0.001, which is decreased by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules, while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                    Streams     mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow    35.8
Peng and Schmid [54]                      RGB+Flow    65.7
Ye, Yang and Tian [94]                    RGB+Flow    67.0
Kalogeiton et al. [26]                    RGB+Flow    69.5
Gu et al. [19]                            RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]             RGB         73.2
Faster R-CNN + FPN (Flow) [38]            Flow        64.7
Faster R-CNN + FPN [38]                   RGB+Flow    75.5
PCSC (Ours)                               RGB+Flow    79.2
Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] with an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
34 Experimental results 49
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.

Method (IoU threshold δ)           Streams         0.2    0.5    0.75   0.5:0.95
Weinzaepfel et al. [83]            RGB+Flow        46.8   -      -      -
Zolfaghari et al. [105]            RGB+Flow+Pose   47.6   -      -      -
Peng and Schmid [54]               RGB+Flow        73.5   32.1   2.7    7.3
Saha et al. [60]                   RGB+Flow        66.6   36.4   0.8    14.4
Yang, Gao and Nevatia [93]         RGB+Flow        73.5   37.8   -      -
Singh et al. [67]                  RGB+Flow        73.5   46.3   15.0   20.4
Ye, Yang and Tian [94]             RGB+Flow        76.2   -      -      -
Kalogeiton et al. [26]             RGB+Flow        76.5   49.2   19.7   23.4
Li et al. [31]                     RGB+Flow        77.9   -      -      -
Gu et al. [19]                     RGB+Flow        -      59.9   -      -
Faster R-CNN + FPN (RGB) [38]      RGB             77.5   51.3   14.2   21.9
Faster R-CNN + FPN (Flow) [38]     Flow            72.7   44.2   7.4    17.2
Faster R-CNN + FPN [38]            RGB+Flow        80.1   53.2   15.9   23.7
PCSC + AD [69]                     RGB+Flow        84.3   61.0   23.0   27.8
PCSC + ASD + AD + T-PCSC (Ours)    RGB+Flow        84.7   63.1   23.6   28.9
per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they perform worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metrics [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC module) is applied to refine the action tubes generated from our PCSC method. Our preliminary
work [69], which only uses the actionness detector module for temporal boundary refinement, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with the other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ = 0.5.
Method                              Streams    mAP
Peng and Schmid [54]                RGB+Flow   58.5
Ye, Yang and Tian [94]              RGB+Flow   63.2
Kalogeiton et al. [26]              RGB+Flow   65.7
Hou, Chen and Shah [21]             RGB        61.3
Gu et al. [19]                      RGB+Flow   73.3
Sun et al. [72]                     RGB+Flow   77.9
Faster R-CNN + FPN (RGB) [38]       RGB        64.7
Faster R-CNN + FPN (Flow) [38]      Flow       68.4
Faster R-CNN + FPN [38]             RGB+Flow   70.2
PCSC (Ours)                         RGB+Flow   74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.
Method                              Streams         0.2    0.5    0.75   0.5:0.95
Gkioxari and Malik [18]             RGB+Flow        -      53.3   -      -
Wang et al. [79]                    RGB+Flow        -      56.4   -      -
Weinzaepfel et al. [83]             RGB+Flow        63.1   60.7   -      -
Saha et al. [60]                    RGB+Flow        72.6   71.5   43.3   40.0
Peng and Schmid [54]                RGB+Flow        74.1   73.1   -      -
Singh et al. [67]                   RGB+Flow        73.8   72.0   44.5   41.6
Kalogeiton et al. [26]              RGB+Flow        74.2   73.7   52.1   44.8
Ye, Yang and Tian [94]              RGB+Flow        75.8   73.8   -      -
Zolfaghari et al. [105]             RGB+Flow+Pose   78.2   73.4   -      -
Hou, Chen and Shah [21]             RGB             78.4   76.9   -      -
Gu et al. [19]                      RGB+Flow        -      78.6   -      -
Sun et al. [72]                     RGB+Flow        -      80.1   -      -
Faster R-CNN + FPN (RGB) [38]       RGB             74.5   73.9   54.1   44.2
Faster R-CNN + FPN (Flow) [38]      Flow            77.9   76.1   56.1   46.3
Faster R-CNN + FPN [38]             RGB+Flow        79.1   78.5   57.2   47.6
PCSC (Ours)                         RGB+Flow        82.6   82.2   63.1   52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
Stage   PCSC w/o feature cooperation   PCSC
0       75.5                           78.1
1       76.1                           78.6
2       76.4                           78.8
3       76.7                           79.1
4       76.7                           79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here, "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.
Method                         Video mAP (%)
Faster R-CNN + FPN [38]        53.2
PCSC                           56.4
PCSC + AD                      61.0
PCSC + ASD (single branch)     58.4
PCSC + ASD                     60.7
PCSC + ASD + AD                61.9
PCSC + ASD + AD + T-PCSC       63.1
Table 3.7: Ablation study for our temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
Stage   Video mAP (%)
0       61.9
1       62.8
2       63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, the improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
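The stage-wise output combination described above (take the union of all boxes detected up to the current stage, then suppress duplicates with NMS) can be sketched in a few lines of NumPy. The helper names `nms` and `combine_stage_outputs` are illustrative, not code from the thesis:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Standard non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # IoU between the current top-scoring box and the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
            (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

def combine_stage_outputs(stage_boxes, stage_scores, iou_thresh=0.5):
    """Output at stage k = NMS over the union of detections from stages 0..k."""
    boxes = np.concatenate(stage_boxes, axis=0)
    scores = np.concatenate(stage_scores, axis=0)
    kept = nms(boxes, scores, iou_thresh)
    return boxes[kept], scores[kept]
```

Because NMS keeps the highest-scoring box of each overlapping cluster, later stages can only add or replace detections, which is consistent with the monotone improvement reported in Table 3.5.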
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, where the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization and the action segment detector (ASD) for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detector with a single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest Flow stream at each stage of our PCSC action detection method on the UCF-101-24 dataset.
Stage   LoR (%)
1       61.1
2       71.9
3       95.2
4       95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results could be complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 when using our temporal PCSC method with different numbers of stages to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9: The mean average best overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage of our PCSC action detection method on the UCF-101-24 dataset.
Stage   MABO (%)
1       84.1
2       84.3
3       84.5
4       84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
θ      Stage 1   Stage 2   Stage 3   Stage 4
0.5    55.3      55.7      56.3      57.0
0.6    62.1      63.2      64.3      65.5
0.7    68.0      68.5      69.4      69.5
0.8    73.9      74.5      74.8      75.1
0.9    78.4      78.4      79.5      80.1
Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
Stage                   0    1    2    3    4
Detection speed (FPS)   31   29   28   26   24
Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
Stage                   0    1    2
Detection speed (VPS)   21   17   15
ASD + AD" method, while Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated by the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the Flow stream. We define the large overlap ratio (LoR) as the percentage of RGB-stream proposals whose mIoU with the Flow-stream proposals is larger than 0.7, over the total number of RGB-stream proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest Flow stream at every stage on the UCF-101-24 dataset. Specifically, at Stage 1 or Stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., Stage 1 or Stage 3, respectively) with respect to the Flow-stream proposals generated at the previous stage (i.e., Stage 0 or Stage 2, respectively); at Stage 2 or Stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., Stage 1 or Stage 3, respectively) with respect to the Flow-stream proposals generated at the current stage (i.e., Stage 2 or Stage 4, respectively). We observe that the LoR increases from 61.1% at Stage 1 to 95.4% at Stage 4, which demonstrates that the similarity of the region proposals from the two streams increases over the stages. At Stage 4, the region proposals from the two streams are very similar to each other, and therefore the two sets of proposals lack complementary information. As a result, there is no further improvement after Stage 4.
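The LoR statistic defined above can be computed directly from the two proposal sets. The following is a minimal NumPy sketch; `box_iou` and `large_overlap_ratio` are hypothetical helper names used only for illustration:

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def large_overlap_ratio(rgb_props, flow_props, thresh=0.7):
    """Fraction of RGB-stream proposals whose maximum IoU (mIoU) with
    the Flow-stream proposals exceeds the given threshold."""
    mious = np.array([box_iou(p, flow_props).max() for p in rgb_props])
    return (mious > thresh).mean()
```

A rising LoR over stages thus directly measures how quickly the two streams' proposal sets converge.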
To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure its localization and classification performance, respectively. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, a clear improvement after each stage can be observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.
[Figure legend for Figs. 3.4 and 3.5: ground-truth instances; detected instances from the two-stream Faster-RCNN; detected instances from our PCSC; detected instances from our PCSC that are eventually removed by our TBR. The accompanying schematic panels depict (a) the single-stream action detection model (RPN, ROI pooling, region features, detection head, detected boxes) and (b) the single-stream action segment detector (SPN, segment feature extraction, segment features, detection head, detected segments) for the RGB or Flow stream.]
Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (Flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline Faster-RCNN method [38], and Stage 0 in Table 3.12 corresponds to the results from our initial segment generation module). Therefore, one can trade off between the localization performance and the speed by choosing an appropriate number of stages.
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b)-(d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method removes the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or Flow stream, a simple combination of the two streams,
and our two-stream cooperation method. The detection method using only the RGB or Flow stream leads to either missed detections (see Fig. 3.4(a)) or false detections (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of the two results can solve the missed detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).
Moreover, the effectiveness of our temporal refinement method is demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process using our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases in which our method does not work (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., the two-stream Faster-RCNN [38]) do not work well for this challenging case either. Therefore, how to reduce false detections in background frames is still an open issue,
which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/Flow) by leveraging the information from the other stream (i.e., Flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4
PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of work, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of work is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream and Flow-stream AFC modules are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates its effectiveness.
4.1 Background
4.1.1 Action Recognition
Video action recognition is an important research topic in the video analysis community. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] have applied convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams to exchange complementary information between the two modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to jointly model spatio-temporal information in videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos. For example, our temporal action localization method employs the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories, together with their spatial locations and temporal durations. Most works [67, 66, 55] used object detection frameworks to spatially localize actions in individual frames and then temporally linked the detected boxes to form action tubes. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on accurately detecting the starting and ending points of a predicted action instance along the time dimension of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness or to predict the action starting and ending times, which were then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as the boundaries are determined at a finer level with a larger aperture of temporal details. However, these segments often suffer from a low recall rate. In the second category, as the proposals are generated from predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually perform worse in terms of temporal boundary accuracy.
Either line of the aforementioned works could benefit from the inherent merits of the methods in the other line, leading to a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level
interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine the complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at both the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach
In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2), as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.
4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending times of each detected action $y_i$.
The proposed progressive cross-granularity cooperation frameworkbuilds upon the effective cooperation strategy between anchor-based lo-calization methods and frame-based schemes We briefly introduce eachcomponent as follows
Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.
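As a rough sketch of this step (the actual extractor in the thesis is a pre-trained I3D network; the `extractor` callable, snippet length, and stride below are illustrative assumptions), the $C \times T$ base feature map can be assembled snippet by snippet:

```python
import numpy as np

def extract_base_features(frames, extractor, snippet_len=8, stride=8):
    """Build a C x T feature map by applying a frozen snippet-level
    extractor (e.g. an I3D-like model) to consecutive snippets.
    `extractor` maps a (snippet_len, H, W, 3) clip to a C-dim vector."""
    cols = [extractor(frames[t:t + snippet_len])
            for t in range(0, len(frames) - snippet_len + 1, stride)]
    return np.stack(cols, axis=1)  # shape: (C, T)
```

The same routine would be run twice, once on RGB frames and once on stacked optical-flow maps, to obtain $B^{RGB}$ and $B^{Flow}$.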
Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stages and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in the later text), i.e., $\mathcal{P}^{RGB} = \{p^{RGB}_i \mid p^{RGB}_i = (t^{RGB}_{i,s}, t^{RGB}_{i,e})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p^{Flow}_i \mid p^{Flow}_i = (t^{Flow}_{i,s}, t^{Flow}_{i,e})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P^{RGB}_0$ and $P^{Flow}_0$ for each stream by applying two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$; after a segment-of-interest pooling operation [5], these features can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3). The outputs of the detection head are the refined action segments $S^{RGB}_0$ and $S^{Flow}_0$ and the predicted action categories.
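The segment-of-interest pooling operation referenced above is essentially a 1D analogue of ROI pooling: it converts a variable-length temporal span of the $C \times T$ feature map into a fixed-size representation. A minimal sketch under simplifying assumptions (max pooling over equal sub-intervals; `soi_pooling` is a hypothetical name, not the TAL-Net implementation) is:

```python
import numpy as np

def soi_pooling(features, segment, out_len=8):
    """1D segment-of-interest pooling: max-pool the C x T feature map
    within [t_start, t_end) into a fixed C x out_len representation."""
    c, t = features.shape
    t_start, t_end = segment
    # split the segment into out_len sub-intervals (at least one frame each)
    edges = np.linspace(t_start, t_end, out_len + 1)
    pooled = np.empty((c, out_len))
    for k in range(out_len):
        lo = int(np.floor(edges[k]))
        hi = max(int(np.ceil(edges[k + 1])), lo + 1)
        pooled[:, k] = features[:, lo:min(hi, t)].max(axis=1)
    return pooled
```

Because every proposal is mapped to the same $C \times$ `out_len` shape, a single shared detection head can score and regress proposals of arbitrary duration.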
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, denoted as $Q^{RGB}_0$ and $Q^{Flow}_0$, respectively. The frame-based feature of each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}^{RGB}_0$ and $\mathcal{Q}^{Flow}_0$, respectively, by pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
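The pairing-and-suppression step described above can be sketched as follows. The probability threshold, duration limits, and function names are illustrative assumptions rather than the thesis' actual hyper-parameters:

```python
import numpy as np

def temporal_iou(a, b):
    """IoU between two temporal segments a = (s, e) and b = (s, e)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def frame_based_proposals(p_start, p_end, prob_thresh=0.5,
                          min_len=2, max_len=64, nms_thresh=0.7):
    """Pair confident starting/ending positions into temporal proposals
    (t_s, t_e, score), then remove redundant pairs by temporal NMS."""
    starts = np.where(p_start > prob_thresh)[0]
    ends = np.where(p_end > prob_thresh)[0]
    proposals = [(s, e, p_start[s] * p_end[e])
                 for s in starts for e in ends
                 if min_len <= e - s <= max_len]
    # greedy temporal NMS: keep the highest-scoring non-overlapping pairs
    proposals.sort(key=lambda p: p[2], reverse=True)
    keep = []
    for s, e, sc in proposals:
        if all(temporal_iou((s, e), (ks, ke)) <= nms_thresh
               for ks, ke, _ in keep):
            keep.append((s, e, sc))
    return keep
```

Here `p_start` and `p_end` stand for the per-position start/end probability curves produced by the two $1 \times 1$ convolution branches of one stream.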
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals from the preceding two proposal generation
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB or Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For ease of presentation, when n = 1 the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both RGB and Flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively applies the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there is sufficient cross-granularity and cross-stream cooperation within each AFC module and between adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-Granularity Cooperation
The proposed progressive cross-granularity cooperation framework aims at simultaneously encouraging rich feature-level and proposal-level collaboration between the two granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and Flow).
Specifically, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the Flow stream, which are referred to as the RGB-stream AFC and the flow-stream AFC modules, respectively. For example, the RGB-stream AFC module at stage n (n is an odd number) employs the preceding two-stream detected action segments S^Flow_{n-1} and S^RGB_{n-2} as the input two-stream proposals, and uses the preceding two-stream anchor-based features P^Flow_{n-1} and P^RGB_{n-2} and the flow-stream frame-based features Q^Flow_{n-2} as the input features. This AFC stage outputs the RGB-stream action segments S^RGB_n and anchor-based features P^RGB_n, and also updates the flow-stream frame-based features as Q^Flow_n. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
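The stage scheduling described above can be made concrete with a small bookkeeping sketch (a hypothetical helper, not from the thesis): odd stages run the RGB-stream module, even stages the flow-stream one, and stage n reads the outputs of stages n-1 and n-2, with stage 0 denoting the initial features and proposals:

```python
# Illustrative sketch of the AFC stage scheduling and its input indexing.
# Returns which stream's module runs at stage n and which (stream, stage)
# outputs it consumes; indices below 0 fall back to the initial stage 0.

def afc_inputs(n):
    main, other = ("RGB", "Flow") if n % 2 == 1 else ("Flow", "RGB")
    return {
        "module": main + "-stream AFC",
        # detected segments S from the two preceding stages
        "segments_in": [(other, max(n - 1, 0)), (main, max(n - 2, 0))],
        # anchor-based features P from the two preceding stages
        "anchor_features_in": [(other, max(n - 1, 0)), (main, max(n - 2, 0))],
        # frame-based features Q of the other stream
        "frame_features_in": (other, max(n - 2, 0)),
    }
```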
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch into the frame-based branch through "Cross-granularity Message Passing", and then back into the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since rich cooperation is performed within and across adjacent AFC modules, the success in detecting temporal segments at earlier AFC stages is faithfully propagated to subsequent AFC stages, and thus the proposed progressive localization strategy can gradually improve the localization performance.
4.2.3 Anchor-Frame Cooperation Module
The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. Specifically, there are RGB-stream and flow-stream AFC modules, named according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features P^RGB_0, P^Flow_0 and Q^RGB_0, Q^Flow_0, as well as the proposals S^RGB_0 and S^Flow_0, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; its network structure is shown in Fig. 4.1(b). For feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features P^Flow_{n-1} to the frame-based features Q^Flow_{n-2}, both from the flow stream. The message passing operation is defined as

Q^Flow_n = M_{anchor→frame}(P^Flow_{n-1}) ⊕ Q^Flow_{n-2},    (4.1)

where ⊕ is the element-wise addition operation and M_{anchor→frame}(·) is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked 1×1 convolutional layers with ReLU non-linear activation functions. The improved features Q^Flow_n can generate more accurate action starting and ending probability distributions (i.e., more accurate action durations), as they absorb the temporal duration-aware anchor-based features; this in turn yields better frame-based proposals 𝒬^Flow_n, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as Q^RGB_n = M_{anchor→frame}(P^RGB_{n-1}) ⊕ Q^RGB_{n-2}, which flips the superscripts from Flow to RGB. The improved Q^RGB_n likewise leads to better frame-based RGB-stream proposals 𝒬^RGB_n.
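As a minimal illustration of the message passing operation in Eq. (4.1) (a sketch under the assumption that features are lists of per-timestep channel vectors; a real implementation would use a deep-learning framework, and the weights here are hypothetical): a 1×1 temporal convolution is simply a linear map applied independently at each timestep, and the message is added element-wise to the frame-based features:

```python
# Illustrative sketch of anchor-to-frame message passing:
# Q_n = M(P) + Q_{n-2}, with M = two stacked 1x1 convs with ReLU.

def conv1x1(features, weight, relu=True):
    """A 1x1 temporal convolution: a linear map applied per timestep."""
    out = []
    for x in features:  # x: channel vector at one timestep
        y = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
        if relu:
            y = [max(0.0, v) for v in y]
        out.append(y)
    return out

def message_passing(p_anchor, q_frame, w1, w2):
    """Apply two stacked 1x1 convs to the anchor features, then add."""
    m = conv1x1(conv1x1(p_anchor, w1), w2, relu=True)
    return [[a + b for a, b in zip(mt, qt)] for mt, qt in zip(m, q_frame)]
```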
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). Again taking the RGB-stream AFC module shown in Fig. 4.1(b) as an example, to enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals 𝒬^Flow_n by applying the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 to the improved frame-based features Q^Flow_n. We then combine 𝒬^Flow_n with the two-stream anchor-based proposals S^Flow_{n-1} and S^RGB_{n-2} to form an updated set of anchor-based proposals at stage n, i.e., 𝒫^RGB_n = S^RGB_{n-2} ∪ S^Flow_{n-1} ∪ 𝒬^Flow_n. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as 𝒫^Flow_n = S^Flow_{n-2} ∪ S^RGB_{n-1} ∪ 𝒬^RGB_n.
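At the proposal level, this combination is a set union followed by de-duplication; the sketch below (hypothetical helper names and duplicate threshold, not the authors' code) merges the two-stream anchor-based proposals with the newly generated frame-based proposals while dropping near-identical segments by tIoU:

```python
# Illustrative sketch of "Proposal Combination": union of several proposal
# sets, keeping only segments that are not near-duplicates of an already
# kept segment (judged by temporal IoU).

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def combine_proposals(*proposal_sets, dup_thresh=0.95):
    combined = []
    for proposals in proposal_sets:  # e.g. S_rgb, S_flow, Q_frame
        for p in proposals:
            if all(tiou(p, q) < dup_thresh for q in combined):
                combined.append(p)
    return combined
```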
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated by a message passing operation similar to Equation (4.1), i.e., P^RGB_n = M_{flow→rgb}(P^Flow_{n-1}) ⊕ P^RGB_{n-2} for the RGB stream and P^Flow_n = M_{rgb→flow}(P^RGB_{n-1}) ⊕ P^Flow_{n-2} for the Flow stream.
Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals 𝒫^RGB_n, 𝒫^Flow_n and the updated anchor-based features P^RGB_n, P^Flow_n. To make the detection head better perceive the context around action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (fixed to 10 in this work), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to exploit the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
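The pooling scheme in the detection head can be sketched as follows (illustrative only; the pooling operator is assumed here to be channel-wise max-pooling, which the text does not specify, and frame-level features are plain lists of channel vectors):

```python
# Illustrative sketch: SOI pooling over a proposal plus two extra pools
# around its start and end indices, concatenated into one feature vector.

def soi_pool(features, start, end):
    """Max-pool each channel over the frames in [start, end]."""
    window = features[start:end + 1]
    return [max(col) for col in zip(*window)]

def proposal_feature(features, start, end, interval=10):
    t = len(features)
    around_start = soi_pool(features, max(0, start - interval),
                            min(t - 1, start + interval))
    around_end = soi_pool(features, max(0, end - interval),
                          min(t - 1, end + interval))
    # concatenate: segment feature + boundary-context features
    return soi_pool(features, start, end) + around_start + around_end
```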
4.2.4 Training Details
At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. Specifically, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. The training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all the losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
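The joint objective can be sketched in plain Python (illustrative only; any loss weighting is omitted and the helper names are hypothetical):

```python
# Illustrative sketch of the joint loss: cross-entropy for action
# classification, smooth-L1 for boundary regression, and cross-entropy
# for the frame-wise start/end supervision in the frame-based branch.
import math

def cross_entropy(probs, label):
    return -math.log(max(probs[label], 1e-12))

def smooth_l1(pred, target):
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def joint_loss(cls_probs, cls_label, box_pred, box_target,
               start_probs, start_label):
    return (cross_entropy(cls_probs, cls_label)
            + smooth_l1(box_pred, box_target)
            + cross_entropy(start_probs, start_label))
```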
4.2.5 Discussions
Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and higher recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and Flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages. Note that even though
each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, and thus significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase with the number of stages, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. On the other hand, the number of proposals does not decrease either, as the proposed proposal cooperation strategy keeps absorbing a sufficient number of proposals from complementary sources.
4.3 Experiment
4.3.1 Datasets and Setup
Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation of the testing set is not publicly available.
- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes, with spatio-temporal annotations provided by [67]. We use an
existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We treat these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pre-trained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model using mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
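For the temporal case, this correctness criterion can be written directly (a sketch with hypothetical names; the st-IoU variant would additionally intersect the per-frame bounding boxes):

```python
# Illustrative sketch of the evaluation criterion: a detection counts as
# correct when its tIoU with a ground-truth instance reaches the
# threshold delta and the predicted label matches.

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(det_segment, det_label, gt_segment, gt_label, delta=0.5):
    return det_label == gt_label and tiou(det_segment, gt_segment) >= delta
```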
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL method outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5,
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ   0.1    0.2    0.3    0.4    0.5
SCNN [63]          47.7   43.5   36.3   28.7   19.0
REINFORCE [95]     48.9   44.0   36.0   26.4   17.1
CDC [62]           -      -      40.1   29.4   23.3
SMS [98]           51.0   45.2   36.5   27.8   17.8
TURN [15]          54.0   50.9   44.1   34.9   25.6
R-C3D [85]         54.5   51.5   44.8   35.6   28.9
SSN [103]          66.0   59.4   51.9   41.0   29.8
BSN [37]           -      -      53.5   45.0   36.9
MGG [46]           -      -      53.9   46.8   37.4
BMN [35]           -      -      56.0   47.4   38.8
TAL-Net [5]        59.8   57.1   53.2   48.5   42.8
P-GCN [99]         69.5   67.8   63.6   57.8   49.1
Ours               71.2   68.9   65.1   59.5   51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot fully capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider the proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without the GCN module, which is comparable to the performance of P-GCN.
ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = 0.5, 0.75, 0.95 on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, outperforming P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and
Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ   0.5     0.75    0.95   Average
CDC [62]           45.30   26.00   0.20   23.80
TCN [9]            36.44   21.15   3.90   -
R-C3D [85]         26.80   -       -      -
SSN [103]          39.12   23.48   5.49   23.98
TAL-Net [5]        38.23   18.30   1.30   20.22
P-GCN [99]         42.90   28.14   2.47   26.99
Ours               44.31   29.85   5.47   28.85
Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ   0.5     0.75    0.95   Average
BSN [37]           46.45   29.96   8.02   30.03
P-GCN [99]         48.26   33.16   3.27   31.11
BMN [35]           50.07   34.78   8.29   33.85
Ours               52.04   35.92   7.97   34.91
BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] together with the proposals from BMN, achieving an average mAP of 34.91%, which is better than both BSN and BMN.
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5, 0.75, and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ   0.2    0.5    0.75   0.5:0.95
LTT [83]             46.8   -      -      -
MRTSR [54]           73.5   32.1   2.7    7.3
MSAT [60]            66.6   36.4   7.9    14.4
ROAD [67]            73.5   46.3   15.0   20.4
ACT [26]             76.5   49.2   19.7   23.4
TAD [19]             -      59.9   -      -
PCSC [69]            84.3   61.0   23.0   27.8
Ours                 84.9   63.5   23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which 𝒬^Flow_n and 𝒬^RGB_n are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, in which case only the two proposal sets based on the RGB and Flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P^Flow_{n-1} and P^RGB_{n-2} (for the RGB-stream AFC module) and
Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage        0       1       2       3       4
AFC          44.75   46.92   48.14   48.37   48.22
w/o CS       44.21   44.96   45.47   45.48   45.37
w/o CG       43.55   45.95   46.71   46.65   46.67
w/o CS-MP    44.75   45.34   45.51   45.53   45.49
w/o CG-MP    44.75   46.32   46.97   46.85   46.79
w/o PC       42.14   43.21   45.14   45.74   45.77
P^RGB_{n-1} and P^Flow_{n-2} (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". Also, the discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.
In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those for our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are 𝒫^RGB_0 ∪ 𝒫^Flow_0 for "w/o CG" and 𝒬^RGB_0 ∪ 𝒬^Flow_0 for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) in the first few stages, the results of all methods generally improve
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison                  LoR (%)
1       𝒫^RGB_1 vs. S^Flow_0        67.1
2       𝒫^Flow_2 vs. 𝒫^RGB_1        79.2
3       𝒫^RGB_3 vs. 𝒫^Flow_2        89.5
4       𝒫^Flow_4 vs. 𝒫^RGB_3        97.1
when the number of stages increases, and the results become stable after 2 or 3 stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets 𝒫^RGB_1 at Stage 1 and 𝒫^Flow_2 at Stage 2 as an example. The LoR score at Stage 2 is the ratio of similar proposals in 𝒫^Flow_2 among all proposals in 𝒫^Flow_2. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals 𝒫^RGB_1 is higher than a pre-defined threshold of 0.7. The max-tIoU score for a proposal in the corresponding proposal set (say 𝒫^Flow_2) is calculated by comparing it with all proposals in the other proposal set (say 𝒫^RGB_1) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
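The LoR score defined above can be computed directly (a sketch; proposals are (start, end) pairs and the 0.7 threshold follows the definition in the text):

```python
# Illustrative sketch of the Large-over-Ratio (LoR) score: the fraction of
# proposals in the current set whose max-tIoU with the preceding set
# exceeds the threshold.

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current, preceding, thresh=0.7):
    similar = sum(1 for p in current
                  if max(tiou(p, q) for q in preceding) > thresh)
    return similar / len(current)
```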
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stages 1 and 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stages 2 and 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in 𝒫^Flow_4 and those in 𝒫^RGB_3 are very similar, even though they are gathered from different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal
action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging the eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
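The AR@AN computation described above can be sketched as follows (illustrative only; proposals are assumed to be sorted by confidence so the top AN can be taken by slicing):

```python
# Illustrative sketch of AR@AN: recall is measured at tIoU thresholds
# from 0.5 to 1.0 (step 0.05) using the top AN proposals, then averaged
# over the eleven thresholds.

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(proposals, ground_truths, an):
    top = proposals[:an]  # proposals assumed sorted by confidence
    thresholds = [0.5 + 0.05 * i for i in range(11)]
    recalls = []
    for t in thresholds:
        hit = sum(1 for g in ground_truths
                  if any(tiou(p, g) >= t for p in top))
        recalls.append(hit / len(ground_truths))
    return sum(recalls) / len(recalls)
```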
In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" and "AB" methods to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown as the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.
Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks onto the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and by our
PCG-TAL method at different numbers of stages. Although the anchor-based method successfully captures the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method takes advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment, and eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets: VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et
al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology
5.2.1 Overview
We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely, we have $V = \{V^{clip}_k\}_{k=1}^{K}$, where $V^{clip}_k$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.
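As a concrete illustration of this notation, a video with $KT$ frames can be split into $K$ non-overlapping clips of $T$ frames each. The sketch below uses illustrative frame counts and shapes, not values from the thesis:

```python
import numpy as np

def split_into_clips(video, T):
    """Split a (num_frames, H, W, 3) video into K = num_frames // T
    non-overlapping clips of T frames each, i.e. V = {V_k^clip}_{k=1}^K."""
    K = video.shape[0] // T
    return [video[k * T:(k + 1) * T] for k in range(K)]

video = np.zeros((12, 8, 8, 3))        # a toy video with K*T = 12 frames
clips = split_into_clips(video, T=4)   # K = 3 clips, each with T = 4 frames
```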
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes $b_t$ for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F^{clip}_k \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
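To make the shapes concrete, the sketch below flattens per-frame backbone features of size $C \times H \times W$ to $HW \times C$ and stacks the $T$ frames of a clip into a $T \times HW \times C$ clip feature. The channel and spatial sizes are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def build_clip_feature(frame_feats):
    """frame_feats: (T, C, H, W) per-frame backbone features
    -> clip feature of shape (T, H*W, C)."""
    T, C, H, W = frame_feats.shape
    return frame_feats.reshape(T, C, H * W).transpose(0, 2, 1)

frame_feats = np.random.randn(4, 1024, 7, 7)   # T=4 frames; C, H, W illustrative
clip_feat = build_clip_feature(frame_feats)    # shape (4, 49, 1024)
```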
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens, [CLS] and [SEP], before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.
Multi-modal Feature Learning. Given the visual input feature $F^{clip}_k$ and the textual input tokens $E$, we develop an improved ViLBERT module called ST-ViLBERT to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using the pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules for the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$. We then pass the combined visual feature to the textual branch, where it is used as the key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed into the multi-head attention block of the visual branch together with the textual input (i.e., the textual output from the last layer) to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the output from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F_{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
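The pooling, combination and decomposition steps of the STCD module can be sketched as follows. This is a simplified illustration with the attention and normalization blocks omitted and synthetic sizes; in the actual module, the combined feature passes through multi-head attention before being decomposed:

```python
import numpy as np

def stcd_combine(F):
    """F: (T, HW, C) input visual feature.
    Spatial avg pooling -> (T, C); temporal avg pooling -> (HW, C);
    concatenation -> combined visual feature of shape (T + HW, C)."""
    temporal_feat = F.mean(axis=1)
    spatial_feat = F.mean(axis=0)
    return np.concatenate([temporal_feat, spatial_feat], axis=0)

def stcd_decompose(G, T, HW):
    """G: (T + HW, C) text-guided feature (attention step not shown).
    Split back into temporal (T, C) and spatial (HW, C) parts, then
    replicate each to (T, HW, C) so they can be added to the input."""
    g_temporal, g_spatial = G[:T], G[T:]
    rep_t = np.repeat(g_temporal[:, None, :], HW, axis=1)
    rep_s = np.repeat(g_spatial[None, :, :], T, axis=0)
    return rep_t + rep_s

F = np.random.randn(4, 49, 256)       # T=4, HW=49, C=256 (illustrative)
G = stcd_combine(F)                   # (53, 256) combined visual feature
out = stcd_decompose(G, T=4, HW=49)   # (4, 49, 256), added to F and normalized in the model
```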
Spatial Localization. Given the text-guided visual feature $F_{tv}$, our SVG-net predicts the bounding box $b_t$ at each frame. We first reshape $F_{tv}$ to the size of $T \times H \times W \times C$. Taking the feature from each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar as in CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
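The decoding step described above can be sketched as follows, with a synthetic heatmap and size map; in the actual model these come from the two detection heads:

```python
import numpy as np

def decode_box(heatmap, size_map):
    """heatmap: (h, w) center probabilities; size_map: (h, w, 2) with the
    regressed (width, height) at each position. Picks the heatmap argmax
    as the box center and converts center + size to corner coordinates."""
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = size_map[cy, cx]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

heatmap = np.zeros((64, 64))
heatmap[20, 30] = 1.0                  # synthetic peak at center (x=30, y=20)
size_map = np.full((64, 64, 2), 10.0)  # synthetic 10x10 boxes everywhere
box = decode_box(heatmap, size_map)    # -> (25.0, 15.0, 35.0, 25.0)
```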
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending
frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F_{tv}$ to produce the global text-guided visual feature $F_{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token [CLS]) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that for each frame we can compute the initial relevance score between the corresponding visual feature from $F_{gtv}$ and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $\hat{o}_{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold, namely, the bounding boxes with smoothed relevance scores less than the threshold are removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t^0_s, t^0_e)$. The initial temporal boundaries $(t^0_s, t^0_e)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B_{init} = \{b_t\}_{t=t^0_s}^{t^0_e}$.
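The smoothing and thresholding steps above can be sketched as follows. The scores are synthetic; the window size 9 and threshold 0.3 follow the implementation details reported later, and a smaller window is used here only to keep the toy example short:

```python
import numpy as np

def median_filter(scores, win=9):
    """Median-filter a 1D score sequence with edge padding."""
    pad = win // 2
    padded = np.pad(scores, pad, mode='edge')
    return np.array([np.median(padded[i:i + win]) for i in range(len(scores))])

def longest_segment(scores, thresh=0.3):
    """Return (t_s, t_e) of the longest run of frames with score >= thresh."""
    best, start = None, None
    for t, s in enumerate(list(scores) + [-1.0]):  # sentinel closes any open run
        if s >= thresh and start is None:
            start = t
        elif s < thresh and start is not None:
            if best is None or (t - start) > (best[1] - best[0] + 1):
                best = (start, t - 1)
            start = None
    return best

scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.1])
seg = longest_segment(median_filter(scores, win=3), thresh=0.3)   # -> (2, 5)
```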
Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as $(\bar{x}, \bar{y})$, $w$, $h$ and $o$, respectively. When the frame contains a ground-truth bounding box, we have $o = 1$; otherwise, we have $o = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $A$ for each frame by using the Gaussian kernel $a_{xy} = \exp(-\frac{(x-\bar{x})^2+(y-\bar{y})^2}{2\sigma^2})$, where $a_{xy}$ is the value of $A \in \mathbb{R}^{8H \times 8W}$ at the spatial location $(x, y)$ and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$$L_{SVG} = \lambda_1 L_c(\hat{A}, A) + \lambda_2 L_{rel}(\hat{o}_{obj}, o) + \lambda_3 L_{size}(\hat{w}_{xy}, \hat{h}_{xy}, w, h), \quad (5.1)$$

where $\hat{A} \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $\hat{o}_{obj}$ is the predicted relevance score, and $\hat{w}_{xy}$ and $\hat{h}_{xy}$ are the predicted width and height of the bounding box centered at the position $(x, y)$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
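The ground-truth heatmap construction can be sketched as follows, with a fixed bandwidth for illustration; in the thesis, $\sigma$ adapts to the object size [28]:

```python
import numpy as np

def gaussian_heatmap(h, w, center, sigma=2.0):
    """Build an (h, w) heatmap a_xy = exp(-((x - cx)^2 + (y - cy)^2) / (2 sigma^2))
    centered at the ground-truth box center (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

A = gaussian_heatmap(64, 64, center=(30, 20))   # peak value 1.0 at (x=30, y=20)
```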
5.2.3 Temporal Boundary Refinement
The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect the temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.
As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B_{init}$) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.
$t^1_s$ and $t^1_e$. The frame-based branch then continues to refine the boundaries $t^1_s$ and $t^1_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., $\hat{t}^1_s$ and $\hat{t}^1_e$) can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.
Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before and after the starting and the ending frames by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4, t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features along the temporal dimension from the pooled video feature $F_{pool}$ at the 20 sampled frames, where the pooled video feature $F_{pool}$ is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature $F$, together with the textual input tokens $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $t^1_s$ and $t^1_e$.
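The input construction for this branch can be sketched as follows. The pooled features are synthetic; rounding the sampled indices and clipping them to the video length are boundary-handling details assumed here for the sketch:

```python
import numpy as np

def anchor_feature(F_pool, t_s, t_e, num_samples=20):
    """F_pool: (num_frames, C) pooled video feature (one vector per frame).
    Extend the anchor [t_s, t_e] by l/4 on each side, uniformly sample
    num_samples frames, and stack their pooled features."""
    l = t_e - t_s
    idx = np.linspace(t_s - l / 4.0, t_e + l / 4.0, num_samples)
    idx = np.clip(idx.round().astype(int), 0, len(F_pool) - 1)
    return F_pool[idx]

F_pool = np.random.randn(100, 256)                  # 100 frames, C = 256 (illustrative)
F_anchor = anchor_feature(F_pool, t_s=40, t_e=60)   # (20, 256), sampled from [35, 65]
```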
Frame-based branch. For the frame-based branch, we extract the feature of each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position $t^1_s$ generated from the previous anchor-based branch, we define the starting duration as $D_s = [t^1_s - D/2 + 1, t^1_s + D/2]$, where $D$ is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F_{pool}$. We then take the frame-level feature $F_s$ and the textual input tokens $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position $\hat{t}^1_s$. Accordingly, we can produce the ending frame position $\hat{t}^1_e$. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $\hat{t}^1_s$ and $\hat{t}^1_e$ are generated, they can be used as the input for the anchor-based branch at the next stage.
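The local refinement step can be sketched as follows, with synthetic per-frame probabilities standing in for the 1D-convolution outputs:

```python
import numpy as np

def refine_boundary(frame_probs, t, D):
    """Within the duration [t - D/2 + 1, t + D/2] (clipped to the video),
    return the frame with the highest starting/ending probability."""
    lo = max(t - D // 2 + 1, 0)
    hi = min(t + D // 2, len(frame_probs) - 1)
    return lo + int(np.argmax(frame_probs[lo:hi + 1]))

probs = np.zeros(50)
probs[23] = 0.9                                  # synthetic peak at the true boundary
t_refined = refine_boundary(probs, t=20, D=10)   # -> 23
```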
Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$$L_{TBR} = \lambda_4 L_{anc}(\Delta\hat{t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e), \quad (5.2)$$

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and the ending frames, respectively. In $L_{anc}$, $\Delta\hat{t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.
5.3 Experiment
5.3.1 Experiment Setup
Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.
- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. The dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold $\theta_p$ and the temporal length $T$ of each clip are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $\text{vIoU} = \frac{1}{|S_U|}\sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
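The vIoU computation can be sketched as follows, with toy tubes represented as dicts mapping frame index to a box; this illustrates the metric definition, not the official evaluation code:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def v_iou(pred_tube, gt_tube):
    """vIoU = (1 / |S_U|) * sum over t in S_I of r_t."""
    s_i = set(pred_tube) & set(gt_tube)   # frames covered by both tubes
    s_u = set(pred_tube) | set(gt_tube)   # frames covered by either tube
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in s_i) / len(s_u)

pred = {t: (0, 0, 10, 10) for t in range(0, 10)}   # detected tube, frames 0-9
gt = {t: (0, 0, 10, 10) for t in range(5, 15)}     # ground-truth tube, frames 5-14
score = v_iou(pred, gt)   # 5 perfectly matched frames / 15 union frames = 1/3
```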
Baseline Methods
- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.
- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.
- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.
- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods
We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we achieve more than 2% further improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study
In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.
Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much better than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.
It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

Method                   | Declarative Sentence Grounding      | Interrogative Sentence Grounding
                         | m_vIoU(%) vIoU@0.3(%) vIoU@0.5(%)   | m_vIoU(%) vIoU@0.3(%) vIoU@0.5(%)
STGRN [102]*             | 19.75     25.77       14.60         | 18.32     21.10       12.83
Vanilla (ours)           | 21.49     27.69       15.77         | 20.22     23.17       13.82
SVG-net (ours)           | 22.87     28.31       16.31         | 21.57     24.09       14.33
SVG-net + TBR-net (ours) | 25.21     30.99       18.21         | 23.67     26.26       16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                   | m_vIoU | vIoU@0.3 | vIoU@0.5
STGVT [76]*              | 18.15  | 26.81    | 9.48
Vanilla (ours)           | 17.94  | 25.59    | 8.75
SVG-net (ours)           | 19.45  | 27.78    | 10.12
SVG-net + TBR-net (ours) | 22.41  | 30.31    | 12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
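The core idea of preserving spatial information can be illustrated by how visual features enter the transformer: instead of pooling each frame's feature map into a single vector, every spatial location is kept as a separate token, so the cross-modal layers can still reason about where things are. A minimal sketch under assumed (T, C, H, W) feature shapes (names are illustrative, not the thesis code):

```python
import numpy as np

def to_spatial_tokens(feat):
    """Flatten a (T, C, H, W) video feature map into (T*H*W, C) tokens,
    so each spatial location stays a separate transformer token."""
    T, C, H, W = feat.shape
    return feat.transpose(0, 2, 3, 1).reshape(T * H * W, C)

def from_spatial_tokens(tokens, T, H, W):
    """Invert to_spatial_tokens: recover the (T, C, H, W) feature layout,
    e.g. to predict a dense per-location bounding-box map afterwards."""
    C = tokens.shape[1]
    return tokens.reshape(T, H, W, C).transpose(0, 3, 1, 2)
```

The round trip is lossless, which is the point: nothing spatial is discarded before the cross-modal attention, unlike pooling the frame into one vector.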
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.
From the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) Also, the results show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3. (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

Methods           | Stages | Declarative Sentence Grounding | Interrogative Sentence Grounding
                  |        | m_vIoU  vIoU@0.3  vIoU@0.5     | m_vIoU  vIoU@0.3  vIoU@0.5
SVG-net           | -      | 22.87   28.31     16.31        | 21.57   24.09     14.33
SVG-net + IFB-net | 1      | 23.41   28.84     16.69        | 22.12   24.48     14.57
                  | 2      | 23.82   28.97     16.85        | 22.55   24.69     14.71
                  | 3      | 23.77   28.91     16.87        | 22.61   24.72     14.79
SVG-net + IAB-net | 1      | 23.27   29.57     16.39        | 21.78   24.93     14.31
                  | 2      | 23.74   29.98     16.52        | 22.21   25.42     14.59
                  | 3      | 23.85   30.15     16.57        | 22.43   25.71     14.47
SVG-net + TBR-net | 1      | 24.32   29.92     17.22        | 22.69   25.37     15.29
                  | 2      | 24.95   30.75     17.02        | 23.34   26.03     15.83
                  | 3      | 25.21   30.99     18.21        | 23.67   26.26     16.01
Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary
In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, to produce spatio-temporal object tubes for a given query sentence; it consists of the SVG-net and the TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continually grow in the future. Learning techniques for video localization will therefore attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our works and discuss potential future directions of video localization.
6.1 Conclusions
The contributions of this thesis are summarized as follows:
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level.
The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
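The cross-granularity idea behind the AFC module can be illustrated with a toy sketch: per-frame actionness scores are grouped into frame-based segment proposals, which are then merged with anchor-based proposals so that each granularity contributes segments the other misses. All thresholds and helper names below are illustrative assumptions, not the learned module:

```python
def frame_based_proposals(scores, threshold=0.5):
    """Group consecutive frames whose actionness score passes the threshold
    into (start, end) segment proposals (frame-level granularity)."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1))
    return proposals

def temporal_iou(a, b):
    """IoU between two inclusive (start, end) frame segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def merge_proposals(anchor_props, frame_props, iou_thr=0.7):
    """Union the two granularities, dropping frame-based proposals that
    near-duplicate an existing anchor-based proposal."""
    merged = list(anchor_props)
    for fp in frame_props:
        if all(temporal_iou(fp, ap) < iou_thr for ap in anchor_props):
            merged.append(fp)
    return merged
```

Frame-based grouping recovers short or oddly sized actions that a fixed anchor set misses, while anchor proposals remain robust to noisy per-frame scores; the merge keeps both.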
• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, to produce spatio-temporal object tubes for a given query sentence; it consists of the SVG-net and the TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work
A potential future direction for video localization could be unifying spatial video localization and temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and the temporal action localization results with two separate networks. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.
Bibliography
[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell and B. Russell. "Localizing moments in video with natural language". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.
[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, Madison, Wisconsin, USA. ACM, 1998, pp. 92–100.
[3] F. Caba Heilbron, V. Escorcia, B. Ghanem and J. Carlos Niebles. "Activitynet: A large-scale video benchmark for human activity understanding". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.
[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the kinetics dataset". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng and R. Sukthankar. "Rethinking the faster r-cnn architecture for temporal action localization". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.
[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.
[7] Z. Chen, L. Ma, W. Luo and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.
[8] G. Chéron, A. Osokin, I. Laptev and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.
[9] X. Dai, B. Singh, G. Zhang, L. S. Davis and Y. Qiu Chen. "Temporal context network for activity localization in videos". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.
[10] A. Dave, O. Russakovsky and D. Ramanan. "Predictive-corrective networks for action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei. "Imagenet: A large-scale hierarchical image database". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
[12] C. Feichtenhofer, A. Pinz and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.
[13] J. Gao, K. Chen and R. Nevatia. "Ctap: Complementary temporal action proposal generation". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
[14] J. Gao, C. Sun, Z. Yang and R. Nevatia. "Tall: Temporal activity localization via language query". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.
[15] J. Gao, Z. Yang, K. Chen, C. Sun and R. Nevatia. "Turn tap: Temporal unit regression network for temporal action proposals". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.
[16] R. Girshick. "Fast R-CNN". In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.
[17] R. Girshick, J. Donahue, T. Darrell and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[18] G. Gkioxari and J. Malik. "Finding action tubes". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.
[20] K. He, X. Zhang, S. Ren and J. Sun. "Deep residual learning for image recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[21] R. Hou, C. Chen and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In The IEEE International Conference on Computer Vision (ICCV). 2017.
[22] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger. "Densely connected convolutional networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy and T. Brox. "Flownet 2.0: Evolution of optical flow estimation with deep networks". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.
[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid and M. J. Black. "Towards understanding action recognition". In Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.
[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.
[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.
[27] A. Krizhevsky, I. Sutskever and G. E. Hinton. "Imagenet classification with deep convolutional neural networks". In Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[28] H. Law and J. Deng. "Cornernet: Detecting objects as paired keypoints". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.
[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.
[30] J. Lei, L. Yu, T. L. Berg and M. Bansal. "Tvqa+: Spatio-temporal grounding for video question answering". In arXiv preprint arXiv:1904.11574 (2019).
[31] D. Li, Z. Qiu, Q. Dai, T. Yao and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.
[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In AAAI. 2020, pp. 11336–11344.
[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In Arxiv. 2019.
[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.
[35] T. Lin, X. Liu, X. Li, E. Ding and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.
[36] T. Lin, X. Zhao and Z. Shou. "Single shot temporal action detection". In Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.
[37] T. Lin, X. Zhao, H. Su, C. Wang and M. Yang. "Bsn: Boundary sensitive network for temporal action proposal generation". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie. "Feature Pyramid Networks for Object Detection". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollár. "Focal loss for dense object detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick. "Microsoft coco: Common objects in context". In European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang and M.-H. Yang. "Referring expression generation and comprehension via attributes". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg. "Ssd: Single shot multibox detector". In European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh and S. Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal and S. Sclaroff. "Learning activity progression in lstms for activity detection and early detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu and X. Wang. "Chained cascade network for object detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao and T. Mei. "Learning spatio-temporal representation with pseudo-3d residual networks". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick and A. Farhadi. "You only look once: Unified, real-time object detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick and J. Sun. "Faster r-cnn: Towards real-time object detection with region proposal networks". In Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.
[61] P. Sharma, N. Ding, S. Goodman and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa and S.-F. Chang. "Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage cnns". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid. "Videobert: A joint model for video and language representation learning". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar and C. Schmid. "Actor-centric Relation Network". In The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi and W. Ouyang. "Fishnet: A versatile backbone for image, region, and pixel level prediction". In Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich. "Going deeper with convolutions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. "Learning spatiotemporal features with 3d convolutional networks". In Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin. "Attention is all you need". In Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin and L. Van Gool. "Untrimmednets for weakly supervised action recognition and detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui and C. Schmid. "Learning to track for spatio-temporal action localization". In Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das and K. Saenko. "R-c3d: Region convolutional 3d network for temporal activity detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku and T. Harada. "Spatio-temporal person retrieval via natural language queries". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li and Y. Yu. "Dynamic graph attention for referring expression comprehension". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu and J. Luo. "A fast and accurate one-stage approach to visual grounding". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang and Y. Tian. "Discovering spatio-temporal action tubes". In Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal and T. L. Berg. "Mattnet: Modular attention network for referring expression comprehension". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu and J. Deng. "Temporal action localization by structured maximal sums". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang and D. Lin. "Temporal action detection with structured segment networks". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang and P. Krähenbühl. "Objects as Points". In arXiv preprint arXiv:1904.07850. 2019.
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
Contents

- Abstract
- Declaration
- Acknowledgements
- List of Publications
- List of Figures
- List of Tables
- Introduction
  - Spatial-temporal Action Localization
  - Temporal Action Localization
  - Spatio-temporal Visual Grounding
  - Thesis Outline
- Literature Review
  - Background
    - Image Classification
    - Object Detection
    - Action Recognition
  - Spatio-temporal Action Localization
  - Temporal Action Localization
  - Spatio-temporal Visual Grounding
  - Summary
- Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
  - Background
    - Spatial temporal localization methods
    - Two-stream R-CNN
  - Action Detection Model in Spatial Domain
    - PCSC Overview
    - Cross-stream Region Proposal Cooperation
    - Cross-stream Feature Cooperation
    - Training Details
  - Temporal Boundary Refinement for Action Tubes
    - Actionness Detector (AD)
    - Action Segment Detector (ASD)
    - Temporal PCSC (T-PCSC)
  - Experimental results
    - Experimental Setup
    - Comparison with the State-of-the-art methods
      - Results on the UCF-101-24 Dataset
      - Results on the J-HMDB Dataset
    - Ablation Study
    - Analysis of Our PCSC at Different Stages
    - Runtime Analysis
    - Qualitative Results
  - Summary
- PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
  - Background
    - Action Recognition
    - Spatio-temporal Action Localization
    - Temporal Action Localization
  - Our Approach
    - Overall Framework
    - Progressive Cross-Granularity Cooperation
    - Anchor-frame Cooperation Module
    - Training Details
    - Discussions
  - Experiment
    - Datasets and Setup
    - Comparison with the State-of-the-art Methods
    - Ablation study
    - Qualitative Results
  - Summary
- STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
  - Background
    - Vision-language Modelling
    - Visual Grounding in Images/Videos
  - Methodology
    - Overview
    - Spatial Visual Grounding
    - Temporal Boundary Refinement
  - Experiment
    - Experiment Setup
    - Comparison with the State-of-the-art Methods
    - Ablation Study
  - Summary
- Conclusions and Future Work
  - Conclusions
  - Future Work
- Bibliography
ii
Acknowledgements
First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research skills, including how to write high-quality research papers, which have helped me become an independent researcher.
I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborated works. Through the useful discussions with them, I worked out many difficult problems and am also encouraged to explore the unknown in the future.
Last but not least, I sincerely thank my parents and my wife from the bottom of my heart for everything.
Rui SU
University of Sydney, June 17, 2021
List of Publications
JOURNALS
[1] Rui Su, Dong Xu, Luping Zhou and Wanli Ouyang, "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[2] Rui Su, Dong Xu, Lu Sheng and Wanli Ouyang, "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization", IEEE Transactions on Image Processing, 2020.
CONFERENCES
[1] Rui Su, Wanli Ouyang, Luping Zhou and Dong Xu, "Improving action localization by progressive cross-stream cooperation", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.
Contents
Abstract i
Declaration i
Acknowledgements ii
List of Publications iii
List of Figures ix
List of Tables xv

1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11

2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102

6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104

Bibliography 107
List of Figures
1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6
3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33
3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39
3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43
3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57
3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58
4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e. RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments. 66
4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset 80
4.3 Qualitative comparisons of the results from two baseline methods (i.e. the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81
5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86
5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89
5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93
List of Tables
3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5 48
3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector 49
3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5 50
3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds 51
3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset 52
3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector 52
3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset 52
3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset 54
3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset 55
3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset 55
3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages 55
3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages 55
4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds 74
4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used 75
4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied 75
4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds 76
4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset 77
4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset 78
5.1 Results of different methods on the VidSTG dataset. The marked results are quoted from the original work 99
5.2 Results of different methods on the HC-STVG dataset. The marked results are quoted from the original work 100
5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset 101
Chapter 1
Introduction
With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems to help users analyze the contents of videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.
In this thesis, we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e. every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of the action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects in the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.
1.1 Spatial-temporal Action Localization
Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e. sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.
Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, the two streams can help each other by providing region proposals to each other.
For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g. toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as region proposals to improve the action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.
On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely challenging to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
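Such data-association linking can be sketched as a greedy, frame-by-frame rule: extend the tube with the next-frame detection that maximizes the sum of its class score and its spatial overlap (IoU) with the previous box. The sketch below is an illustrative simplification rather than the exact linking algorithm of the cited works; all function names are ours.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tube(per_frame_dets):
    """Greedily link per-frame detections [(box, score), ...] into one tube.

    Each tube element is the (box, score) pair chosen for that frame; the
    linking score is class confidence plus overlap with the previous box.
    """
    tube = [max(per_frame_dets[0], key=lambda d: d[1])]
    for dets in per_frame_dets[1:]:
        prev_box = tube[-1][0]
        tube.append(max(dets, key=lambda d: d[1] + iou(prev_box, d[0])))
    return tube
```

Note how a spatially consistent but lower-scored detection can beat a higher-scored, far-away one, which is exactly the behavior that makes tubes spatially smooth but leaves the temporal extent unresolved.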
Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve the action localization results at the frame level.
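As a rough illustration of such feature-level message passing, the receiving stream can keep its own representation and add a learned transformation of the other stream's features (a residual-style fusion). The weight matrix and names below are hypothetical stand-ins for the learned message function, not the thesis' actual module:

```python
def message_pass(rgb_feat, flow_feat, w_msg):
    """Improve RGB features with a message computed from flow features.

    rgb_feat, flow_feat: feature vectors (lists of floats) for one region;
    w_msg: weight matrix (list of rows) standing in for the learned
    message function. The receiving stream keeps its own features and
    adds the transformed message (residual-style fusion).
    """
    msg = [sum(w * f for w, f in zip(row, flow_feat)) for row in w_msg]
    return [r + m for r, m in zip(rgb_feat, msg)]
```

Swapping the argument roles gives the reverse direction (RGB helping flow), which is how alternating stages of the framework would reuse one module form.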
Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class and can therefore learn critical features that are good at discriminating the subtle changes across action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues, detect accurate temporal boundaries, and further improve the spatio-temporal localization results at the video level.
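One simple way to turn per-frame actionness scores into a refined temporal extent is to keep the longest run of frames whose class-specific scores stay above a threshold. This is only a schematic stand-in for the actionness-based refinement, with an assumed threshold of 0.5:

```python
def trim_tube(actionness, thr=0.5):
    """Return (start, end) frame indices of the longest run of frames
    whose class-specific actionness score exceeds thr; None if no
    frame passes. A trailing sentinel closes the final run."""
    best, cur_start = None, None
    for i, s in enumerate(actionness + [0.0]):
        if s > thr and cur_start is None:
            cur_start = i
        elif s <= thr and cur_start is not None:
            if best is None or i - cur_start > best[1] - best[0]:
                best = (cur_start, i - 1)
            cur_start = None
    return best
```

In the thesis the boundaries are decided by learned detectors rather than a fixed threshold, but the sketch shows why sharp actionness transitions translate directly into sharp temporal boundaries.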
Our contributions are briefly summarized as follows:
• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.
• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector, which applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.
• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.
1.2 Temporal Action Localization
Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
Similar to the object detection methods like [58], the prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
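The boundary-regression step of such a two-stage pipeline is commonly parameterized, as in R-CNN-style detectors transferred to 1D time, by a centre offset and a log-scale length change. A sketch under that assumption (the exact parameterization used by the cited methods may differ):

```python
import math

def refine_segment(proposal, offsets):
    """Second-stage boundary regression for a 1D temporal proposal.

    proposal: (start, end) in seconds; offsets: (dc, dl), where dc shifts
    the centre as a fraction of the length and dl rescales the length
    on a log scale, mirroring R-CNN box regression in one dimension.
    """
    start, end = proposal
    centre, length = (start + end) / 2.0, end - start
    dc, dl = offsets
    new_centre = centre + dc * length
    new_length = length * math.exp(dl)
    return (new_centre - new_length / 2.0, new_centre + new_length / 2.0)
```

With zero offsets the proposal is returned unchanged, so the regressor only has to learn small corrections around reasonably placed proposals.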
For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e. the anchor-based and frame-based methods) are often complementary to each other.
Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.
Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detects accurate boundaries but suffers from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
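The two proposal-generation granularities can be caricatured in a few lines: the frame-based scheme groups consecutive high-actionness frames, while the anchor-based scheme densely enumerates pre-defined temporal anchors (whose boundaries would then be regressed by a learned head, omitted here). Both functions are illustrative simplifications with assumed defaults:

```python
def frame_based_proposals(actionness, thr=0.5):
    """Frame-based scheme: group consecutive frames whose actionness
    exceeds thr into candidate segments (start_frame, end_frame)."""
    props, start = [], None
    for i, s in enumerate(actionness + [0.0]):  # sentinel closes last run
        if s > thr and start is None:
            start = i
        elif s <= thr and start is not None:
            props.append((start, i - 1))
            start = None
    return props

def anchor_based_proposals(num_frames, scales=(8, 16)):
    """Anchor-based scheme: slide pre-defined anchor lengths with 50%
    stride; boundary regression on these anchors is omitted."""
    return [(t, min(t + s - 1, num_frames - 1))
            for s in scales for t in range(0, num_frames, s // 2)]
```

The caricature already exhibits the complementarity discussed above: the frame-based output hugs the actionness signal (accurate boundaries, but a run missed entirely if scores dip), whereas the anchor-based output covers the timeline densely (high recall, coarse boundaries).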
For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e. RGB/flow) to improve the features from another stream (i.e. flow/RGB) for the anchor-based method.
To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework, named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:
(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.
(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.
(3) Comprehensive experiments on three benchmark datasets, namely THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
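A minimal version of the proposal-level combination is to pool the scored segments from both granularities and suppress near-duplicates with temporal non-maximum suppression (NMS); the AFC module's actual combination strategy is richer than this sketch, and the 0.7 threshold is an assumption:

```python
def t_iou(a, b):
    """Temporal IoU of two segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def merge_proposals(anchor_props, frame_props, iou_thr=0.7):
    """Union the scored segments (start, end, score) from both
    granularities, then apply temporal NMS: keep proposals in
    descending score order unless they heavily overlap a kept one."""
    pool = sorted(anchor_props + frame_props, key=lambda p: p[2], reverse=True)
    kept = []
    for p in pool:
        if all(t_iou(p[:2], k[:2]) < iou_thr for k in kept):
            kept.append(p)
    return kept
```

Because the pool is the union of both schemes, a segment found only by the frame-based branch survives even when the anchor-based branch misses it, which is the recall benefit the cross-granularity combination is after.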
1.3 Spatio-temporal Visual Grounding
Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g. image captioning and visual grounding) have attracted increasing attention from researchers.
Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e. a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons often perform similar actions within one scene.
Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance relies heavily on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.
Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate, because these methods regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect temporal segments. Intuitively, it is desirable to exploit the complementarity of the two lines of methods in order to improve the temporal localization performance.
Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].
Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47] named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame-based (resp. anchor-based) branch as the input for the anchor-based (resp. frame-based) branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and it outperforms the state-of-the-art methods.
Our contributions can be summarized as follows:
(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.
(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.
(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.
1.4 Thesis Outline
The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.
Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.
Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at both the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].
Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations, in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.
Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential future directions for video localization.
Chapter 2
Literature Review
In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.
2.1 Background
2.1.1 Image Classification
Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify images into categories by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best result (26.2%). This work sparked research interest in using deep learning methods to solve computer vision problems.
Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters decreases, without shrinking the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
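The receptive-field claim above can be verified with a few lines of arithmetic. The following is an illustrative sketch (the function below is our own, not code from [65]); it applies the standard recurrence r_out = r_in + (k - 1) * j, where j is the accumulated stride:

```python
def effective_receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of a stack of conv layers
    (stride 1 unless given), via r_out = r_in + (k - 1) * jump."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 layers cover the same region as one 5x5 layer,
# and three stacked 3x3 layers match one 7x7 layer:
assert effective_receptive_field([3, 3]) == effective_receptive_field([5]) == 5
assert effective_receptive_field([3, 3, 3]) == effective_receptive_field([7]) == 7

# Parameter comparison for C input/output channels (ignoring biases):
C = 64
params_two_3x3 = 2 * (3 * 3 * C * C)
params_one_5x5 = 5 * 5 * C * C
assert params_two_3x3 < params_one_5x5
```

The parameter comparison at the end mirrors the argument in the text: for the same receptive field, the stacked small kernels use fewer weights while adding an extra nonlinearity.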
Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which feeds the output of the previous inception module into a set of parallel convolutional layers with different kernel sizes, followed by pooling layers; the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply stacking more convolutional layers to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the gradients can vanish or explode during back-propagation, which results in unstable updates during the training process and leads to an unstable network. In order to address this issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without accumulation. By using the shortcut connection architecture, convolutional neural networks can be built with extremely large depth to improve the image classification performance.
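The shortcut connection can be sketched in a few lines. The NumPy toy below is our own simplified fully-connected version, not the exact residual block of [20]; it shows the key design y = ReLU(F(x) + x): when the residual branch F is small, the block behaves almost like an identity mapping, which is what makes very deep stacks easy to optimize.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal fully-connected residual block: y = ReLU(F(x) + x).
    The identity shortcut lets gradients flow directly to shallow layers."""
    out = relu(x @ w1)      # first transformation
    out = out @ w2          # second transformation (no activation yet)
    return relu(out + x)    # shortcut connection: add the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # batch of 4 feature vectors
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero residual weights
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)
assert y.shape == x.shape
# With near-zero weights the block approximates the identity mapping:
assert np.allclose(y, relu(x), atol=0.05)
```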
Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
2.1.2 Object Detection
The object detection task aims to detect the locations and categories of the objects appearing in a given image. With the promising image classification performance recently achieved by deep convolutional neural networks, researchers have investigated how to use deep learning methods to solve the object detection problem.
Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference times.
To make the convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed the entire image into the network only once to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that these feature maps can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference times are largely reduced.
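The ROI pooling operation described above can be sketched as follows. This is a simplified single-channel version of the idea (it ignores the spatial-scale mapping between image and feature-map coordinates that Fast RCNN also performs): the region is split into a fixed grid of bins and each bin is max-pooled, so every proposal yields a feature map of the same size.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool the region `roi = (x1, y1, x2, y2)` of a 2-D feature map
    into a fixed `out_size x out_size` grid, as in ROI pooling."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Split the region into roughly equal bins and take the max of each.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fmap = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 feature map
out = roi_pool(fmap, roi=(0, 0, 4, 4), out_size=2)
assert out.shape == (2, 2)
# The max of each 2x2 bin of the 4x4 region:
assert out.tolist() == [[7.0, 9.0], [19.0, 21.0]]
```

Because the output size is fixed regardless of the proposal size, proposals of any shape can share the same fully connected detection head.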
Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image is fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors at each location of the generated feature map is then used by a separate network to generate region proposals. The final detection results are obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
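The anchor layout used by the RPN can be sketched as below (the stride, scales and ratios are illustrative values, not those of [58]): one box per (scale, ratio) pair is centered at every feature-map location and mapped back to image coordinates via the stride.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Place one anchor per (scale, ratio) pair at every feature-map
    location, mapped back to image coordinates via the stride.
    Returns rows of (x_center, y_center, width, height)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)   # width/height chosen so that
                    h = s / np.sqrt(r)   # the area stays close to s * s
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

a = generate_anchors(feat_h=4, feat_w=4)
# 4*4 locations x 2 scales x 3 ratios = 96 anchors
assert a.shape == (96, 4)
```

The RPN then classifies each anchor as object/background and regresses offsets to it, exactly as the detection head does for proposals.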
Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S×S grid, where each location in the grid contains several pre-defined anchors. For each anchor, the network predicts the class probabilities and the offsets to the anchor. YOLO is faster than Faster RCNN as it does not require the time-consuming region feature extraction based on the ROI pooling operation.
Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.
All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which instead predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
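Decoding such a heat map into detections amounts to finding local peaks and reading the size maps at those positions. The toy sketch below is our own simplification (the real CenterNet [104] uses max-pooling-based peak extraction and also predicts sub-pixel center offsets):

```python
import numpy as np

def decode_centers(heatmap, sizes, threshold=0.5):
    """Read detections off a center-point heatmap: every 3x3 local peak
    above `threshold` becomes one box, with width/height taken from the
    per-pixel size maps. Returns (x, y, w, h, score) tuples."""
    detections = []
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    for y in range(h):
        for x in range(w):
            score = heatmap[y, x]
            window = padded[y:y + 3, x:x + 3]   # 3x3 neighborhood
            if score >= threshold and score == window.max():
                bw, bh = sizes[y, x]
                detections.append((x, y, bw, bh, score))
    return detections

heat = np.zeros((5, 5))
heat[1, 2] = 0.9        # one confident center point
heat[3, 3] = 0.3        # below the score threshold -> discarded
sizes = np.full((5, 5, 2), 10.0)
dets = decode_centers(heat, sizes)
assert dets == [(2, 1, 10.0, 10.0, 0.9)]
```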
2.1.3 Action Recognition
Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames from the given videos as input and outputted the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
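The simplest of these combination strategies, late fusion of the per-class scores of the two streams, can be sketched as below (the 0.6 flow weight is a hypothetical choice for illustration, not a value from [64] or [12]):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def late_fusion(rgb_logits, flow_logits, w_flow=0.6):
    """Two-stream late fusion: a weighted average of the per-class
    probabilities from the spatial (RGB) and temporal (flow) streams."""
    return (1 - w_flow) * softmax(rgb_logits) + w_flow * softmax(flow_logits)

rgb = np.array([2.0, 1.0, 0.1])    # appearance stream slightly prefers class 0
flow = np.array([0.5, 2.5, 0.1])   # motion stream strongly prefers class 1
fused = late_fusion(rgb, flow)
assert fused.argmax() == 1         # the stronger motion evidence wins here
assert np.isclose(fused.sum(), 1.0)
```

The feature-level convolutional fusion found best in [12] replaces this score averaging with a learned combination of the two streams' feature maps.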
Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. Therefore, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and then a consensus function was applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared with the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics, and demonstrated that this pre-training can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
2.2 Spatio-temporal Action Localization
Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.
Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame, and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.
Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme into the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be incorporated. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.
Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated in the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by enforcing label consistency.
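The linking step can be illustrated with a greedy sketch: at each frame, pick the detection that maximizes the sum of its class score and its spatial overlap (IoU) with the last box in the tube. This is a simplified stand-in for the dynamic-programming/Viterbi linking actually used in [60] and [54]:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def link_detections(per_frame_dets):
    """Greedily link per-frame detections (box, score) into one action tube
    by always picking the next box maximizing score + spatial overlap."""
    tube = [max(per_frame_dets[0], key=lambda d: d[1])]
    for dets in per_frame_dets[1:]:
        prev_box = tube[-1][0]
        tube.append(max(dets, key=lambda d: d[1] + iou(prev_box, d[0])))
    return tube

frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 1, 11, 11), 0.6), ((50, 50, 60, 60), 0.7)],
]
tube = link_detections(frames)
# The lower-scoring box wins in frame 2 because it overlaps the tube:
assert [d[0] for d in tube] == [(0, 0, 10, 10), (1, 1, 11, 11)]
```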
As computing optical flow from consecutive RGB images is very time-consuming, it is hard for the two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN frameworks used in [54] and [60], in order to avoid the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.
The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to exploit the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes that form a tubelet. Specifically, a convolutional neural network is used to extract features for each of these RGB frames, which are stacked to form a representation of the whole segment. The representation is then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than action tubes linked from frame-level detections, because they are produced by leveraging the temporal continuity of videos.
A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors over iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.
2.3 Temporal Action Localization
Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of the untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions of each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed into a C3D-based classification network to predict labels. Then they used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as the inputs of this localization model, and proposed a new loss function to reduce the classification error.
Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features from the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting and ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.
Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consists of two parts. In the first part, 3D convolutional layers are used to reduce the spatial and temporal resolutions simultaneously. In the second part, 2D convolutional layers are applied in the spatial domain to reduce the resolution, while deconvolutional layers are employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so that the proposed framework is able to model the spatio-temporal information and predict an action label for each frame.
Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos are first divided into snippets, and an actionness classifier is used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores are grouped to form proposals, which are used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared with other state-of-the-art methods.
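The core grouping idea shared by these frame/snippet-based methods, turning a sequence of actionness scores into temporal proposals, can be sketched as follows (a bare-bones version; TAG [103] additionally tolerates short low-score gaps and ranks proposals by completeness, which is omitted here):

```python
def group_actionness(scores, threshold=0.5):
    """Group consecutive snippets whose actionness score exceeds
    `threshold` into temporal proposals, returned as (start, end)
    snippet indices with `end` exclusive."""
    proposals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # a proposal begins
        elif s < threshold and start is not None:
            proposals.append((start, i))   # a proposal ends
            start = None
    if start is not None:                  # proposal runs to the video end
        proposals.append((start, len(scores)))
    return proposals

scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9, 0.3]
assert group_actionness(scores) == [(1, 4), (6, 8)]
```

This also makes the failure mode discussed next concrete: a single wrongly low score in the middle of an action splits one ground-truth instance into two short proposals.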
In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with more temporal detail. However, these methods tend to wrongly predict low actionness scores for frames within the action instances, which results in low recall rates.
Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations of the anchors and predict their action classes.
Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.
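The anchor-based side of these methods starts from a dense set of multi-scale candidate segments, the one-dimensional analogue of the spatial anchors in Section 2.1.2. A minimal sketch (the scales below are illustrative, not the exact anchor configuration of [36]):

```python
def temporal_anchors(num_steps, scales=(2, 4, 8)):
    """Generate multi-scale temporal segment anchors (start, end), one per
    scale centered at every time step, clipped to the video length."""
    anchors = []
    for t in range(num_steps):
        for s in scales:
            start = max(0.0, t + 0.5 - s / 2)
            end = min(float(num_steps), t + 0.5 + s / 2)
            anchors.append((start, end))
    return anchors

a = temporal_anchors(num_steps=16)
assert len(a) == 16 * 3
# Anchors at the video borders are clipped instead of extending outside:
assert a[0] == (0.0, 1.5)   # scale-2 anchor centered at t = 0.5
```

A boundary regressor then predicts offsets to each anchor's start and end, and an action classifier scores it, as described above.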
Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.
Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that the extreme variation of action durations can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.
In the second category, as the proposals are generated from pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower temporal boundary accuracy.
Neither line of the aforementioned works takes advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, as in CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals, as in [46], but without fully exploring the complementarity in a more systematic way.
2.4 Spatio-temporal Visual Grounding
Spatio-temporal visual grounding is a newly proposed vision-and-language task. Its goal is to localize the target objects in the given videos by using textual queries that describe the target objects.
Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, from which spatio-temporal tubes were generated by a spatial localizer and a temporal localizer.
In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.
Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationship within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.
One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors
to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.
2.5 Summary
In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduced the existing methods for the image classification problem. Then we reviewed the one-stage, two-stage, and anchor-free convolutional neural network based object detection frameworks. We then summarized the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also reviewed the existing spatio-temporal action localization methods. Additionally, we reviewed the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage their complementary information were also discussed. Finally, we gave a review of the new vision-and-language task, spatio-temporal visual grounding.
In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.
Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream, two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows
the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.
In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.
In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from frames in the input videos.
Chapter 3
Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal
PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
3.1 Background
3.1.1 Spatio-temporal localization methods
Spatio-temporal action localization involves three types of tasks: spatial localization, action classification, and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, including the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].
Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion, and pose clues for action classification and spatio-temporal action localization.
Third, temporal localization aims to form action tubes from per-frame
detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity, and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26], and thresholding-based refinement [8], etc.
Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].
3.1.2 Two-stream R-CNN
Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
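As a rough illustration (not code from any of the cited works), the first two fusion strategies can be sketched as follows, where `rgb_logits` and `flow_logits` are hypothetical per-box pre-softmax outputs of the two streams:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def average_fusion(rgb_logits, flow_logits):
    """Strategy i): average the per-stream softmax scores."""
    return 0.5 * (softmax(rgb_logits) + softmax(flow_logits))

def weighted_fusion(rgb_logits, flow_logits, w_rgb, w_flow):
    """Strategy ii): weigh the pre-softmax outputs per class,
    then apply softmax on the weighted sum."""
    return softmax(w_rgb * rgb_logits + w_flow * flow_logits)
```

Strategy iii) would instead learn a fully connected layer on the concatenated stream outputs, so it is omitted from this sketch.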
Please note that most appearance and motion fusion approaches in
the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both the features and the bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.
3.2 Action Detection Model in Spatial Domain
Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.
3.2.1 PCSC Overview
An overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.
The two cooperation modules perform region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation
Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).
by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.
3.2.2 Cross-stream Region Proposal Cooperation
We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.
In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.
The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).
Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN of the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
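The proposal-level cooperation step can be sketched as follows. This is a simplified illustration only: the `[x1, y1, x2, y2]` box format and the NMS thresholds (0.7 within a stream, a lower 0.3 for cross-stream boxes) are assumptions for the sketch rather than the exact values used in our experiments:

```python
import numpy as np

def nms(boxes, scores, thresh):
    """Standard NMS over [x1, y1, x2, y2] boxes; returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU between the kept box and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= thresh]
    return keep

def update_proposals(own_boxes, other_boxes, own_scores, other_scores,
                     own_nms=0.7, cross_nms=0.3):
    """P_t = (own detections from stage t-2) U (other-stream detections
    from stage t-1); a lower NMS threshold is applied to the cross-stream
    boxes to remove more redundancy."""
    keep_own = nms(own_boxes, own_scores, own_nms)
    keep_other = nms(other_boxes, other_scores, cross_nms)
    return np.concatenate([own_boxes[keep_own], other_boxes[keep_other]], axis=0)
```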
With our approach, the diversity of the region proposals from one stream is enhanced by using the complementary boxes from another stream. This helps reduce the number of missing bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.
Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. First, as a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the
new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Second, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data at the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples at the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.
3.2.3 Cross-stream Feature Cooperation
To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.
Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f, and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2, C^i_3, C^i_4, C^i_5$, where $i \in \{RGB, Flow\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2, U^i_3, U^i_4, U^i_5$.
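For illustration, the top-down merging operation of the feature pyramid can be sketched as below. This is a minimal NumPy sketch with nearest-neighbour upsampling and a 1×1 lateral convolution; the actual implementation follows [38], and the weight matrix `w_lateral` is a hypothetical learned parameter:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    """1x1 convolution: a linear map over channels; w is (C_out, C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def top_down_merge(u_coarse, c_fine, w_lateral):
    """Merge a coarser top-down map with the laterally connected
    bottom-up map of the next finer level, as in FPN."""
    return upsample2x(u_coarse) + conv1x1(c_fine, w_lateral)
```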
Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is
insufficient for the features of the two streams to exchange information with each other and benefit from such an information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other for feature refinement.
We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote by $l \in \{2, 3, 4, 5\}$ the index of the feature maps in $\{C^i_2, C^i_3, C^i_4, C^i_5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_l$ with the help of the features $C^{Flow}_l$ as follows:

$$\tilde{C}^{RGB}_l = f_\theta(C^{Flow}_l) \oplus C^{RGB}_l \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{Flow}_l)$ nonlinearly extracts the message from the feature $C^{Flow}_l$, and the extracted message is then used to improve the features $C^{RGB}_l$.

The output of $f_\theta(\cdot)$ has to produce feature maps with the same number of channels and resolution as $C^{RGB}_l$ and $C^{Flow}_l$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages by using a two-layer non-linear transform. Once the feature maps $\tilde{C}^{RGB}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{RGB}_l$ in the top-down pathway are generated accordingly.
The above process performs image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.
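A minimal sketch of the message-passing function $f_\theta(\cdot)$ in Eqn. (3.1) is given below. For illustration, we assume that ReLU follows each of the two $1 \times 1$ convolutions, and feature maps are (channels, height, width) NumPy arrays; `w_reduce` and `w_restore` are hypothetical learned weights:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map; w has shape (C_out, C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def message_passing(c_flow, c_rgb, w_reduce, w_restore):
    """Eqn (3.1): f_theta(C_flow) + C_rgb, where f_theta is two stacked
    1x1 convolutions; the first reduces the channel dimension and the
    second restores it (ReLU after each conv is an assumption here)."""
    msg = relu(conv1x1(relu(conv1x1(c_flow, w_reduce)), w_restore))
    return msg + c_rgb  # element-wise addition, the "+" in Eqn (3.1)
```

With zero weights the message vanishes and the RGB features pass through unchanged, which makes the residual structure of Eqn. (3.1) explicit.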
Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the features $F^{RGB}_t$ and $F^{Flow}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI-feature $F^{Flow}_t$ of the flow stream is used to help improve the ROI-feature $F^{RGB}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\tilde{F}^{RGB}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI-feature $F^{RGB}_t$ of the RGB stream is used to help improve the ROI-feature $F^{Flow}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI-features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.
3.2.4 Training Details
For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, and negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
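The IoU-based label assignment rule above can be sketched as follows (a simplified illustration assuming `[x1, y1, x2, y2]` boxes):

```python
import numpy as np

def box_iou(boxes, gt_boxes):
    """Pairwise IoU between N proposal boxes and M ground-truth boxes."""
    x1 = np.maximum(boxes[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_b[:, None] + area_g[None, :] - inter
    return inter / union

def assign_labels(proposals, gt_boxes, iou_thresh=0.5):
    """Positive (1) if IoU > 0.5 with any ground-truth box, else negative (0)."""
    max_iou = box_iou(proposals, gt_boxes).max(axis=1)
    return (max_iou > iou_thresh).astype(np.int64)
```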
3.3 Temporal Boundary Refinement for Action Tubes
After the frame-level detection results are generated, we build the action tubes by linking them. Here, we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this
Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information from the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action segment detection results for the flow stream at Stage 2. Each stage comprises the cooperation modules and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^i_t$, where the superscript $i \in \{RGB, Flow\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.
linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.
To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7 × 7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., the happening of an action) for each frame; and (2) an action segment detector based on the two-stream Faster R-CNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from all stages are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.
3.3.1 Actionness Detector (AD)
We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of this class happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These "confusing" samples are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of the action. The output of the actionness classifier for class i is the probability that an action of class i is happening.
At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then, a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, so that the refined action tubes have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the
42Chapter 3 Progressive Cross-stream Cooperation in Spatial and
Temporal Domain for Action Localization
testing stage
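The test-time refinement above can be sketched as below: the actionness scores along a tube are smoothed with a median filter, and boxes whose smoothed score falls below a threshold are pruned. The window size (17) and threshold (0.01) follow the implementation details in Section 3.4.1; the function names are illustrative, not the thesis code.

```python
# Hedged sketch of the test-time temporal refinement of action tubes:
# median-filter the per-frame actionness scores, then prune low-score boxes.

def median_smooth(scores, window=17):
    """Median-filter a 1-D score sequence (edges use a truncated window)."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        nbr = sorted(scores[max(0, i - half): i + half + 1])
        smoothed.append(nbr[len(nbr) // 2])
    return smoothed

def refine_tube(tube, scores, window=17, thr=0.01):
    """Keep only the boxes whose smoothed actionness score reaches thr."""
    smoothed = median_smooth(scores, window)
    return [box for box, s in zip(tube, smoothed) if s >= thr]
```

The median filter suppresses isolated low or high scores, so a single noisy frame inside an action does not split the tube.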
3.3.2 Action Segment Detector (ASD)
Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector that directly utilizes segment proposals to detect action segments. Our action segment detector takes both RGB and optical flow features from the action tubes as the input, and follows the Faster-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.
(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture
3.3 Temporal Boundary Refinement for Action Tubes
Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.
but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with the scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress temporal boundaries and predict actionness probabilities for the anchors by applying two parallel temporal convolutional layers with the kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
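The receptive-field requirement stated above can be made concrete with a small helper: for an anchor of scale s, the features should cover the anchor plus a temporal context of s/2 on each side, i.e. a span of 2s frames, and the dilation rate of each branch is chosen to reach that span. This is a sketch under assumptions (two stacked layers with kernel size 3), not the exact architecture of [5].

```python
# Illustrative helper for choosing per-scale dilation rates so that stacked
# dilated temporal convolutions cover 2*s frames for an anchor of scale s.
# Layer count and kernel size here are assumptions, not the exact design of [5].

def receptive_field(num_layers, kernel_size, dilation):
    """Receptive field (in frames) of stacked dilated temporal convolutions."""
    return 1 + num_layers * (kernel_size - 1) * dilation

def min_dilation_for_scale(scale, num_layers=2, kernel_size=3):
    """Smallest dilation whose receptive field covers 2*scale frames
    (the anchor plus scale/2 of context before and after it)."""
    d = 1
    while receptive_field(num_layers, kernel_size, d) < 2 * scale:
        d += 1
    return d
```

This is why a single shared receptive field is inadequate: an anchor of scale 228 needs roughly 40 times the temporal span of an anchor of scale 6.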
(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with the length ls, the start point tstart and the end point tend, we extract three
types of features: Fcenter, Fstart and Fend. We use temporal ROI-Pooling on top of the segment proposal regions to pool action tube features and obtain the features Fcenter ∈ R^(D×lt), where lt is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool features Fstart (resp. Fend) from the segments centered at tstart (resp. tend) with the length of 2ls/5 to represent the starting (resp. ending) information. We then concatenate Fstart, Fcenter and Fend to construct the segment features Fseg.
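A minimal sketch of this segment-feature construction is given below, with 1-D features represented as a list of per-frame feature vectors. Temporal ROI pooling is approximated here as max-pooling over equal temporal bins, and the start/end context windows are centered at tstart/tend with length 2ls/5, following the text; all function names are illustrative assumptions.

```python
# Hedged sketch of segment-feature construction: F_start, F_center and F_end
# are pooled with a simple temporal ROI (max) pooling and concatenated.

def temporal_roi_pool(features, t0, t1, out_len):
    """Max-pool features[t0:t1] (list of equal-length vectors) into out_len bins."""
    t0, t1 = max(0, int(t0)), min(len(features), max(int(t1), int(t0) + 1))
    dim = len(features[t0])
    pooled = []
    for b in range(out_len):
        lo = t0 + b * (t1 - t0) // out_len
        hi = max(lo + 1, t0 + (b + 1) * (t1 - t0) // out_len)
        window = features[lo:hi]
        pooled.append([max(v[d] for v in window) for d in range(dim)])
    return pooled

def segment_feature(features, t_start, t_end, out_len):
    """Concatenate F_start, F_center and F_end along the temporal axis."""
    ls = t_end - t_start
    ctx = ls / 5.0  # half-length of the 2*ls/5 context window
    f_start = temporal_roi_pool(features, t_start - ctx, t_start + ctx, out_len)
    f_center = temporal_roi_pool(features, t_start, t_end, out_len)
    f_end = temporal_roi_pool(features, t_end - ctx, t_end + ctx, out_len)
    return f_start + f_center + f_end
```

The concatenation gives the detection head an explicit view of what happens just before the start and just after the end of the proposal, which is exactly the information needed to sharpen temporal boundaries.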
(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals resulting from the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with the kernel size of 3 are applied to non-linearly combine temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting actionness for each frame, which eventually helps refine the boundaries of segments.
3.3.3 Temporal PCSC (T-PCSC)
Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework
from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.
Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
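The proposal-cooperation step described above can be sketched as follows: segment proposals from the two streams are merged into one set, and temporal NMS removes near-duplicate segments before the merged set is re-scored by the next stage. This is an assumed helper for illustration, not the thesis code.

```python
# Sketch of proposal cooperation for T-PCSC: union of the two streams'
# segment proposals followed by temporal NMS. Names are illustrative.

def temporal_iou(a, b):
    """IoU of two 1-D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def merge_proposals(rgb_segs, flow_segs, iou_thr=0.5):
    """rgb_segs/flow_segs: lists of (start, end, score).
    Returns the union after greedy temporal NMS, highest score first."""
    pool = sorted(rgb_segs + flow_segs, key=lambda s: -s[2])
    kept = []
    for seg in pool:
        if all(temporal_iou(seg[:2], k[:2]) <= iou_thr for k in kept):
            kept.append(seg)
    return kept
```

A segment found by only one stream survives the merge, which is precisely how one stream's proposals compensate for the other's misses.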
3.4 Experimental Results
We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.
3.4.1 Experimental Setup
Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.
Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these per-class averaged IoU scores over all classes.
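The MABO computation described above can be sketched as follows: for each ground-truth box, take the best IoU achieved by any detection of the same class, average these best overlaps per class, and then average over classes. The data layout and helper names are assumptions for illustration.

```python
# Hedged sketch of the Mean Average Best Overlap (MABO) metric.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mabo(detections, ground_truths):
    """detections/ground_truths: dict mapping class -> list of boxes.
    Mean (over classes) of the average best overlap per class."""
    abos = []
    for cls, gts in ground_truths.items():
        dets = detections.get(cls, [])
        # Best overlap of each ground-truth box with any detection.
        best = [max((iou(g, d) for d in dets), default=0.0) for g in gts]
        abos.append(sum(best) / len(best))
    return sum(abos) / len(abos)
```

Unlike mAP, MABO ignores detection scores entirely, so it isolates pure localization quality from ranking quality.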
To evaluate the classification performance of our method, we report the average confidence score under different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.
Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is decreased by a factor of 10 at the 5th epoch and again at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations for action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180 and 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320] and [320, ∞], i.e., we have M = 4, and the corresponding feature lengths lt for these length ranges are set to 15, 30, 60 and 120, respectively. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and again at the 5th epoch.
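The grouping of segment proposals by length can be sketched as a small lookup, using the M = 4 ranges and feature lengths listed above; the helper name and the fallback for proposals shorter than 20 frames are assumptions.

```python
# Small sketch of routing a segment proposal to the detection-head branch
# that handles its length range, per the implementation details above.
LENGTH_RANGES = [(20, 80), (80, 160), (160, 320), (320, float("inf"))]
FEATURE_LENGTHS = [15, 30, 60, 120]  # feature length lt for each range

def group_index(length):
    """Index of the length range (detection-head branch) for a proposal."""
    for i, (lo, hi) in enumerate(LENGTH_RANGES):
        if lo <= length < hi:
            return i
    return 0  # assumed fallback: very short proposals use the first branch
```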
3.4.2 Comparison with the State-of-the-art Methods
We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted
as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. The only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.
Results on the UCF-101-24 Dataset
The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.
Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.
Method                                    Streams     mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow    35.8
Peng and Schmid [54]                      RGB+Flow    65.7
Ye, Yang and Tian [94]                    RGB+Flow    67.0
Kalogeiton et al. [26]                    RGB+Flow    69.5
Gu et al. [19]                            RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]             RGB         73.2
Faster R-CNN + FPN (Flow) [38]            Flow        64.7
Faster R-CNN + FPN [38]                   RGB+Flow    75.5
PCSC (Ours)                               RGB+Flow    79.2
Table 3.1 shows the mAPs from different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for
Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds δ. Here, "AD" stands for the actionness detector and "ASD" for the action segment detector.
Method                             Streams          0.2     0.5     0.75    0.5:0.95
Weinzaepfel et al. [83]            RGB+Flow         46.8    -       -       -
Zolfaghari et al. [105]            RGB+Flow+Pose    47.6    -       -       -
Peng and Schmid [54]               RGB+Flow         73.5    32.1    2.7     7.3
Saha et al. [60]                   RGB+Flow         66.6    36.4    0.8     14.4
Yang, Gao and Nevatia [93]         RGB+Flow         73.5    37.8    -       -
Singh et al. [67]                  RGB+Flow         73.5    46.3    15.0    20.4
Ye, Yang and Tian [94]             RGB+Flow         76.2    -       -       -
Kalogeiton et al. [26]             RGB+Flow         76.5    49.2    19.7    23.4
Li et al. [31]                     RGB+Flow         77.9    -       -       -
Gu et al. [19]                     RGB+Flow         -       59.9    -       -
Faster R-CNN + FPN (RGB) [38]      RGB              77.5    51.3    14.2    21.9
Faster R-CNN + FPN (Flow) [38]     Flow             72.7    44.2    7.4     17.2
Faster R-CNN + FPN [38]            RGB+Flow         80.1    53.2    15.9    23.7
PCSC + AD [69]                     RGB+Flow         84.3    61.0    23.0    27.8
PCSC + ASD + AD + T-PCSC (Ours)    RGB+Flow         84.7    63.1    23.6    28.9
per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they perform worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.
Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metrics [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary
work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve the mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.
Results on the J-HMDB Dataset
For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs from different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.
Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.
Method                            Streams     mAP (%)
Peng and Schmid [54]              RGB+Flow    58.5
Ye, Yang and Tian [94]            RGB+Flow    63.2
Kalogeiton et al. [26]            RGB+Flow    65.7
Hou, Chen and Shah [21]           RGB         61.3
Gu et al. [19]                    RGB+Flow    73.3
Sun et al. [72]                   RGB+Flow    77.9
Faster R-CNN + FPN (RGB) [38]     RGB         64.7
Faster R-CNN + FPN (Flow) [38]    Flow        68.4
Faster R-CNN + FPN [38]           RGB+Flow    70.2
PCSC (Ours)                       RGB+Flow    74.8
We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU
Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds δ.
Method                            Streams          0.2     0.5     0.75    0.5:0.95
Gkioxari and Malik [18]           RGB+Flow         -       53.3    -       -
Wang et al. [79]                  RGB+Flow         -       56.4    -       -
Weinzaepfel et al. [83]           RGB+Flow         63.1    60.7    -       -
Saha et al. [60]                  RGB+Flow         72.6    71.5    43.3    40.0
Peng and Schmid [54]              RGB+Flow         74.1    73.1    -       -
Singh et al. [67]                 RGB+Flow         73.8    72.0    44.5    41.6
Kalogeiton et al. [26]            RGB+Flow         74.2    73.7    52.1    44.8
Ye, Yang and Tian [94]            RGB+Flow         75.8    73.8    -       -
Zolfaghari et al. [105]           RGB+Flow+Pose    78.2    73.4    -       -
Hou, Chen and Shah [21]           RGB              78.4    76.9    -       -
Gu et al. [19]                    RGB+Flow         -       78.6    -       -
Sun et al. [72]                   RGB+Flow         -       80.1    -       -
Faster R-CNN + FPN (RGB) [38]     RGB              74.5    73.9    54.1    44.2
Faster R-CNN + FPN (Flow) [38]    Flow             77.9    76.1    56.1    46.3
Faster R-CNN + FPN [38]           RGB+Flow         79.1    78.5    57.2    47.6
PCSC (Ours)                       RGB+Flow         82.6    82.2    63.1    52.8
threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.
At the frame level (see Table 3.3), our PCSC method performs the second best, only behind a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.
3.4.3 Ablation Study
In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.
Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.
Stage    PCSC w/o feature cooperation    PCSC
0        75.5                            78.1
1        76.1                            78.6
2        76.4                            78.8
3        76.7                            79.1
4        76.7                            79.2
Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here, "AD" stands for the actionness detector and "ASD" for the action segment detector.
Method                        Video mAP (%)
Faster R-CNN + FPN [38]       53.2
PCSC                          56.4
PCSC + AD                     61.0
PCSC + ASD (single branch)    58.4
PCSC + ASD                    60.7
PCSC + ASD + AD               61.9
PCSC + ASD + AD + T-PCSC      63.1
Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.
Stage    Video mAP (%)
0        61.9
1        62.8
2        63.1
Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to
verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, the improvement appears to saturate at Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching the key components of our temporal boundary refinement module on and off, and report video-level mAPs at the IoU threshold δ = 0.5. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization and the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work
Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset.
Stage    LoR (%)
1        61.1
2        71.9
3        95.2
4        95.4
in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and reaches the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefit of the temporal boundary refinement process in determining the temporal boundaries of action tubes.
Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages, to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +
Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.
Stage    MABO (%)
1        84.1
2        84.3
3        84.5
4        84.6
Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.
θ      Stage 1    Stage 2    Stage 3    Stage 4
0.5    55.3       55.7       56.3       57.0
0.6    62.1       63.2       64.3       65.5
0.7    68.0       68.5       69.4       69.5
0.8    73.9       74.5       74.8       75.1
0.9    78.4       78.4       79.5       80.1
Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.
Stage                    0      1      2      3      4
Detection speed (FPS)    3.1    2.9    2.8    2.6    2.4
Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.
Stage                    0      1      2
Detection speed (VPS)    2.1    1.7    1.5
ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.
3.4.4 Analysis of Our PCSC at Different Stages
In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.
In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
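The LoR statistic defined above can be sketched directly from its definition: the fraction of RGB-stream proposals whose maximum IoU over all flow-stream proposals exceeds 0.7. The box format and helper names are assumptions for illustration.

```python
# Sketch of the large overlap ratio (LoR) between two proposal sets.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def large_overlap_ratio(rgb_props, flow_props, thr=0.7):
    """Fraction of RGB proposals whose mIoU over all flow proposals > thr."""
    hits = 0
    for r in rgb_props:
        m = max((iou(r, f) for f in flow_props), default=0.0)
        if m > thr:
            hits += 1
    return hits / len(rgb_props)
```

A rising LoR over stages is thus a direct measure of the two streams' proposal sets converging, and explains why the cooperation gain saturates.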
To further evaluate the improvement of our PCSC method over stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be observed clearly, which demonstrates the effectiveness of our PCSC method for both localization and classification.
Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream); (b) single-stream Faster-RCNN (flow stream); (c) a simple combination of the two-stream results from Faster-RCNN [38]; and (d) our PCSC. (Best viewed on screen.)
3.4.5 Runtime Analysis
We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.
Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)
3.4.6 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).
Moreover, the effectiveness of our temporal refinement method is demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be found in Fig. 3.5 (c).
While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases that our method cannot handle (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance could be reduced after the temporal refinement process by our TBR module, the instance could not be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot work well for this challenging case either. Therefore, how to reduce false detections from the background frames is still an open issue, which will be studied in our future work.
3.5 Summary
In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.
4.1 Background

4.1.1 Action Recognition
Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed to apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined these two streams for complementary information exchange between both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, many temporal action localization methods employ the I3D model [4] for base feature extraction.
4.1.2 Spatio-temporal Action Localization
The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.
4.1.3 Temporal Action Localization
Unlike spatio-temporal action localization, temporal action localization focuses on accurately detecting the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness or to predict the action starting or ending time, which were then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.
Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way. In our work, the Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between action semantics extracted from these two complementary lines of works, which offers a significant performance gain to our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.
4.2 Our Approach

In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework
We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon the effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.
Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $P^{RGB} = \{p_i^{RGB} \,|\, p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $P^{Flow} = \{p_i^{Flow} \,|\, p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$, $P_0^{Flow}$ for each stream by applying two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$, $S_0^{Flow}$ and the predicted action categories.
Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals from the preceding two proposal generation
Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $S_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $S_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.
methods, our method progressively applies the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.
4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously from the two-granularity scheme (i.e., the anchor-based and frame-based temporal localization methods) and from the two-stream appearance and motion clues (i.e., RGB and flow).
To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as RGB-stream AFC and flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $S_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, and updates the flow-stream frame-based features as $Q_n^{Flow}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.
In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.
Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages; thus the proposed progressive localization strategy can gradually improve the localization performance.
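The alternating stacking of AFC stages can be summarized with the following Python sketch. It is illustrative only: the two AFC modules are passed in as callables, and the dictionary keys are hypothetical names for the initial proposals described above:

```python
def progressive_localization(afc_rgb, afc_flow, init, num_stages=4):
    """Alternate RGB-stream and flow-stream AFC stages and pool the
    detected segments from every stage.

    `init` holds the initial proposals under the (illustrative) keys
    'S_rgb' and 'S_flow'; each AFC callable maps the current state to
    (detected_segments, updated_state).
    """
    state = dict(init)
    all_segments = [list(state['S_rgb']), list(state['S_flow'])]
    for n in range(1, num_stages + 1):
        if n % 2 == 1:   # odd stage: RGB-stream AFC module
            segs, state = afc_rgb(state)
        else:            # even stage: flow-stream AFC module
            segs, state = afc_flow(state)
        all_segments.append(list(segs))
    # union of the segments across all stages (a final NMS would follow)
    return [s for stage_segs in all_segments for s in stage_segs]
```

The returned union corresponds to the set over which the final NMS operation is performed.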
4.2.3 Anchor-frame Cooperation Module
The Anchor-frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).
Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $S_0^{RGB}$, $S_0^{Flow}$, as the input, as shown in Sec. 4.2.1 and Fig. 4.1(a).
Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as

$$Q_n^{Flow} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q_n^{Flow}$ can generate more accurate action starting and ending probability distributions (a.k.a. more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can thus generate better frame-based proposals $Q_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ likewise leads to better frame-based RGB-stream proposals $Q_n^{RGB}$.
Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $Q_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1, based on the improved frame-based features $Q_n^{Flow}$. We then combine $Q_n^{Flow}$ together with the two-stream anchor-based proposals $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $P_n^{RGB} = S_{n-2}^{RGB} \cup S_{n-1}^{Flow} \cup Q_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $P_n^{Flow} = S_{n-2}^{Flow} \cup S_{n-1}^{RGB} \cup Q_n^{RGB}$.
Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated by a message passing operation similar to Equation (4.1), i.e., $P_n^{RGB} = \mathcal{M}_{flow \rightarrow rgb}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream, and $P_n^{Flow} = \mathcal{M}_{rgb \rightarrow flow}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.
Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $P_n^{RGB}$, $P_n^{Flow}$ and the updated anchor-based features $P_n^{RGB}$, $P_n^{Flow}$. To increase the perceptual aperture of action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (in this work it is fixed to 10), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use GCN [99] to explore the proposal-to-proposal relations, which additionally provides performance gains independent of our cooperation schemes.
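The three-way pooling around a proposal can be sketched as follows, with segment-of-interest pooling approximated by simple average pooling into a fixed number of bins. The bin count, the context interval handling and the function names are illustrative assumptions:

```python
import numpy as np

def soi_pool(features, start, end, out_len=8):
    """Average-pool the temporal slice [start, end) of a C x T feature
    map into `out_len` bins (a simple stand-in for SOI pooling)."""
    start, end = int(start), int(end)
    seg = features[:, max(start, 0):max(end, start + 1)]
    bins = np.array_split(seg, out_len, axis=1)
    cols = [b.mean(axis=1, keepdims=True) if b.size else
            np.zeros((features.shape[0], 1)) for b in bins]
    return np.concatenate(cols, axis=1)

def proposal_feature(features, start, end, context=10, out_len=8):
    """Concatenate the segment feature with two context features pooled
    around the start and end boundaries, mirroring the three-SOI design."""
    center = soi_pool(features, start, end, out_len)
    left = soi_pool(features, start - context // 2, start + context // 2, out_len)
    right = soi_pool(features, end - context // 2, end + context // 2, out_len)
    return np.concatenate([center, left, right], axis=1)
```

The concatenated feature is what would then be fed to the segment regressor and the action classifier.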
4.2.4 Training Details

At each AFC stage, training losses are applied at both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
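For illustration, the per-stage loss composition can be sketched as below. This is a simplified NumPy version: the real training operates on network logits inside a deep learning framework, and the function names and shapes here are our own assumptions:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss used for boundary regression."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy over predicted class probabilities."""
    return -np.log(probs[np.arange(len(labels)), labels] + eps).mean()

def stage_loss(cls_probs, cls_labels, box_pred, box_target,
               start_probs, start_labels, end_probs, end_labels):
    """Per-stage loss: anchor-based classification + regression terms
    plus frame-based start/end classification terms, summed so that the
    whole model can be trained jointly."""
    anchor = cross_entropy(cls_probs, cls_labels) + smooth_l1(box_pred, box_target)
    frame = cross_entropy(start_probs, start_labels) + cross_entropy(end_probs, end_labels)
    return anchor + frame
```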
4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module as illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within a two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.
Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, which significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.
Number of Proposals. The number of proposals does not rapidly increase as the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while preserving highly confident temporal segments. On the other hand, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.
4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].
- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.
- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.
- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.
Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size as 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered as correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
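The tIoU-based correctness criterion described above can be sketched in a few lines of Python (a minimal illustration in our own naming, not the actual evaluation code):

```python
def tiou(seg_a, seg_b):
    """Temporal IoU between two segments given as (start, end) pairs."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_seg, pred_label, gt_seg, gt_label, delta=0.5):
    """A detection counts as correct if its tIoU with a ground-truth
    instance exceeds the threshold delta and its action label matches."""
    return pred_label == gt_label and tiou(pred_seg, gt_seg) > delta
```

The st-IoU case follows the same pattern, with the overlap additionally accumulated over the spatial boxes of the tube.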
4.3.2 Comparison with the State-of-the-art Methods
We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.
THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework and a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5,
Chapter 4. PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ   0.1    0.2    0.3    0.4    0.5
SCNN [63]          47.7   43.5   36.3   28.7   19.0
REINFORCE [95]     48.9   44.0   36.0   26.4   17.1
CDC [62]            -      -     40.1   29.4   23.3
SMS [98]           51.0   45.2   36.5   27.8   17.8
TURN [15]          54.0   50.9   44.1   34.9   25.6
R-C3D [85]         54.5   51.5   44.8   35.6   28.9
SSN [103]          66.0   59.4   51.9   41.0   29.8
BSN [37]            -      -     53.5   45.0   36.9
MGG [46]            -      -     53.9   46.8   37.4
BMN [35]            -      -     56.0   47.4   38.8
TAL-Net [5]        59.8   57.1   53.2   48.5   42.8
P-GCN [99]         69.5   67.8   63.6   57.8   49.1
Ours               71.2   68.9   65.1   59.5   51.2
our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework that fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two paradigms (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head, and our method can still achieve an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.
ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = 0.5, 0.75, 0.95 on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP as well, which is calculated by averaging all mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and
Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ   0.5     0.75    0.95    Average
CDC [62]           45.30   26.00    0.20   23.80
TCN [9]            36.44   21.15    3.90     -
R-C3D [85]         26.80     -       -       -
SSN [103]          39.12   23.48    5.49   23.98
TAL-Net [5]        38.23   18.30    1.30   20.22
P-GCN [99]         42.90   28.14    2.47   26.99
Ours               44.31   29.85    5.47   28.85
Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ   0.5     0.75    0.95    Average
BSN [37]           46.45   29.96    8.02   30.03
P-GCN [99]         48.26   33.16    3.27   31.11
BMN [35]           50.07   34.78    8.29   33.85
Ours               52.04   35.92    7.97   34.91
BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and the resulting average mAP (34.91%) is better than those of BSN and BMN.
UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5, 0.75, and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors in [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.
Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ   0.2    0.5    0.75   0.5:0.95
LTT [83]             46.8    -      -       -
MRTSR [54]           73.5   32.1    2.7    7.3
MSAT [60]            66.6   36.4    7.9   14.4
ROAD [67]            73.5   46.3   15.0   20.4
ACT [26]             76.5   49.2   19.7   23.4
TAD [19]              -     59.9    -       -
PCSC [69]            84.3   61.0   23.0   27.8
Ours                 84.9   63.5   23.7   29.2
4.3.3 Ablation Study
We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.
Anchor-Frame Cooperation. We compare the performances among five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which Q_n^Flow and Q_n^RGB are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into long feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P_{n-1}^Flow and P_{n-2}^RGB (for the RGB-stream AFC module) and
Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage       0       1       2       3       4
AFC         44.75   46.92   48.14   48.37   48.22
w/o CS      44.21   44.96   45.47   45.48   45.37
w/o CG      43.55   45.95   46.71   46.65   46.67
w/o CS-MP   44.75   45.34   45.51   45.53   45.49
w/o CG-MP   44.75   46.32   46.97   46.85   46.79
w/o PC      42.14   43.21   45.14   45.74   45.77
P_{n-1}^RGB and P_{n-2}^Flow (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". Also, the discussions for the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.
In Table 4.5, the results for stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those for our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P_0^RGB ∪ P_0^Flow for "w/o CG" and Q_0^RGB ∪ Q_0^Flow for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved
Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison                LoR (%)
1       P_1^RGB vs. P_0^Flow      67.1
2       P_2^Flow vs. P_1^RGB      79.2
3       P_3^RGB vs. P_2^Flow      89.5
4       P_4^Flow vs. P_3^RGB      97.1
when the number of stages increases, and the results become stable after 2 or 3 stages.
Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-overlap-Ratio (LoR) score. Take the proposal sets P_1^RGB at Stage 1 and P_2^Flow at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in P_2^Flow among all proposals in P_2^Flow. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P_1^RGB is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P_2^Flow) is calculated by comparing it with all proposals in the other proposal set (say P_1^RGB) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in P_4^Flow and those in P_3^RGB are very similar, even though they are generated from different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal
action localization method after Stage 3, which is consistent with the results in Table 4.5.
Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated from our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
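The AR@AN computation described above can be sketched as follows (an illustrative re-implementation under our own naming; proposals are assumed pre-sorted by confidence, as in [37]):

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(proposals, gt_segments, an):
    """AR@AN: keep the top `an` proposals, compute recall at each tIoU
    threshold in {0.5, 0.55, ..., 1.0}, and average the eleven recall rates."""
    thresholds = [(50 + 5 * i) / 100 for i in range(11)]
    kept = proposals[:an]
    recalls = []
    for t in thresholds:
        hit = sum(1 for g in gt_segments
                  if any(tiou(p, g) >= t for p in kept))
        recalls.append(hit / len(gt_segments))
    return sum(recalls) / len(recalls)
```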
In Fig. 4.2, we compare the AR@AN results of the action segments generated from the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated from the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments as used in the "SC" method can lead to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates from our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.
Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset. Panels (a)-(d) show AR@50, AR@100, AR@200 and AR@500, respectively.
Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. The predicted segments are: ground truth (97.7s-104.2s), anchor-based method (95.4s-104.6s), frame-based method (missed), and PCG-TAL at Stage 1 (96.4s-104.6s), Stage 2 (97.0s-104.5s) and Stage 3 (97.3s-104.3s). More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.
4.3.4 Qualitative Results
In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and our
PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment, and eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.
4.4 Summary
In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.
Chapter 5
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our
newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.
5.1 Background
5.1.1 Vision-language Modelling
Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.
The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.
5.1.2 Visual Grounding in Images/Videos
Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et
al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.
In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.
5.2 Methodology
5.2.1 Overview
We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely, we have V = {V_k^clip}_{k=1}^K, where V_k^clip indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^N, where s_n indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (i.e., the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.
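The clip decomposition V = {V_k^clip} amounts to partitioning the KT frame indexes into K consecutive groups of T, which can be made concrete with a short snippet (our own sketch; frames are indexed 0..KT-1 here):

```python
def split_into_clips(num_frames, t):
    """Partition frame indexes 0..num_frames-1 into K = num_frames // t
    non-overlapping clips of t consecutive frames each."""
    assert num_frames % t == 0, "the video length is assumed divisible by T"
    frames = list(range(num_frames))
    return [frames[i:i + t] for i in range(0, num_frames, t)]
```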
Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. (b) The SVG-net generates the initial object tubes. (c) The TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.
In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.
5.2.2 Spatial Visual Grounding
Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes b_t for the target object in each frame and the indexes of the starting and ending frames.
Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from all frames in this video clip to construct the clip feature F_k^clip ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
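The reshaping and stacking described above can be sketched with NumPy to make the tensor shapes explicit (an illustration of the bookkeeping only; the actual features would come from the ResNet-101 backbone):

```python
import numpy as np

def build_clip_feature(frame_feats):
    """frame_feats: list of T per-frame feature maps, each of shape (C, H, W)
    (e.g. the output of the 4th residual block). Each map is reshaped to
    (HW, C) and the T maps are stacked into a clip feature of shape (T, HW, C)."""
    reshaped = [f.reshape(f.shape[0], -1).T for f in frame_feats]  # (HW, C) each
    return np.stack(reshaped, axis=0)
```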
For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two
special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
Multi-modal Feature Learning. Given the visual input feature F_k^clip and the textual input tokens E, we develop an improved ViLBERT module called ST-ViLBERT to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.
The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using pre-trained object detectors.
Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules in the visual branch of ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as the key and value in
Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs of the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).
the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature F^tv ∈ R^{T×HW×C} and the visual-guided textual feature, respectively.
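The combination and decomposition steps can be sketched in NumPy to make the tensor shapes concrete. This is our simplified illustration of the STCD bookkeeping only: the multi-head attention between the two pooling steps is omitted (the attended feature is passed in directly), and the normalization is dropped:

```python
import numpy as np

def stcd_combine(x):
    """x: visual feature of shape (T, HW, C). Spatial average pooling gives
    the temporal feature (T, C); temporal average pooling gives the spatial
    feature (HW, C); concatenation yields the combined feature (T+HW, C)."""
    temporal = x.mean(axis=1)
    spatial = x.mean(axis=0)
    return np.concatenate([temporal, spatial], axis=0)

def stcd_decompose(attended, x):
    """attended: text-guided feature of shape (T+HW, C), standing in for the
    multi-head attention output. Split it back into temporal and spatial
    parts, replicate them to (T, HW, C) via broadcasting, and add the
    residual input (normalization omitted in this sketch)."""
    t = x.shape[0]
    temporal, spatial = attended[:t], attended[t:]
    return x + temporal[:, None, :] + spatial[None, :, :]
```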
Spatial Localization. Given the text-guided visual feature F^tv, our SVG-net predicts the bounding box b_t at each frame. We first reshape F^tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap A ∈ R^{8H×8W}, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and bottom-right coordinates of the predicted bounding box.
Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature Ftv to produce the global text-guided visual feature Fgtv ∈ R^(T×C) for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dimensional common feature space, such that for each frame we can compute the initial relevance score between the corresponding visual feature from Fgtv and the intermediate textual feature by using the correlation operation. After applying the sigmoid activation function, we obtain the final relevance score ôobj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and apply a median filter to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold; namely, the bounding boxes whose smoothed relevance scores are below the threshold are removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t0s, t0e). The initial temporal boundaries (t0s, t0e) and the predicted bounding boxes bt form the initial tube prediction result Binit = {bt | t = t0s, …, t0e}.
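The temporal part of this sub-module (median filtering, thresholding, and longest-segment selection) can be sketched in plain Python (toy scores and a window size of 3 instead of 9, purely for illustration):

```python
import statistics

# Sketch: turn per-frame relevance scores into initial temporal boundaries:
# median-filter the score sequence, threshold it, and keep the longest
# contiguous above-threshold segment, as described above.
scores = [0.1, 0.2, 0.8, 0.1, 0.9, 0.9, 0.8, 0.85, 0.2, 0.1]
threshold = 0.3
win = 3  # median filter window (the thesis uses 9)

def median_filter(seq, k):
    half = k // 2
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        out.append(statistics.median(seq[lo:hi]))
    return out

smoothed = median_filter(scores, win)

# collect contiguous runs of frames whose smoothed score exceeds the threshold
segments, start = [], None
for t, s in enumerate(smoothed + [0.0]):  # sentinel closes the last run
    if s >= threshold and start is None:
        start = t
    elif s < threshold and start is not None:
        segments.append((start, t - 1))
        start = None

# if several (start, end) pairs are detected, keep the longest one
t_s, t_e = max(segments, key=lambda se: se[1] - se[0])
```

Note how the isolated high score at frame 2 is suppressed by the median filter, so only one segment (frames 3 to 7) survives.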
Loss Functions. We use a combination of a focal loss, an L1 loss, and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then select the video clips which contain at least one frame with a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width, and height, as well as the relevance label of the ground-truth bounding box, as (x̄, ȳ), w, h, and o, respectively. When the frame contains a ground-truth bounding box, we have o = 1; otherwise, we have o = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap A for each frame by using the Gaussian kernel axy = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where axy is the value of A ∈ R^(8H×8W) at the spatial location (x, y) and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function LSVG for each frame in the training video clip is defined as follows:
LSVG = λ1 Lc(Â, A) + λ2 Lrel(ôobj, o) + λ3 Lsize(ŵxy, ĥxy, w, h),    (5.1)

where Â ∈ R^(8H×8W) is the predicted heatmap, ôobj is the predicted relevance score, and ŵxy and ĥxy are the predicted width and height of the bounding box centered at the position (x̄, ȳ). Lc is the focal loss [39] for predicting the bounding box center, Lsize is an L1 loss for regressing the size of the bounding box, and Lrel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ1 = λ2 = 1 and λ3 = 0.1. We then average LSVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the supplementary material for further details.
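As a minimal illustration of how the ground-truth center heatmap A is built with the Gaussian kernel above, the kernel can be evaluated on a small grid (toy sizes; in the thesis, σ is set adaptively based on the object size following [28]):

```python
import math

# Sketch: ground-truth center heatmap A built with a Gaussian kernel.
# A 6x6 grid stands in for the 8H x 8W resolution.
H8, W8 = 6, 6
xb, yb = 3, 2   # ground-truth bounding box center (x̄, ȳ)
sigma = 1.0     # bandwidth (adaptive to the object size in the thesis)

A = [
    [math.exp(-((x - xb) ** 2 + (y - yb) ** 2) / (2 * sigma ** 2))
     for x in range(W8)]
    for y in range(H8)
]

assert A[yb][xb] == 1.0          # the heatmap peaks at the center
assert A[yb][xb + 1] < A[yb][xb] # and decays away from it
```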
5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. On the other hand, while frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., Binit) is used as an anchor, and its temporal boundaries are first refined by the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries
[Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are shown for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.]
ts and te. The frame-based branch then continues to refine the boundaries ts and te within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can then be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
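The alternation between the two branches over stages can be summarized with the following control-flow sketch (the two branch functions are placeholders standing in for the ViLBERT-based branches; the numbers they return are arbitrary and only make the loop structure visible):

```python
# Control-flow sketch of the multi-stage boundary refinement in TBR-net.
def anchor_based_branch(ts, te):
    # regresses offsets from the feature of the whole anchor (placeholder)
    return ts + 1, te - 1

def frame_based_branch(ts, te):
    # re-picks the boundaries within local starting/ending regions (placeholder)
    return ts, te - 1

ts, te = 10, 50  # initial boundaries produced by the SVG-net
for stage in range(3):  # three stages are used in our experiments
    ts, te = anchor_based_branch(ts, te)  # refine using the whole anchor
    ts, te = frame_based_branch(ts, te)   # refine locally, frame by frame
```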
Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries (t0s, t0e) and the length l = t0e − t0s, we temporally extend the anchor before the starting frame and after the ending frame by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t0s − l/4, t0e + l/4] and generate the anchor feature F by stacking, along the temporal dimension, the corresponding features at the 20 sampled frames from the pooled video feature Fpool, where Fpool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature F, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t0s and t0e and generate the new starting and ending frame positions t1s and t1e.
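The anchor extension and uniform sampling step can be sketched as follows (frame indices are rounded to integers; the rounding scheme is our assumption):

```python
# Sketch: extend the anchor by l/4 on both sides and uniformly sample 20
# frames inside the extended duration [t0s - l/4, t0e + l/4].
t0s, t0e = 20, 60
l = t0e - t0s                      # anchor length
lo, hi = t0s - l / 4, t0e + l / 4  # extended temporal duration

n = 20
sampled = [round(lo + (hi - lo) * i / (n - 1)) for i in range(n)]
```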
Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t1s generated from the previous anchor-based branch, we define the starting duration as Ds = [t1s − D/2 + 1, t1s + D/2], where D is the length of the starting duration. We then construct the frame-level feature Fs by stacking the corresponding features of all frames within the starting duration from the pooled video feature Fpool. We then take the frame-level feature Fs and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We select the frame with the highest probability as the starting frame position t1s. Accordingly, we can produce the ending frame position t1e. Note that the kernel size of the 1D convolution operations is 1, and the parameters of the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t1s and t1e are generated, they can be used as the input for the anchor-based branch at the next stage.
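The construction of the starting duration Ds and the selection of the most probable starting frame can be sketched as follows (toy probabilities standing in for the 1D convolution outputs):

```python
# Sketch: build the starting duration Ds = [t1s - D/2 + 1, t1s + D/2] around
# the refined starting position and pick the most probable frame inside it.
t1s = 25
D = 8  # length of the starting duration

Ds = list(range(t1s - D // 2 + 1, t1s + D // 2 + 1))  # frames 22 .. 29

# toy per-frame starting probabilities standing in for the 1D conv outputs
probs = {t: 0.1 for t in Ds}
probs[24] = 0.7

# the frame with the highest probability becomes the new starting position
t1s_new = max(Ds, key=lambda t: probs[t])
```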
Loss Functions. To train our TBR-net, we define the training objective function LTBR as follows:

LTBR = λ4 Lanc(Δt̂, Δt) + λ5 Ls(p̂s, ps) + λ6 Le(p̂e, pe),    (5.2)

where Lanc is the regression loss used in [14] for regressing the temporal offsets, and Ls and Le are the cross-entropy losses for predicting the starting and ending frames, respectively. In Lanc, Δt̂ and Δt are the predicted offsets and the ground-truth offsets, respectively. In Ls, p̂s is the predicted probability of being the starting frame for the frames within the starting duration, while ps is the ground-truth label for starting frame position prediction. Similarly, we use p̂e and pe for the loss Le. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.
5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.
- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set, and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960), and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.
- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. The dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.
Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θp, and the temporal length of each clip K are set as 9, 0.3, and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented in PyTorch on a machine with a single GTX 1080Ti GPU.
Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|SU|) Σ_{t∈SI} rt, where rt is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set SI contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and SU is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
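These metrics can be sketched directly from their definitions (toy tubes and per-frame IoU values; frames are represented by integer indices):

```python
# Sketch of vIoU, m_vIoU, and vIoU@R following the definitions above.
def viou(pred_frames, gt_frames, frame_iou):
    """vIoU = (1/|S_U|) * sum of the per-frame IoU r_t over t in S_I."""
    s_i = pred_frames & gt_frames  # intersected frames S_I
    s_u = pred_frames | gt_frames  # union of frames S_U
    return sum(frame_iou[t] for t in s_i) / len(s_u)

# one toy sample: tube predicted on frames 2..5, ground truth on frames 4..7
pred = set(range(2, 6))
gt = set(range(4, 8))
r = {4: 0.8, 5: 0.6}  # per-frame IoU on the intersected frames

v = viou(pred, gt, r)  # (0.8 + 0.6) / 6
vious = [v, 0.1, 0.5]  # vIoU scores of three testing samples
m_viou = sum(vious) / len(vious)
viou_at_03 = sum(x > 0.3 for x in vious) / len(vious)
```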
Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of ST-ViLBERT, we introduce this baseline by replacing ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube, so we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.
5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance than the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance: when compared with the initial tube generation results from our SVG-net, we achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.
5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement of our SVG-net over the baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset contains many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating
Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

Method                    | Declarative Sentence Grounding           | Interrogative Sentence Grounding
                          | m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)      | m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*              | 19.75      25.77        14.60            | 18.32      21.10        12.83
Vanilla (ours)            | 21.49      27.69        15.77            | 20.22      23.17        13.82
SVG-net (ours)            | 22.87      28.31        16.31            | 21.57      24.09        14.33
SVG-net + TBR-net (ours)  | 25.21      30.99        18.21            | 23.67      26.26        16.01
Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                    | m_vIoU(%) | vIoU@0.3(%) | vIoU@0.5(%)
STGVT [76]*               | 18.15     | 26.81       | 9.48
Vanilla (ours)            | 17.94     | 25.59       | 8.75
SVG-net (ours)            | 19.45     | 27.78       | 10.12
SVG-net + TBR-net (ours)  | 22.41     | 30.31       | 12.07
the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.
Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net, and SVG-net + TBR-net, respectively.

From the results, we have the following observations: (1) for both the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages; (2) the performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes; (3) the results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3; (4)
Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

Methods            | Stages | Declarative Sentence Grounding     | Interrogative Sentence Grounding
                   |        | m_vIoU  vIoU@0.3  vIoU@0.5         | m_vIoU  vIoU@0.3  vIoU@0.5
SVG-net            | -      | 22.87   28.31     16.31            | 21.57   24.09     14.33
SVG-net + IFB-net  | 1      | 23.41   28.84     16.69            | 22.12   24.48     14.57
                   | 2      | 23.82   28.97     16.85            | 22.55   24.69     14.71
                   | 3      | 23.77   28.91     16.87            | 22.61   24.72     14.79
SVG-net + IAB-net  | 1      | 23.27   29.57     16.39            | 21.78   24.93     14.31
                   | 2      | 23.74   29.98     16.52            | 22.21   25.42     14.59
                   | 3      | 23.85   30.15     16.57            | 22.43   25.71     14.47
SVG-net + TBR-net  | 1      | 24.32   29.92     17.22            | 22.69   25.37     15.29
                   | 2      | 24.95   30.75     17.02            | 23.34   26.03     15.83
                   | 3      | 25.21   30.99     18.21            | 23.67   26.26     16.01
Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
5.4 Summary

In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
Chapter 6
Conclusions and Future Work
With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization, namely spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our work and discuss potential future directions of video localization.
6.1 Conclusions

The contributions of this thesis are summarized as follows.
• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.
• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module, which takes advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3, and UCF-101-24 datasets show the effectiveness of our proposed framework.
• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.
6.2 Future Work

A promising future direction for video localization is to unify spatial video localization and temporal video localization into one framework. As described in Chapter 3, most existing works generate the spatial and temporal action localization results with two separate networks, and it is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches and generate the spatial and temporal visual grounding results simultaneously.
Bibliography
[1] L Anne Hendricks O Wang E Shechtman J Sivic T Darrelland B Russell ldquoLocalizing moments in video with natural lan-guagerdquo In Proceedings of the IEEE international conference on com-puter vision 2017 pp 5803ndash5812
[2] A Blum and T Mitchell ldquoCombining Labeled and Unlabeled Datawith Co-trainingrdquo In Proceedings of the Eleventh Annual Conferenceon Computational Learning Theory COLTrsquo 98 Madison WisconsinUSA ACM 1998 pp 92ndash100
[3] F Caba Heilbron V Escorcia B Ghanem and J Carlos NieblesldquoActivitynet A large-scale video benchmark for human activityunderstandingrdquo In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2015 pp 961ndash970
[4] J Carreira and A Zisserman ldquoQuo vadis action recognition anew model and the kinetics datasetrdquo In proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 6299ndash6308
[5] Y-W Chao S Vijayanarasimhan B Seybold D A Ross J Dengand R Sukthankar ldquoRethinking the faster r-cnn architecture fortemporal action localizationrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2018 pp 1130ndash1139
[6] S Chen and Y-G Jiang ldquoSemantic proposal for activity localiza-tion in videos via sentence queryrdquo In Proceedings of the AAAI Con-ference on Artificial Intelligence Vol 33 2019 pp 8199ndash8206
[7] Z Chen L Ma W Luo and K-Y Wong ldquoWeakly-SupervisedSpatio-Temporally Grounding Natural Sentence in Videordquo In Pro-ceedings of the 57th Annual Meeting of the Association for Computa-tional Linguistics 2019 pp 1884ndash1894
108 Bibliography
[8] G Cheacuteron A Osokin I Laptev and C Schmid ldquoModeling Spatio-Temporal Human Track Structure for Action Localizationrdquo InCoRR abs180611008 (2018) arXiv 180611008 URL httparxivorgabs180611008
[9] X Dai B Singh G Zhang L S Davis and Y Qiu Chen ldquoTempo-ral context network for activity localization in videosrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 5793ndash5802
[10] A Dave O Russakovsky and D Ramanan ldquoPredictive-correctivenetworks for action detectionrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2017 pp 981ndash990
[11] J Deng W Dong R Socher L-J Li K Li and L Fei-Fei ldquoIma-genet A large-scale hierarchical image databaserdquo In Proceedingsof the IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2009 pp 248ndash255
[12] C Feichtenhofer A Pinz and A Zisserman ldquoConvolutional two-stream network fusion for video action recognitionrdquo In Proceed-ings of the IEEE conference on computer vision and pattern recognition2016 pp 1933ndash1941
[13] J Gao K Chen and R Nevatia ldquoCtap Complementary temporalaction proposal generationrdquo In Proceedings of the European Confer-ence on Computer Vision (ECCV) 2018 pp 68ndash83
[14] J Gao C Sun Z Yang and R Nevatia ldquoTall Temporal activitylocalization via language queryrdquo In Proceedings of the IEEE inter-national conference on computer vision 2017 pp 5267ndash5275
[15] J Gao Z Yang K Chen C Sun and R Nevatia ldquoTurn tap Tem-poral unit regression network for temporal action proposalsrdquo InProceedings of the IEEE International Conference on Computer Vision2017 pp 3628ndash3636
[16] R Girshick ldquoFast R-CNNrdquo In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) 2015 pp 1440ndash1448
[17] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmenta-tionrdquo In Proceedings of the IEEE conference on computer vision andpattern recognition 2014 pp 580ndash587
Bibliography 109
[18] G Gkioxari and J Malik ldquoFinding action tubesrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2015
[19] C Gu C Sun D A Ross C Vondrick C Pantofaru Y Li S Vi-jayanarasimhan G Toderici S Ricco R Sukthankar et al ldquoAVAA video dataset of spatio-temporally localized atomic visual ac-tionsrdquo In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2018 pp 6047ndash6056
[20] K He X Zhang S Ren and J Sun ldquoDeep residual learning forimage recognitionrdquo In Proceedings of the IEEE conference on com-puter vision and pattern recognition 2016 pp 770ndash778
[21] R Hou C Chen and M Shah ldquoTube Convolutional Neural Net-work (T-CNN) for Action Detection in Videosrdquo In The IEEE In-ternational Conference on Computer Vision (ICCV) 2017
[22] G Huang Z Liu L Van Der Maaten and K Q WeinbergerldquoDensely connected convolutional networksrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2017pp 4700ndash4708
[23] E Ilg N Mayer T Saikia M Keuper A Dosovitskiy and T BroxldquoFlownet 20 Evolution of optical flow estimation with deep net-worksrdquo In IEEE conference on computer vision and pattern recogni-tion (CVPR) Vol 2 2017 p 6
[24] H Jhuang J Gall S Zuffi C Schmid and M J Black ldquoTowardsunderstanding action recognitionrdquo In Proceedings of the IEEE in-ternational conference on computer vision 2013 pp 3192ndash3199
[25] Y-G Jiang J Liu A R Zamir G Toderici I Laptev M Shahand R Sukthankar THUMOS challenge Action recognition with alarge number of classes 2014
[26] V Kalogeiton P Weinzaepfel V Ferrari and C Schmid ldquoActiontubelet detector for spatio-temporal action localizationrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 4405ndash4413
[27] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo In Advances inneural information processing systems 2012 pp 1097ndash1105
110 Bibliography
[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.
[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.
[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).
[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.
[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training". In: AAAI. 2020, pp. 11336–11344.
[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A simple and performant baseline for vision and language". In: arXiv preprint (2019).
[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A real-time cross-modality correlation filtering method for referring expression comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.
[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-matching network for temporal action proposal generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.
[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.
[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature pyramid networks for object detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.
[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.
[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.
[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.
[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity generator for temporal action proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.
[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.
[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task collaborative network for joint referring expression comprehension and segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.
[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.
[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.
[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.
[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.
[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.
[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer, 2016, pp. 744–759.
[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.
[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.
[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[59] A. Sadhu, K. Chen, and R. Nevatia. "Video object grounding using semantic roles in language description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.
[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep learning for detecting multiple space-time action tubes in videos". In: Proceedings of the British Machine Vision Conference 2016 (BMVC 2016), York, UK, September 19–22, 2016. 2016.
[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.
[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.
[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.
[64] K. Simonyan and A. Zisserman. "Two-stream convolutional networks for action recognition in videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.
[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.
[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.
[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).
[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving action localization by progressive cross-stream cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.
[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of generic visual-linguistic representations". In: International Conference on Learning Representations. 2020.
[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.
[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric relation network". In: The European Conference on Computer Vision (ECCV). 2018.
[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "FishNet: A versatile backbone for image, region, and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.
[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[75] H. Tan and M. Bansal. "LXMERT: Learning cross-modality encoder representations from transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.
[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric spatio-temporal video grounding with visual transformers". In: arXiv preprint arXiv:2011.05049 (2020).
[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness estimation using hybrid fully convolutional networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.
[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.
[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.
[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.
[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.
[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.
[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.
[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.
[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.
[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-temporal progressive learning for video action detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-supervised video object grounding by exploring spatio-temporal contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.
[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving one-stage visual grounding by recursive sub-query construction". In: arXiv preprint arXiv:2008.01059 (2020).
[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.
[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).
[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.
[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.
[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.
[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.
[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.
[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph convolutional networks for temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.
[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and adversarial network for unsupervised domain adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.
[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-aware multi-branch relation networks for spatio-temporal video grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.
[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where does it exist: Spatio-temporal video grounding for multi-form sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.
[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.
[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as points". In: arXiv preprint arXiv:1904.07850 (2019).
[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
- Abstract
- Declaration
- Acknowledgements
- List of Publications
- List of Figures
- List of Tables
- Introduction
  - Spatial-temporal Action Localization
  - Temporal Action Localization
  - Spatio-temporal Visual Grounding
  - Thesis Outline
- Literature Review
  - Background
    - Image Classification
    - Object Detection
    - Action Recognition
  - Spatio-temporal Action Localization
  - Temporal Action Localization
  - Spatio-temporal Visual Grounding
  - Summary
- Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
  - Background
    - Spatial temporal localization methods
    - Two-stream R-CNN
  - Action Detection Model in Spatial Domain
    - PCSC Overview
    - Cross-stream Region Proposal Cooperation
    - Cross-stream Feature Cooperation
    - Training Details
  - Temporal Boundary Refinement for Action Tubes
    - Actionness Detector (AD)
    - Action Segment Detector (ASD)
    - Temporal PCSC (T-PCSC)
  - Experimental Results
    - Experimental Setup
    - Comparison with the State-of-the-art Methods
      - Results on the UCF-101-24 Dataset
      - Results on the J-HMDB Dataset
    - Ablation Study
    - Analysis of Our PCSC at Different Stages
    - Runtime Analysis
    - Qualitative Results
  - Summary
- PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
  - Background
    - Action Recognition
    - Spatio-temporal Action Localization
    - Temporal Action Localization
  - Our Approach
    - Overall Framework
    - Progressive Cross-Granularity Cooperation
    - Anchor-frame Cooperation Module
    - Training Details
    - Discussions
  - Experiment
    - Datasets and Setup
    - Comparison with the State-of-the-art Methods
    - Ablation Study
    - Qualitative Results
  - Summary
- STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
  - Background
    - Vision-language Modelling
    - Visual Grounding in Images/Videos
  - Methodology
    - Overview
    - Spatial Visual Grounding
    - Temporal Boundary Refinement
  - Experiment
    - Experiment Setup
    - Comparison with the State-of-the-art Methods
    - Ablation Study
  - Summary
- Conclusions and Future Work
  - Conclusions
  - Future Work
- Bibliography