

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Chunhui Gu∗ Chen Sun∗ David A. Ross∗ Carl Vondrick∗ Caroline Pantofaru∗

Yeqing Li∗ Sudheendra Vijayanarasimhan∗ George Toderici∗ Susanna Ricco∗

Rahul Sukthankar∗ Cordelia Schmid†∗ Jitendra Malik‡∗

∗ Google Research   † Inria, Laboratoire Jean Kuntzmann, Grenoble, France   ‡ University of California at Berkeley, USA

Abstract

This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) the use of movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips.

AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

1. Introduction

We introduce a new annotated video dataset, AVA, to advance action recognition research (see Fig. 1). The annotation is person-centric at a sampling frequency of 1 Hz. Every person is localized using a bounding box and the attached labels correspond to (possibly multiple) actions being performed by the actor: one action corresponding to the actor’s pose (orange text) — standing, sitting, walking, swimming etc. — and there may be additional actions corresponding to interactions with objects (red text) or interactions with other persons (blue text). Each person in a frame containing multiple actors is labeled separately.

Figure 1. The bounding box and action annotations in sample frames of the AVA dataset. Each bounding box is associated with 1 pose action (in orange), 0–3 interactions with objects (in red), and 0–3 interactions with other people (in blue). Note that some of these actions require temporal context to accurately label.

To label the actions performed by a person, a key choice is the annotation vocabulary, which in turn is determined by the temporal granularity at which actions are classified. We use short segments (±1.5 seconds centered on a keyframe) to provide temporal context for labeling the actions in the middle frame. This enables the annotator to use movement cues for disambiguating actions such as pick up or put down that cannot be resolved in a static frame. We keep the temporal context relatively brief because we are interested in (temporally) fine-scale annotation of physical actions, which motivates “Atomic Visual Actions” (AVA). The vocabulary consists of 80 different atomic visual actions. Our dataset is sourced from the 15th to 30th minute time intervals of 430 different movies, which given the 1 Hz sampling frequency gives us nearly 900 keyframes for each movie. In each keyframe, every person is labeled with (possibly multiple) actions from the AVA vocabulary. Each person is linked across consecutive keyframes to provide short temporal sequences of action labels (Section 4.3).



Figure 2. This figure illustrates the hierarchical nature of an activity. From Barker and Wright [3], pg. 247.

We now motivate the main design choices of AVA.

Atomic action categories. Barker & Wright [3] noted the hierarchical nature of activity (Fig. 2) in their classic study of the “behavior episodes” in the daily lives of the residents of a small town in Kansas. At the finest level, the actions consist of atomic body movements or object manipulation, but at coarser levels, the most natural descriptions are in terms of intentionality and goal-directed behavior.

This hierarchy makes defining a vocabulary of action labels ill posed, contributing to the slower progress of our field compared to object recognition; exhaustively listing high-level behavioral episodes is impractical. However, if we limit ourselves to fine time scales, the actions are very physical in nature and have clear visual signatures. Here, we annotate keyframes at 1 Hz, as this is sufficiently dense to capture the complete semantic content of actions while enabling us to avoid requiring unrealistically precise temporal annotation of action boundaries. The THUMOS challenge [18] observed that action boundaries (unlike objects) are inherently fuzzy, leading to significant inter-annotator disagreement. By contrast, annotators can easily determine (using ±1.5 s of context) whether a frame contains a given action. Effectively, AVA localizes action start and end points to an acceptable precision of ±0.5 s.

Person-centric action time series. While events such as trees falling do not involve people, our focus is on the activities of people, treated as single agents. There could be multiple people, as in sports or two people hugging, but each one is an agent with individual choices, so we treat each separately. The action labels assigned to a person over time are a rich source of data for temporal modeling (Section 4.3).

Annotation of movies. Ideally we would want behavior “in the wild”. We do not have that, but movies are a compelling approximation, particularly when we consider the diversity of genres and countries with flourishing film industries. We do expect some bias in this process. Stories have to be interesting and there is a grammar of the film language [2] that communicates through the juxtaposition of shots. That said, in each shot we can expect an unfolding sequence of human actions, somewhat representative of reality, as conveyed by competent actors. AVA complements the current datasets sourced from user-generated video because we expect movies to contain a greater range of activities, as befits the telling of diverse stories.

Exhaustive action labeling. We label all the actions of all the people in all the keyframes. This naturally results in a Zipf’s law type of imbalance across action categories. There will be many more examples of typical actions (standing or sitting) than memorable ones (dancing), but this is how it should be! Recognition models need to operate on realistic “long-tailed” action distributions [15] rather than being scaffolded using artificially balanced datasets. Another consequence of our protocol is that, since we do not retrieve examples of action categories by explicit querying of internet video resources, we avoid a certain kind of bias: opening a door is a common event that occurs frequently in movie clips; however, a door-opening action that has been tagged as such on YouTube is likely attention-worthy in a way that makes it atypical.

We believe that AVA, with its realistic complexity, exposes the inherent difficulty of action recognition hidden by many popular datasets in the field. A video clip of a single person performing a visually salient action like swimming in a typical background is easy to discriminate from, say, one of a person running. Compare with AVA, where we encounter multiple actors, small in image size, performing actions that are only subtly different, such as touching vs. holding an object. To verify this intuition, we do comparative benchmarking on JHMDB [20], UCF101-24 categories [32] and AVA. The approach we use for spatio-temporal action localization (see Section 5) builds upon multi-frame approaches [16, 41], but classifies tubelets with I3D convolutions [6]. We obtain state-of-the-art performance on JHMDB [20] and UCF101-24 categories [32] (see Section 6), while the mAP on AVA is only 15.6%.

The AVA dataset has been released publicly at https://research.google.com/ava/.

2. Related work

Action recognition datasets. Most popular action classification datasets, such as KTH [35], Weizmann [4], Hollywood-2 [26], HMDB [24], and UCF101 [39], consist of short clips, manually trimmed to capture a single action. These datasets are ideally suited for training fully-supervised, whole-clip, forced-choice video classifiers. Recently, datasets such as TrecVid MED [29], Sports-1M [21], YouTube-8M [1], Something-something [12], SLAC [48], Moments in Time [28], and Kinetics [22] have focused on large-scale video classification, often with automatically generated – and hence potentially noisy – annotations. They serve a valuable purpose but address a different need than AVA.

Some recent work has moved towards temporal localization. ActivityNet [5], THUMOS [18], MultiTHUMOS [46] and Charades [37] use large numbers of untrimmed videos, each containing multiple actions, obtained either from YouTube (ActivityNet, THUMOS, MultiTHUMOS) or from crowdsourced actors (Charades). The datasets provide temporal (but not spatial) localization for each action of interest. AVA differs from them, as we provide spatio-temporal annotations for each subject performing an action, and annotations are dense over 15-minute clips.

A few datasets, such as CMU [23], MSR Actions [47], UCF Sports [32] and JHMDB [20], provide spatio-temporal annotations in each frame for short videos. The main differences with our AVA dataset are: the small number of actions; the small number of video clips; and the fact that the clips are very short. Furthermore, actions are composite (e.g., pole-vaulting) and not atomic as in AVA. Recent extensions, such as UCF101 [39], DALY [44] and Hollywood2Tubes [27], evaluate spatio-temporal localization in untrimmed videos, which makes the task significantly harder and results in a performance drop. However, the action vocabulary is still restricted to a limited number of composite actions. Moreover, they do not densely cover the actions; a good example is BasketballDunk in UCF101, where only the dunking player is annotated. However, real-world applications often require continuous annotation of the atomic actions of all humans, which can then be composed into higher-level events. This motivates AVA’s exhaustive labeling over 15-minute clips.

AVA is also related to still image action recognition datasets [7, 9, 13] that are limited in two ways. First, the lack of motion can make action disambiguation difficult. Second, modeling composite events as a sequence of atomic actions is not possible in still images. This is arguably out of scope here, but clearly required in many real-world applications, for which AVA does provide training data.

Methods for spatio-temporal action localization. Most recent approaches [11, 30, 34, 43] rely on object detectors trained to discriminate action classes at the frame level with a two-stream variant, processing RGB and flow data separately. The resulting per-frame detections are then linked using dynamic programming [11, 38] or tracking [43]. All these approaches rely on integrating frame-level detections. Very recently, multi-frame approaches have emerged: Tubelets [41] jointly estimate localization and classification over several frames, T-CNN [16] uses 3D convolutions to estimate short tubes, micro-tubes rely on two successive frames [33], and pose-guided 3D convolutions add pose to a two-stream approach [49]. We build upon the idea of spatio-temporal tubes, but employ state-of-the-art I3D convolutions [6] and Faster R-CNN [31] region proposals to outperform the state of the art.

3. Data collection

Annotation of the AVA dataset consists of five stages: action vocabulary generation, movie and segment selection, person bounding box annotation, person linking and action annotation.

Figure 3. User interface for action annotation. Details in Sec. 3.5.

3.1. Action vocabulary generation

We follow three principles to generate our action vocabulary. The first one is generality. We collect generic actions in daily-life scenes, as opposed to specific activities in specific environments (e.g., playing basketball on a basketball court). The second one is atomicity. Our action classes have clear visual signatures, and are typically independent of interacted objects (e.g., hold without specifying what object to hold). This keeps our list short yet complete. The last one is exhaustivity. We initialized our list using knowledge from previous datasets, and iterated the list in several rounds until it covered ∼99% of actions in the AVA dataset labeled by annotators. We end up with 14 pose classes, 49 person-object interaction classes and 17 person-person interaction classes in the vocabulary.

3.2. Movie and segment selection

The raw video content of the AVA dataset comes from YouTube. We begin by assembling a list of top actors of many different nationalities. For each name we issue a YouTube search query, retrieving up to 2000 results. We only include videos with the “film” or “television” topic annotation, a duration of over 30 minutes, at least 1 year since upload, and at least 1000 views. We further exclude black & white, low resolution, animated, cartoon, and gaming videos, as well as those containing mature content.
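As an illustration of the selection rules just described, the sketch below expresses them as a single filter over one video's metadata. It is only a schematic rendering: the field names and record format are assumptions made for the example, not the metadata schema or tooling actually used to build AVA.

```python
# Hypothetical metadata record; the field names below are illustrative only.
from datetime import datetime, timedelta

def keep_video(meta: dict, now: datetime) -> bool:
    """Return True if a candidate video passes the AVA-style source filters."""
    old_enough = now - meta["upload_date"] >= timedelta(days=365)  # >= 1 year since upload
    return (
        meta["topic"] in {"film", "television"}   # topic annotation filter
        and meta["duration_minutes"] > 30         # longer than 30 minutes
        and old_enough
        and meta["view_count"] >= 1000            # at least 1000 views
        and not meta["excluded"]                  # black & white, low-res, animated,
    )                                             # cartoon, gaming, or mature content

example = {"topic": "film", "duration_minutes": 95, "view_count": 52000,
           "upload_date": datetime(2015, 3, 1), "excluded": False}
print(keep_video(example, now=datetime(2017, 1, 1)))   # True
```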

To create a representative dataset within constraints, our selection criteria avoid filtering by action keywords, using automated action classifiers, or forcing a uniform label distribution. We aim to create an international collection of films by sampling from large film industries. However, the depiction of action in film is biased, e.g. by gender [10], and does not reflect the “true” distribution of human activity.

Each movie contributes equally to the dataset, as we only label a sub-part ranging from the 15th to the 30th minute. We skip the beginning of the movie to avoid annotating titles or trailers. We choose a duration of 15 minutes so we are able to include more movies under a fixed annotation budget, and thus increase the diversity of our dataset.


Figure 4. We show examples of how atomic actions change over time in AVA: clink glass → drink; grab (a person) → hug; open → close; look at phone → answer phone; turn → open; fall down → lie/sleep. The text shows pairs of atomic actions for the people in red bounding boxes. Temporal information is key for recognizing many of the actions, and appearance can substantially vary within an action category, such as opening a door or bottle.

Each 15-min clip is then partitioned into 897 overlapping 3 s movie segments with a stride of 1 second.
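The arithmetic behind the 897 segments can be made concrete with the short sketch below. It is a minimal illustration, assuming a 900-second clip, integer-second keyframes, and the requirement that each ±1.5 s context window lies entirely inside the clip; it is not the authors' preprocessing code.

```python
# Minimal sketch: enumerate the 3 s segments of one 15-minute clip, assuming
# integer-second keyframes whose +/- 1.5 s context must stay inside the clip.
CLIP_LEN_S = 15 * 60      # 900 seconds per annotated movie interval
CONTEXT_S = 1.5           # temporal context on each side of a keyframe

def segment_times(clip_len_s=CLIP_LEN_S, context_s=CONTEXT_S):
    """Yield (start, keyframe, end) for every valid segment, stride 1 s."""
    for t in range(clip_len_s + 1):
        start, end = t - context_s, t + context_s
        if start >= 0 and end <= clip_len_s:
            yield (start, t, end)

print(len(list(segment_times())))   # 897, matching the paper
```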

3.3. Person bounding box annotation

We localize a person and his or her actions with a bounding box. When multiple subjects are present in a keyframe, each subject is shown to the annotator separately for action annotation, and thus their action labels can be different.

Since bounding box annotation is labor intensive, we choose a hybrid approach. First, we generate an initial set of bounding boxes using the Faster-RCNN person detector [31], with the operating point set to ensure high precision. Annotators then add the bounding boxes missed by our detector. This hybrid approach ensures full bounding box recall, which is essential for benchmarking, while minimizing the cost of manual annotation. The manual stage retrieves only 5% more bounding boxes beyond those found by our person detector, validating our design choice. Any incorrect bounding boxes are marked and removed by annotators in the next stage of action annotation.

3.4. Person link annotation

We link the bounding boxes over short periods of time to obtain ground-truth person tracklets. We calculate the pairwise similarity between bounding boxes in adjacent keyframes using a person embedding [45] and solve for the optimal matching with the Hungarian algorithm [25]. While automatic matching is generally strong, we further remove false positives with human annotators who verify each match. This procedure results in 81,000 tracklets ranging from a few seconds to a few minutes.
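The matching step can be sketched as follows. This is only an illustration of the embedding-plus-Hungarian idea: the cosine similarity and the acceptance threshold are assumptions made for the example, and in AVA the surviving links are still verified by human annotators.

```python
# Minimal sketch of the person-linking step, not the authors' implementation.
# Assumes each detected box already has an embedding vector from a
# person-embedding network (cf. [45]); similarity measure is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_boxes(emb_t: np.ndarray, emb_t1: np.ndarray, min_sim: float = 0.5):
    """Match boxes in keyframe t to boxes in keyframe t+1.

    emb_t:  (N, D) embeddings for the N boxes at time t
    emb_t1: (M, D) embeddings for the M boxes at time t+1
    Returns a list of (i, j) index pairs forming tracklet links.
    """
    # Cosine similarity matrix between all box pairs.
    a = emb_t / np.linalg.norm(emb_t, axis=1, keepdims=True)
    b = emb_t1 / np.linalg.norm(emb_t1, axis=1, keepdims=True)
    sim = a @ b.T
    # The Hungarian algorithm minimizes cost, so negate the similarity.
    rows, cols = linear_sum_assignment(-sim)
    # Keep only confident matches; low-similarity assignments are dropped.
    return [(i, j) for i, j in zip(rows, cols) if sim[i, j] >= min_sim]
```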

3.5. Action annotation

The action labels are generated by crowd-sourced annotators using the interface shown in Figure 3. The left panel shows both the middle frame of the target segment (top) and the segment as a looping embedded video (bottom). The bounding box overlaid on the middle frame specifies the person whose action needs to be labeled. On the right are text boxes for entering up to 7 action labels, including 1 pose action (required), 3 person-object interactions (optional), and 3 person-person interactions (optional). If none of the listed actions is descriptive, annotators can flag a check box called “other action”. In addition, they can flag segments containing blocked or inappropriate content, or incorrect bounding boxes.

In practice, we observe that it is inevitable for annotators to miss correct actions when they are instructed to find all correct ones from a large vocabulary of 80 classes. Inspired by [36], we split the action annotation pipeline into two stages: action proposal and verification. We first ask multiple annotators to propose action candidates for each question, so the joint set possesses a higher recall than individual proposals. Annotators then verify these proposed candidates in the second stage. Results show a significant recall improvement using this two-stage approach, especially on actions with fewer examples. See the detailed analysis in the supplemental material. On average, annotators take 22 seconds to annotate a given video segment at the propose stage, and 19.7 seconds at the verify stage.

Each video clip is annotated by three independent annotators and we only regard an action label as ground truth if it is verified by at least two annotators. Annotators are shown segments in randomized order.
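A minimal sketch of this consensus rule is given below; the bookkeeping details (how per-annotator label sets are stored) are assumptions for the example, not a description of the actual annotation pipeline.

```python
# Keep a label only if at least 2 of the 3 independent annotators verified it.
from collections import Counter

def consensus_labels(annotations, min_votes=2):
    """annotations: one set of action labels per annotator, for one person box."""
    votes = Counter(label for labels in annotations for label in labels)
    return {label for label, n in votes.items() if n >= min_votes}

# Two of three annotators agree on "stand" and "talk to"; "listen to" is dropped.
print(consensus_labels([{"stand", "talk to"}, {"stand", "listen to"}, {"stand", "talk to"}]))
```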

3.6. Training, validation and test sets

Our training/validation/test sets are split at the video level, so that all segments of one video appear only in one split. The 430 videos are split into 235 training, 64 validation and 131 test videos, roughly a 55:15:30 split, resulting in 211k training, 57k validation and 118k test segments.

4. Characteristics of the AVA dataset

We first build intuition on the diversity and difficulty of our AVA dataset through visual examples. Then, we characterize the annotations of our dataset quantitatively. Finally, we explore action and temporal structure.


Figure 5. Sizes of the action classes in the AVA train/val dataset, sorted in descending order, with colors indicating action types.

4.1. Diversity and difficulty

Figure 4 shows examples of atomic actions as they change over consecutive segments. Besides variations in bounding box size and cinematography, many of the categories will require discriminating fine-grained differences, such as “clinking glass” versus “drinking”, or leveraging temporal context, such as “opening” versus “closing”.

Figure 4 also shows two examples for the action “open”. Even within an action class the appearance varies with vastly different contexts: the object being opened may even change. The wide intra-class variety will allow us to learn features that identify the critical spatio-temporal parts of an action — such as the breaking of a seal for “opening”.

4.2. Annotation Statistics

Figure 5 shows the distribution of action annotations in AVA. The distribution roughly follows Zipf’s law. Figure 6 illustrates the bounding box size distribution. A large portion of people take up the full height of the frame. However, there are still many boxes with smaller sizes. The variability can be explained by both zoom level and pose. For example, boxes with the label “enter” show the typical pedestrian aspect ratio of 1:2, with an average width of 30% of the image width and an average height of 72%. On the other hand, boxes labeled “lie/sleep” are close to square, with average widths of 58% and heights of 67%. The box widths are widely distributed, showing the variety of poses people undertake to execute the labeled actions.

There are multiple labels for the majority of person bounding boxes. All bounding boxes have one pose label, 28% of bounding boxes have at least 1 person-object interaction label, and 67% of them have at least 1 person-person interaction label.

4.3. Temporal Structure

A key characteristic of AVA is the rich temporal structure that evolves from segment to segment. Since we have linked people between segments, we can discover common consecutive actions by looking at pairs of actions performed by the same person.

Figure 6. Size and aspect ratio variations of annotated bounding boxes in the AVA dataset. Note that our bounding boxes consist of a large variation of sizes, many of which are small and hard to detect. Large variation also applies to the aspect ratios of bounding boxes, with mode at 2:1 ratio (e.g., sitting pose).

We sort pairs by Normalized Pointwise Mutual Information (NPMI) [8], which is commonly used in linguistics to represent the co-occurrence between two words:

NPMI(x, y) = ln( p(x, y) / (p(x) p(y)) ) / ( −ln p(x, y) ).

Values intuitively fall in the range (−1, 1], with −1 for pairs of words that never co-occur, 0 for independent pairs, and 1 for pairs that always co-occur.
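To make the statistic concrete, the sketch below computes NPMI from raw pair counts. The toy counts are invented for illustration; this is not the analysis code behind the tables that follow.

```python
# Minimal NPMI illustration over toy counts of (action at t, action at t+1)
# pairs for the same person; the numbers are made up for the example.
import math
from collections import Counter

def npmi(pair_counts, first_counts, second_counts, total, x, y):
    """NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y))."""
    p_xy = pair_counts[(x, y)] / total
    if p_xy == 0:
        return -1.0                                  # never co-occur
    p_x = first_counts[x] / total
    p_y = second_counts[y] / total
    return math.log(p_xy / (p_x * p_y)) / (-math.log(p_xy))

pairs = Counter({("fall down", "lie/sleep"): 40, ("fall down", "stand"): 10,
                 ("stand", "lie/sleep"): 5, ("stand", "stand"): 945})
first = Counter({"fall down": 50, "stand": 950})     # marginals of the first action
second = Counter({"lie/sleep": 45, "stand": 955})    # marginals of the second action
print(round(npmi(pairs, first, second, 1000, "fall down", "lie/sleep"), 2))   # ~0.89
```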

Table 1 shows pairs of actions with top NPMI in consecutive one-second segments for the same person. After removing identity transitions, some interesting common-sense temporal patterns arise. Frequently, there are transitions from “look at phone” → “answer phone”, “fall down” → “lie”, or “listen to” → “talk to”. We also analyze inter-person action pairs. Table 2 shows top pairs of actions performed at the same time, but by different people. Several meaningful pairs emerge, such as “ride” ↔ “drive”, “play music” ↔ “listen”, or “take” ↔ “give/serve”. The transitions between atomic actions, despite the relatively coarse temporal sampling, provide excellent data for building more complex models of actions and activities with longer temporal structure.

5. Action Localization Model

Performance numbers on popular action recognition datasets such as UCF101 or JHMDB have gone up considerably in recent years, but we believe that this may present an artificially rosy picture of the state of the art. When the video clip involves only a single person performing something visually characteristic, like swimming in an equally characteristic background scene, it is easy to classify accurately.


First Action                   Second Action                 NPMI
ride (e.g. bike/car/horse)     drive (e.g. car/truck)        0.68
watch (e.g. TV)                work on a computer            0.64
drive (e.g. car/truck)         ride (e.g. bike/car/horse)    0.63
open (e.g. window/door)        close (e.g. door/box)         0.59
text on/look at a cellphone    answer phone                  0.53
listen to (person)             talk to (person)              0.47
fall down                      lie/sleep                     0.46
talk to (person)               listen to (person)            0.43
stand                          sit                           0.40
walk                           stand                         0.40

Table 1. Top pairs of consecutive actions that are likely to happen one after the other for the same person, sorted by NPMI.

Person 1 Action                Person 2 Action               NPMI
ride (e.g. bike/car/horse)     drive (e.g. car/truck)        0.60
play musical instrument        listen (e.g. music)           0.57
take (object)                  give/serve (object)           0.51
talk to (person)               listen to (person)            0.46
stand                          sit                           0.31
play musical instrument        dance                         0.23
walk                           stand                         0.21
watch (person)                 write                         0.15
walk                           run/jog                       0.15
fight/hit (a person)           stand                         0.14

Table 2. Top pairs of simultaneous actions performed by different people, sorted by NPMI.

Difficulties arise when there are multiple actors, when they are small in image size or performing actions that are only subtly different, and when the background scene is not enough to tell us what is going on. AVA has these aspects galore, and we will find that performance on AVA is much poorer as a result. Indeed, this finding was foreshadowed by the poor performance on the Charades dataset [37].

To prove our point, we develop a state-of-the-art action localization approach inspired by recent approaches for spatio-temporal action localization that operate on multi-frame temporal information [16, 41]. Here, we rely on the larger temporal context provided by I3D [6] for action detection. See Fig. 7 for an overview of our approach.

Following Peng and Schmid [30], we apply the Faster RCNN algorithm [31] for end-to-end localization and classification of actions. However, in their approach, the temporal information is lost at the first layer, where input channels from multiple frames are concatenated over time. We propose to use the Inception 3D (I3D) architecture by Carreira and Zisserman [6] to model temporal context. The I3D architecture is designed based on the Inception architecture [40], but replaces 2D convolutions with 3D convolutions. Temporal information is kept throughout the network. I3D achieves state-of-the-art performance on a wide range of video classification benchmarks.

Figure 7. Illustration of our approach for spatio-temporal action localization. Region proposals are detected and regressed with Faster-RCNN on RGB keyframes. Spatio-temporal tubes are classified with two-stream I3D convolutions.

To use I3D with Faster RCNN, we make the following changes to the model: first, we feed input frames of length T to the I3D model, and extract 3D feature maps of size T′ × W′ × H′ × C at the Mixed 4e layer of the network. The output feature map at Mixed 4e has a stride of 16, which is equivalent to the conv4 block of ResNet [14]. Second, for action proposal generation, we use a 2D ResNet-50 model on the keyframe as the input for the region proposal network, avoiding the impact of I3D with different input lengths on the quality of generated action proposals. Finally, we extend ROI Pooling to 3D by applying the 2D ROI Pooling at the same spatial location over all time steps. To understand the impact of optical flow for action detection, we fuse the RGB stream and the optical flow stream at the feature map level using average pooling.

Baseline. To compare to a frame-based two-stream approach on AVA, we implement a variant of [30]. We use Faster RCNN [31] with ResNet-50 [14] to jointly learn action proposals and action labels. Region proposals are obtained with the RGB stream only. The region classifier takes as input RGB along with optical flow features stacked over 5 consecutive frames. As for our I3D approach, we jointly train the RGB and the optical flow streams by fusing the conv4 feature maps with average pooling.

Implementation details. We use FlowNet v2 [19] to extract optical flow features. We train Faster-RCNN with asynchronous SGD. For all training tasks, we use a validation set to determine the number of training steps, which ranges from 600K to 1M iterations. We fix the input resolution to be 320 by 400 pixels. All the other model parameters are set based on the recommended values from [17], which were tuned for object detection. The ResNet-50 networks are initialized with ImageNet pre-trained models. For the optical flow stream, we duplicate the conv1 filters to take 5 frames as input. The I3D networks are initialized with Kinetics [22] pre-trained models, for both the RGB and optical flow streams. Note that although I3D was pre-trained on 64-frame inputs, the network is fully convolutional over time and can take any number of frames as input. All feature layers are jointly updated during training. The output frame-level detections are post-processed with non-maximum suppression with a threshold of 0.6.
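The temporal ROI pooling and the stream fusion described above can be sketched as follows. This is a schematic numpy illustration, not the actual model code: the feature-map shapes, the 7×7 output grid, and max pooling inside each bin are assumptions made for the example.

```python
# Schematic numpy sketch of extending 2D ROI pooling over time: the same
# spatial ROI is pooled at every time step of a T' x H' x W' x C feature map,
# then averaged over time. Shapes and bin pooling are illustrative assumptions.
import numpy as np

def roi_pool_2d(fmap_hwc, box, out_size=7):
    """Pool one ROI of an (H, W, C) map into an (out_size, out_size, C) grid."""
    y0, x0, y1, x1 = box                               # ROI in feature-map coordinates
    roi = fmap_hwc[y0:y1, x0:x1, :]
    h_bins = np.array_split(np.arange(roi.shape[0]), out_size)
    w_bins = np.array_split(np.arange(roi.shape[1]), out_size)
    out = np.zeros((out_size, out_size, roi.shape[2]), dtype=roi.dtype)
    for i, hb in enumerate(h_bins):
        for j, wb in enumerate(w_bins):
            out[i, j] = roi[hb][:, wb].max(axis=(0, 1))   # max pool within each bin
    return out

def roi_pool_3d(fmap_thwc, box, out_size=7):
    """Apply 2D ROI pooling at every time step, then average over time."""
    pooled = np.stack([roi_pool_2d(f, box, out_size) for f in fmap_thwc])
    return pooled.mean(axis=0)                         # (out_size, out_size, C)

rgb_feat = np.random.rand(8, 25, 32, 512)              # T' x H' x W' x C from RGB I3D
flow_feat = np.random.rand(8, 25, 32, 512)             # T' x H' x W' x C from flow I3D
fused = 0.5 * (rgb_feat + flow_feat)                   # average-pool the two streams
region_feature = roi_pool_3d(fused, box=(4, 6, 20, 18))  # feeds the classifier head
```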

One key difference between AVA and existing action detection datasets is that the action labels of AVA are not mutually exclusive. To address this, we replace the standard softmax loss function with a sum of binary sigmoid losses, one for each class. We use the sigmoid loss for AVA and the softmax loss for all other datasets.
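For concreteness, the sketch below spells out one common formulation of such a per-class sigmoid objective in plain numpy; the exact reduction used in training (sum over classes, mean over boxes) is an assumption for the example.

```python
# Per-class binary sigmoid cross-entropy for multi-label boxes (numerically
# stable "with logits" form); the sum/mean reduction is an illustrative choice.
import numpy as np

def multilabel_sigmoid_loss(logits, targets):
    """logits, targets: (num_boxes, num_classes); targets are 0/1 and a row may
    contain several 1s, since AVA action labels are not mutually exclusive."""
    per_class = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(per_class.sum(axis=1).mean())

logits = np.array([[2.0, -1.0, 0.5]])    # one box, three classes
targets = np.array([[1.0, 0.0, 1.0]])    # e.g. a pose label plus an interaction label
print(multilabel_sigmoid_loss(logits, targets))
```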


Linking. Once we have frame-level detections, we link them to construct action tubes. We report video-level performance based on average scores over the obtained tubes. We use the same linking algorithm as described in [38], except that we do not apply temporal labeling. Since AVA is annotated at 1 Hz and each tube may have multiple labels, we modify the video-level evaluation protocol to estimate an upper bound. We use ground truth links to infer detection links, and when computing the IoU score of a class between a ground truth tube and a detection tube, we only take into account tube segments that are labeled with that class.

6. Experiments and Analysis

We now experimentally analyze key characteristics of AVA and motivate challenges for action understanding.

6.1. Datasets and Metrics

AVA benchmark. Since the label distribution in AVA roughly follows Zipf’s law (Figure 5) and evaluation on a very small number of examples could be unreliable, we use classes that have at least 25 instances in the validation and test splits to benchmark performance. Our resulting benchmark consists of a total of 210,634 training, 57,371 validation and 117,441 test examples on 60 classes. Unless otherwise mentioned, we report results trained on the training set and evaluated on the validation set. We randomly select 10% of the training data for model parameter tuning.

Datasets. Besides AVA, we also analyze standard video datasets in order to compare difficulty. JHMDB [20] consists of 928 trimmed clips over 21 classes. We report results for split one in our ablation study, but results are averaged over three splits for comparison to the state of the art. For UCF101, we use spatio-temporal annotations for a 24-class subset with 3207 videos, provided by Singh et al. [38]. We conduct experiments on the official split1 as is standard.

Metrics. For evaluation, we follow standard practice when possible. We report intersection-over-union (IoU) performance at the frame level and the video level. For frame-level IoU, we follow the standard protocol used by the PASCAL VOC challenge [9] and report the average precision (AP) using an IoU threshold of 0.5. For each class, we compute the average precision and report the average over all classes. For video-level IoU, we compute 3D IoUs between ground truth tubes and linked detection tubes at a threshold of 0.5. The mean AP is computed by averaging over all classes.
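The frame-level matching criterion can be illustrated with a short IoU helper; this is a generic sketch of the PASCAL-style 0.5-IoU test, not the evaluation code released with the benchmark.

```python
# Minimal sketch of the frame-level matching criterion: a detection counts as
# correct for a class if its box overlaps a ground-truth box of that class
# with IoU >= 0.5 (PASCAL VOC protocol).
def box_iou(a, b):
    """a, b: (x0, y0, x1, y1) boxes; returns intersection-over-union."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333..., below the 0.5 threshold
```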

6.2. Comparison to the state-of-the-art

Table 3 shows our model performance on two standard video datasets.

Frame-mAP           JHMDB    UCF101-24
Actionness [42]     39.9%    -
Peng w/o MR [30]    56.9%    64.8%
Peng w/ MR [30]     58.5%    65.7%
ACT [41]            65.7%    69.5%
Our approach        73.3%    76.3%

Video-mAP           JHMDB    UCF101-24
Peng w/ MR [30]     73.1%    35.9%
Singh et al. [38]   72.0%    46.3%
ACT [41]            73.7%    51.4%
TCNN [16]           76.9%    -
Our approach        78.6%    59.9%

Table 3. Frame-mAP (top) and video-mAP (bottom) @ IoU 0.5 for JHMDB and UCF101-24. For JHMDB, we report performance averaged over three splits. Our approach outperforms the previous state of the art on both metrics by a considerable margin.

Our 3D two-stream model obtains state-of-the-art performance on UCF101 and JHMDB, outperforming well-established baselines on both the frame-mAP and video-mAP metrics.

However, the picture is less auspicious when recognizing atomic actions. Table 4 shows that the same model obtains relatively low performance on the AVA validation set (frame-mAP of 15.6%, video-mAP of 12.3% at 0.5 IoU and 17.9% at 0.2 IoU), as well as on the test set (frame-mAP of 14.7%). We attribute this to the design principles behind AVA: we collected a vocabulary where context and object cues are not as discriminative for action recognition. Instead, recognizing fine-grained details and rich temporal models may be needed to succeed at AVA, posing a new challenge for visual action recognition. In the remainder of this paper, we analyze what makes AVA challenging and discuss how to move forward.

6.3. Ablation study

How important is temporal information for recognizing AVA categories? Table 4 shows the impact of the temporal length and the type of model. All 3D models outperform the 2D baseline on JHMDB and UCF101-24. For AVA, 3D models perform better once more than 10 frames are used. We can also see that increasing the length of the temporal window helps for the 3D two-stream models across all datasets. As expected, combining RGB and optical flow features improves the performance over a single input modality. Moreover, AVA benefits more from larger temporal context than JHMDB and UCF101, whose performances saturate at 20 frames. This gain and the consecutive actions in Table 1 suggest that one may obtain further gains by leveraging the rich temporal context in AVA.

How challenging is localization versus recognition? Table 5 compares the performance of end-to-end action localization and recognition versus class-agnostic action localization.


Figure 8. Top: We plot the performance of models for each action class, sorting by the number of training examples. Bottom: We plot the number of training examples per class. While more data is better, the outliers suggest that not all classes are of equal complexity. For example, one of the smallest classes, “swim”, has one of the highest performances because the associated scenes make it relatively easy.

Model   Temp. + Mode       JHMDB    UCF101-24   AVA
2D      1 RGB + 5 Flow     52.1%    60.1%       13.7%
3D      5 RGB + 5 Flow     67.9%    76.1%       13.6%
3D      10 RGB + 10 Flow   73.4%    78.0%       14.6%
3D      20 RGB + 20 Flow   76.4%    78.3%       15.2%
3D      40 RGB + 40 Flow   76.7%    76.0%       15.6%
3D      50 RGB + 50 Flow   -        73.2%       15.5%
3D      20 RGB             73.2%    77.0%       14.5%
3D      20 Flow            67.0%    71.3%       9.9%

Table 4. Frame-mAP @ IoU 0.5 for action detection on JHMDB (split1), UCF101 (split1) and AVA. Note that JHMDB has up to 40 frames per clip. For UCF101-24, we randomly sample a 20,000-frame subset for evaluation. Although our model obtains state-of-the-art performance on JHMDB and UCF101-24, the fine-grained nature of AVA makes it a challenge.

                   JHMDB    UCF101-24   AVA
Action detection   76.7%    76.3%       15.6%
Actor detection    92.8%    84.8%       75.3%

Table 5. Frame-mAP @ IoU 0.5 for action detection and actor detection on the JHMDB (split1), UCF101-24 (split1) and AVA benchmarks. Since human annotators are consistent, our results suggest there is significant headroom to improve on recognizing atomic visual actions.

We can see that although action localization is more challenging on AVA than on JHMDB, the gap between localization and end-to-end detection performance is nearly 60% on AVA, while less than 15% on JHMDB and UCF101. This suggests that the main difficulty of AVA lies in action classification rather than localization. Figure 9 shows examples of high-scoring false alarms, suggesting that the difficulty in recognition lies in the fine-grained details.

Which categories are challenging? How important is the number of training examples? Figure 8 breaks down performance by category and by the number of training examples. While more data generally yields better performance, the outliers reveal that not all categories are of equal complexity. Categories correlated with scenes and objects (such as swimming) or categories with low diversity (such as fall down) obtain high performance despite having fewer training examples.

Figure 9. Red boxes show high-scoring false alarms for smoking. The model often struggles to discriminate fine-grained details.

In contrast, categories with lots of data, such as touching and smoking, obtain relatively low performance, possibly because they have large visual variations or require fine-grained discrimination, motivating work on person-object interaction [7, 12]. We hypothesize that gains on recognizing atomic actions will require not only large datasets, such as AVA, but also rich models of motion and interactions.

7. Conclusion

This paper introduces the AVA dataset with spatio-temporal annotations of atomic actions at 1 Hz over diverse 15-min. movie segments. In addition, we propose a method that outperforms the current state of the art on standard benchmarks to serve as a baseline. This method highlights the difficulty of the AVA dataset, as its performance is significantly lower than on UCF101 or JHMDB, underscoring the need for developing new action recognition approaches.

Future work includes modeling more complex activities based on our atomic actions. Our present-day visual classification technology may enable us to classify events such as “eating in a restaurant” at the coarse scene/video level, but models based on AVA’s fine spatio-temporal granularity facilitate understanding at the level of an individual agent’s actions. These are essential steps towards imbuing computers with “social visual intelligence” – understanding what humans are doing, what they might do next, and what they are trying to achieve.

Acknowledgement. We thank Abhinav Gupta, Abhinav Shrivastava, Andrew Gallagher, Irfan Essa, and Vicky Kalogeiton for discussion and comments about this work.


References

[1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675, 2016.
[2] D. Arijon. Grammar of the film language. Silman-James Press, 1991.
[3] R. Barker and H. Wright. Midwest and its children: The psychological ecology of an American town. Row, Peterson and Company, 1954.
[4] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005.
[5] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[7] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
[8] K.-W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 1990.
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge: A retrospective. IJCV, 2015.
[10] Geena Davis Institute on Gender in Media. The Reel Truth: Women Aren’t Seen or Heard. https://seejane.org/research-informs-empowers/data/, 2016.
[11] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
[12] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Frund, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
[13] S. Gupta and J. Malik. Visual semantic role labeling. CoRR, abs/1505.04474, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] G. V. Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv:1709.01450, 2017.
[16] R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In ICCV, 2017.
[17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[18] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The THUMOS challenge on action recognition for videos “in the wild”. CVIU, 2017.
[19] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[20] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. Black. Towards understanding action recognition. In ICCV, 2013.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. arXiv:1705.06950, 2017.
[23] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In ICCV, 2005.
[24] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[25] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.
[26] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[27] P. Mettes, J. van Gemert, and C. Snoek. Spot On: Action localization from pointly-supervised proposals. In ECCV, 2016.
[28] M. Monfort, B. Zhou, S. A. Bargal, T. Yan, A. Andonian, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick, et al. Moments in Time dataset: one million videos for event understanding.
[29] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. Smeaton, and G. Quenot. TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics, 2014.
[30] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In ECCV, 2016.
[31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[32] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[33] S. Saha, G. Singh, and F. Cuzzolin. AMTnet: Action-micro-tube regression by end-to-end trainable deep architecture. In ICCV, 2017.
[34] S. Saha, G. Singh, M. Sapienza, P. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In BMVC, 2016.
[35] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[36] G. Sigurdsson, O. Russakovsky, A. Farhadi, I. Laptev, and A. Gupta. Much ado about time: Exhaustive annotation of temporal data. In Conference on Human Computation and Crowdsourcing, 2016.
[37] G. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
[38] G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, 2017.
[39] K. Soomro, A. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, University of Central Florida, 2012.
[40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
[41] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.
[42] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. In CVPR, 2016.
[43] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In ICCV, 2015.
[44] P. Weinzaepfel, X. Martin, and C. Schmid. Towards weakly-supervised action localization. arXiv:1605.05197, 2016.
[45] L. Wu, C. Shen, and A. van den Hengel. PersonNet: Person re-identification with deep convolutional neural networks. arXiv:1601.07255, 2016.
[46] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. IJCV, 2017.
[47] J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In CVPR, 2009.
[48] H. Zhao, Z. Yan, H. Wang, L. Torresani, and A. Torralba. SLAC: A sparsely labeled dataset for action classification and localization. arXiv:1712.09374, 2017.
[49] M. Zolfaghari, G. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In ICCV, 2017.

Appendix

In the following, we present additional quantitative information and examples for our AVA dataset as well as for our action detection approach on AVA.

8. Additional details on the annotation

Figure 10 shows the user interface for bounding box annotation. As described in Section 3.3, we employ a hybrid approach to trade off accuracy against annotation cost. We show annotators frames overlaid with detected person bounding boxes, so they can add boxes for persons missed by the detector.

Figure 10. User interface for bounding box annotation. The purple box was generated by the person detector. The orange box (missed by the detector) was manually added by an annotator.

In Section 3.5, we explain why our two-stage action annotation design is crucial for preserving high recall of action classes. Here we show a quantitative analysis. Figure 11 shows the proportion of labels per action class generated from each stage (blue for the first (propose) stage, red for the second (verify) stage). As we can see, for more than half of our action classes, the majority of labels are derived from the verification stage. Furthermore, the smaller the action class, the more likely its instances are missed by the first stage (e.g., kick, exit, extract) and require the second stage to boost recall. The second stage helps us to build more robust models for long-tail classes, which are more sensitive to the size of the training data.

9. Additional details on the dataset

Tables 6 and 7 present the number of instances for each class of the AVA trainval dataset. We observe a significant class imbalance, as is to be expected in real-world data (cf. Zipf’s law).


Figure 11. Action class recall improvement due to the two-stage process. For each class, the blue bar shows the proportion of labels annotated without verification (majority-voted results over raters’ selections from 80 classes), and the red bar shows the proportion of labels recovered in the verification stage. More than half of the action classes double their recall thanks to the additional verification.

As stated in the paper, we select a subset of these classes (those without asterisks) for our benchmarking experiments, in order to have a sufficient number of test examples. Note that we consider the presence of the “rare” classes as an opportunity for approaches to learn from a few training examples.

Figure 12 shows more examples of common consecutive atomic actions in AVA.

10. Examples of our action detection

Figures 13 and 14 show the top true positives and false alarms returned by our best Faster-RCNN with I3D model.


Pose                                      #
stand                                     208332
sit                                       127917
walk                                      52644
bend/bow (at the waist)                   10558
lie/sleep                                 7176
dance                                     4672
run/jog                                   4293
crouch/kneel                              3125
martial art                               2520
get up                                    1439
fall down                                 378
jump/leap                                 313
crawl*                                    169
swim                                      141

Person-Person Interaction                 #
watch (a person)                          202369
talk to (e.g. self/person)                136780
listen to (a person)                      132174
fight/hit (a person)                      3069
grab (a person)                           2590
sing to (e.g., self, a person, a group)   2107
hand clap                                 1539
hug (a person)                            1472
give/serve (an object) to (a person)      1337
kiss (a person)                           898
take (an object) from (a person)          783
hand shake                                719
lift (a person)                           481
push (another person)                     434
hand wave                                 430
play with kids*                           151
kick (a person)*                          60

Table 6. Number of instances for pose (top) and person-person interaction (bottom) labels in the AVA trainval dataset, sorted in decreasing order. Labels marked by asterisks are not included in the benchmark dataset.

Person-Object Interaction                 #
carry/hold (an object)                    100598
touch (an object)                         21099
ride (e.g., a bike, a car, a horse)       6594
answer phone                              4351
eat                                       3889
smoke                                     3528
read                                      2730
drink                                     2681
watch (e.g., TV)                          2105
play musical instrument                   2063
drive (e.g., a car, a truck)              1728
open (e.g., a window, a car door)         1547
write                                     1014
close (e.g., a door, a box)               986
listen (e.g., to music)                   950
sail boat                                 699
put down                                  653
lift/pick up                              634
text on/look at a cellphone               517
push (an object)                          465
pull (an object)                          460
dress/put on clothing                     420
throw                                     336
climb (e.g., a mountain)                  315
work on a computer                        278
enter                                     271
shoot                                     267
hit (an object)                           220
take a photo                              213
cut                                       212
turn (e.g., a screwdriver)                167
play with pets*                           146
point to (an object)                      128
play board game*                          127
press*                                    102
catch (an object)*                        97
fishing*                                  88
cook*                                     79
paint*                                    79
shovel*                                   79
row boat*                                 77
dig*                                      72
stir*                                     71
clink glass*                              67
exit*                                     65
chop*                                     47
kick (an object)*                         40
brush teeth*                              21
extract*                                  13

Table 7. Number of instances for person-object interaction labels in the AVA trainval dataset, sorted in decreasing order. Labels marked by asterisks are not included in the benchmark.


Figure 12. We show more examples of how atomic actions change over time in AVA: answer (e.g. phone) → look at (e.g. phone); answer (e.g. phone) → put down; clink glass → drink; crouch/kneel → crawl; grab → handshake; grab → hug; open → close. The text shows pairs of atomic actions for the people in red bounding boxes.


Figure 13. Most confident action detections on AVA for cut, throw, hand clap, and work on a computer. True positives are in green, false alarms in red.


Figure 14. Most confident action detections on AVA for open (e.g. window, door), get up, smoke, and take (something) from (someone). True positives are in green, false alarms in red.