TRANSCRIPT
Action recognition in videos
Cordelia Schmid
INRIA Grenoble
Action recognition - problem
• Short actions, e.g. drinking, sitting down
(Coffee & Cigarettes dataset, Hollywood dataset)
• Activities/events, e.g. making a sandwich, depositing a suspicious object
(TRECVID Multimedia Event Detection)
TRECVID - Multimedia Event Detection
Attempting a board trick | Feeding an animal | Wedding ceremony | Getting a vehicle unstuck
Action recognition
• Action recognition is person-centric
• Vision is person-centric: we mostly care about things which are important
[Figure: statistics for Movies, TV, YouTube — 40%, 35%, 34%. Source: I. Laptev]
Action recognition from still images
• Description of the human pose
– Silhouette description [Sullivan & Carlsson, 2002]
– Histogram of gradients (HOG) [Dalal & Triggs, 2005]
– Human body part layout [Felzenszwalb & Huttenlocher, 2000]
Action recognition from still images
• Supervised modeling interaction between human & object [Gupta et al. 2009, Yao & Fei-Fei 2009]
• Weakly-supervised learning of objects [Prest, Schmid & Ferrari 2011]
Results on the PASCAL VOC 2010 human action classification dataset
Importance of action objects
• Human pose often not sufficient by itself
• Objects define the actions
Importance of temporal information
• Video/temporal information necessary to disambiguate actions
• Temporal context describes the action/activity
• Key frames provide significantly less information
Action recognition in videos
• Temporal information makes it possible to stabilize human and object detection by tracking – J. Malik: tracking by detection is difficult?
• Large amount of data, growing very fast – H. Sawhney: large amounts of data, not often well explored
• Often comes with some form of supervision (scripts, subtitles) – similar in spirit to M. Hebert's comment on the large amount of data collected by a robot
Action recognition in videos
Motion history image [Bobick & Davis, 2001]
Spatial motion descriptor [Efros et al., ICCV 2003]
Learning dynamic prior [Blake et al., 1998]
Sign language recognition [Zisserman et al., 2009]
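The motion history image cited above is a simple, concrete representation: pixels where motion was just detected are stamped with a duration value, and older motion decays away. A minimal numpy sketch of this idea, with frame differencing as the motion detector (the window `tau` and the `threshold` are illustrative choices, not values from Bobick & Davis):

```python
import numpy as np

# Toy frame sequence: a bright square moving right (values in [0, 1]).
T, H, W = 5, 8, 8
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 3:5, t:t + 2] = 1.0

tau = 4          # temporal window in frames (assumption)
threshold = 0.1  # motion threshold for frame differencing (assumption)

mhi = np.zeros((H, W))
prev = frames[0]
for t in range(1, T):
    motion = np.abs(frames[t] - prev) > threshold
    # recent motion is stamped with tau; older motion decays by 1 per frame
    mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))
    prev = frames[t]

print(mhi.max())  # 4.0 — the most recent motion carries the largest value
```

The resulting single image encodes where motion occurred and, through the intensity ramp, in what order, which is what makes it usable as a template for short-action recognition.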
Action recognition in videos
• Bag of space-time features [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]
[Pipeline: collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → SVM classifier]
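The quantization step at the heart of this pipeline can be sketched in a few lines: descriptors are clustered into a visual vocabulary, then each video is represented by a normalized histogram of word assignments (the histogram would then feed an SVM). The descriptor dimensions, vocabulary size, and the random data below are all illustrative stand-ins for real HOG/HOF features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for HOG/HOF descriptors extracted at the space-time
# interest points of one video (sizes are illustrative).
descriptors = rng.normal(size=(200, 8))  # 200 patches, 8-D descriptors

def build_vocabulary(desc, k=5, iters=10):
    """Minimal k-means to build a visual vocabulary of k words."""
    centers = desc[rng.choice(len(desc), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        dist = np.linalg.norm(desc[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = desc[labels == j].mean(axis=0)
    return centers

def bow_histogram(desc, centers):
    """Quantize descriptors and build a normalized word histogram."""
    dist = np.linalg.norm(desc[:, None] - centers[None], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

vocab = build_vocabulary(descriptors)
hist = bow_histogram(descriptors, vocab)
print(hist.shape)  # (5,)
```

In practice the vocabulary is learned once over training videos, and each video's histogram becomes its fixed-length feature vector for the SVM.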
Action recognition in videos
• Bag of space-time features
– Many recent extensions: new features / tracklets, temporal structuring, etc.
– Advantages
• Very useful as a baseline
• Captures spatial and temporal context – see Efros' comment on image classification
– Disadvantages
• No interpretation of the action
• Not sufficient for localization & description
Action recognition in videos
• Localization by 3D HOG/HOF, interaction with objects
[Pipeline: tracking by detection → space-time description → interaction with objects]
Action recognition in videos
• HOG 3D tracks + description
– Detection & tracking of humans with part-based models works well
• move towards more flexible models, similar to P. Felzenszwalb's
• integrate motion information
• very good baseline
– Towards more flexible descriptions based on body parts; a lot of recent work on finding human body parts
– Interaction with objects is important, but hard
Discussion
• Need for more challenging datasets
– Need for realistic datasets
– Scale up the number of classes (today ~10 actions per dataset)
– Increase the number of examples per class, possibly with weakly supervised learning (the number of examples per video is low)
– Define a taxonomy; use redundancy between action classes to improve training
– Manual exhaustive labeling of all actions is impossible
(KTH dataset, Hollywood dataset)
Discussion
• Make better use of the large amount of information inherent in videos
– automatic collection of additional examples
– improve models incrementally
– use weak labels from associated data (text, sound, subtitles)
• Many existing techniques are straightforward extensions of methods for images
– almost no use of 3D information
– learn better interaction and temporal models
– design activity models by decomposition into simple actions