Action recognition in videos – Cordelia Schmid, INRIA Grenoble

Upload: zukun

Posted on 10-May-2015


TRANSCRIPT

Page 1

Action recognition in videos

Cordelia Schmid

INRIA Grenoble

Page 2

Action recognition - problem

• Short actions, e.g. drinking, sitting down

Coffee & Cigarettes dataset / Hollywood dataset

Page 3

Action recognition - problem

• Short actions, e.g. drinking, sitting down

• Activities/events, e.g. making a sandwich, depositing a suspicious object

TRECVID Multimedia Event Detection

Page 4

TRECVID - Multimedia Event Detection

Attempting a board trick / Feeding an animal

Wedding ceremony / Getting a vehicle unstuck

Page 5

Action recognition

• Action recognition is person-centric

• Vision is person-centric: we mostly care about things that are important

[Figure: example frames from Movies, TV, YouTube – Source: I. Laptev]

Page 6

Action recognition

• Action recognition is person-centric

• Vision is person-centric: we mostly care about things that are important

[Figure: Movies 40%, TV 35%, YouTube 34% – Source: I. Laptev]

Page 7

Action recognition from still images

• Description of the human pose

– Silhouette description [Sullivan & Carlsson, 2002]

– Histogram of gradients (HOG) [Dalal & Triggs, 2005]

– Human body part layout [Felzenszwalb & Huttenlocher, 2000]
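The HOG idea mentioned above can be sketched in a few lines: compute image gradients, then accumulate a magnitude-weighted orientation histogram per cell. This is a simplified illustration only, assuming a grayscale NumPy image; it omits the block normalisation and exact parameters of Dalal & Triggs, and the function name and defaults are my own.

```python
import numpy as np

def hog_descriptor(image, cell=8, bins=9):
    """Minimal histogram-of-gradients sketch (Dalal & Triggs style).

    `image` is a 2-D grayscale float array; cell size and bin count
    are illustrative defaults, not the exact paper settings.
    """
    # Image gradients via central differences.
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)                      # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation in [0, 180)

    h, w = image.shape
    ny, nx = h // cell, w // cell
    hist = np.zeros((ny, nx, bins))
    bin_width = 180.0 / bins
    for i in range(ny):
        for j in range(nx):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a / bin_width).astype(int), bins - 1)
            # Magnitude-weighted orientation histogram for this cell.
            np.add.at(hist[i, j], idx, m)
    # Global L2 normalisation and flattening (per-block normalisation omitted).
    hist /= np.linalg.norm(hist) + 1e-9
    return hist.ravel()
```

For a 32×32 image with these defaults this yields a 4×4 grid of 9-bin histograms, i.e. a 144-dimensional descriptor.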

Page 8

Action recognition from still images

• Supervised modeling interaction between human & object [Gupta et al. 2009, Yao & Fei-Fei 2009]

• Weakly-supervised learning of objects [Prest, Schmid & Ferrari 2011]

Results on the PASCAL VOC 2010 human action classification dataset

Page 9

Importance of action objects

• Human pose often not sufficient by itself

• Objects define the actions

Page 10

Importance of temporal information

• Video/temporal information necessary to disambiguate actions

• Temporal context describes the action/activity

• Key frames provide significantly less information

Page 11

Action recognition in videos

• Temporal information makes it possible to stabilize human and object detection by tracking – J. Malik: tracking by detection is difficult?

• Large amounts of data, growing very fast – H. Sawhney: large amounts of data, often not well explored

• Often comes with some form of supervision (scripts, subtitles) – similar in spirit to M. Hebert’s comment on the large amount of data collected by a robot

Page 12

Action recognition in videos

Motion history image [Bobick & Davis, 2001]

Spatial motion descriptor [Efros et al. ICCV 2003]

Learning dynamic prior [Blake et al. 1998]

Sign language recognition [Zisserman et al. 2009]
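Of the methods listed here, the motion history image has a particularly compact update rule: moving pixels are stamped with a timestamp value and everything else decays. A minimal NumPy sketch of that rule, after Bobick & Davis (2001); the function name and parameter values are illustrative, not theirs:

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=0.1):
    """Motion history image sketch (after Bobick & Davis, 2001).

    `frames`: list of 2-D grayscale arrays. At each step, pixels whose
    frame difference exceeds `threshold` are set to `tau`; everywhere
    else the history decays by 1 (floored at 0).
    """
    mhi = np.zeros_like(frames[0], dtype=float)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur.astype(float) - prev.astype(float)) > threshold
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi / tau  # recent motion is bright, older motion fades
```

The resulting single image encodes where and, through its fading intensity, roughly when motion occurred, which is why it works as a whole-clip action template.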

Page 13

Action recognition in videos

• Bag of space-time features [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]

Collection of space-time patches → HOG & HOF patch descriptors → Histogram of visual words → SVM classifier
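This bag-of-features pipeline (quantize local descriptors into a vocabulary, build per-video histograms, train an SVM) can be sketched with scikit-learn. In this toy sketch, random 64-D vectors stand in for the HOG/HOF patch descriptors; the vocabulary size, kernel, and all other settings are illustrative assumptions, not the values used in the cited work.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hypothetical stand-in for HOG/HOF patch descriptors: one array of
# 100 64-D descriptors per "video", shifted by class label so the
# two classes are separable.
rng = np.random.default_rng(0)
train_videos = [rng.normal(size=(100, 64)) + label
                for label in (0, 1) for _ in range(5)]
labels = [0] * 5 + [1] * 5

# 1. Build a visual vocabulary by clustering all training descriptors.
vocab = KMeans(n_clusters=20, n_init=5, random_state=0)
vocab.fit(np.vstack(train_videos))

# 2. Represent each video as a normalised histogram of visual words.
def bow_histogram(descriptors):
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(v) for v in train_videos])

# 3. Train an SVM classifier on the bag-of-words histograms.
clf = SVC(kernel="rbf").fit(X, labels)
```

The same three steps apply unchanged whether the local features come from images or from space-time patches, which is part of why this representation is such a common baseline.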

Page 14

Action recognition in videos

• Bag of space-time features

– Many recent extensions: new features / tracklets, temporal structuring, etc.

– Advantages
• Very useful as a baseline
• Captures spatial and temporal context – see Efros’ comment on image classification

– Disadvantages
• No interpretation of the action
• Not sufficient for localization & description

Page 15

Action recognition in videos

• Localization by 3D HOG/HOF, interaction with objects

Tracking by detection and tracking

Space-time description

Interaction with objects
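The tracking-by-detection step above can be illustrated with a toy greedy scheme: run a detector per frame, then link each track to the best-overlapping detection in the next frame. This is a deliberately minimal sketch under my own simplifications (hypothetical function names, plain IoU association); real systems add appearance models, motion prediction, and the part-based detectors discussed on these slides.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def track_by_detection(frames_detections, min_iou=0.3):
    """Greedy IoU association: link per-frame detections into tracks.

    `frames_detections`: list over frames, each a list of boxes.
    Returns tracks as lists of (frame_index, box) pairs.
    """
    tracks = []
    for t, detections in enumerate(frames_detections):
        unmatched = list(detections)
        for track in tracks:
            last_frame, last_box = track[-1]
            # Only extend tracks that were alive in the previous frame.
            if last_frame != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= min_iou:
                track.append((t, best))
                unmatched.remove(best)
        # Every detection left unmatched starts a new track.
        tracks.extend([(t, b)] for b in unmatched)
    return tracks
```

For example, a box drifting one pixel per frame is linked into a single track, while a detection appearing far away starts a new one.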

Page 16

Page 17

Action recognition in videos

• HOG 3D tracks + description

– Detection & tracking of humans with part-based models works well

• move towards more flexible models, similar to P. Felzenszwalb’s

• integrate motion information

• very good baseline

– Towards more flexible descriptions based on body parts; a lot of recent work on finding human body parts

– Interaction with object important, but hard

Page 18

Discussion

• Need for more challenging datasets

– Need for realistic datasets
– Scale up the number of classes (today ~10 actions per dataset)
– Increase the number of examples per class, possibly with weakly supervised learning (the number of examples per video is low)
– Define a taxonomy, use redundancy between action classes to improve training
– Manual exhaustive labeling of all actions is impossible

KTH dataset / Hollywood dataset

Page 19

Discussion

• Make better use of the large amount of information inherent in videos

– automatic collection of additional examples
– improve models incrementally
– use weak labels from associated data (text, sound, subtitles)

• Many existing techniques are straightforward extensions of methods for images

– almost no use of 3D information
– learn better interaction and temporal models
– design activity models by decomposition into simple actions