lecture on action recognition

Upload: sudhamsh-maddala

Post on 21-Feb-2018


  • 7/24/2019 Lecture on Action Recognition

    1/72

    Action recognition

    ECS734 Techniques in Computer Vision

    Ioannis Patras

    [email protected]

    Slides thanks to Hays, Hoiem, Grauman, Oikonomopoulos

  • 7/24/2019 Lecture on Action Recognition

    2/72

    Past lectures

    Recognition in static images

    Object recognition

    Image categorisation

  • 7/24/2019 Lecture on Action Recognition

    3/72

    Today's lecture

    Recognition in image sequences

    Action recognition (body gestures / facial expressions)

  • 7/24/2019 Lecture on Action Recognition

    4/72

    To come

    Recognition in image sequences

    Action recognition

    (body gestures / facial expressions)

    Tracking

    Structure from motion

    Surveillance

  • 7/24/2019 Lecture on Action Recognition

    5/72

    Today's lecture

    Introduction, applications

    Features

    Template classification methods

    Recognition using pose estimation and objects (in brief)

    Part-based action localisation

  • 7/24/2019 Lecture on Action Recognition

    6/72

    Today's lecture

    Introduction, applications

    Features

    Template classification methods

    Recognition using pose estimation and objects

    Part-based action localisation

  • 7/24/2019 Lecture on Action Recognition

    7/72

    What is an action?

    Action: a transition from one state to another

    Who is the actor? How is the state of the actor changing?

    What (if anything) is being acted on?

    How is that thing changing?

    What is the purpose of the action (if any)?

  • 7/24/2019 Lecture on Action Recognition

    8/72

    Human activity in video

    No universal terminology, but approximately:

    Actions: atomic motion patterns, often gesture-like, with a single clear-cut trajectory and a single nameable behavior (e.g., break eggs, lift spoon, kick, wave arms)

    Activity: a series or composition of actions (e.g., people having a conversation, interacting)

    Event: a combination of activities or actions (e.g., a football game, a traffic accident, cooking a meal)

    Adapted from Venu Govindaraju

  • 7/24/2019 Lecture on Action Recognition

    9/72

    How do we represent actions?

    Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping

    Poses

    Nouns and Predicates

  • 7/24/2019 Lecture on Action Recognition

    10/72

    Applications

    Human-Computer

    interfaces

    Augmented Reality

    Sports Analysis

    [ C. Sminchisescu, 2007 ]

  • 7/24/2019 Lecture on Action Recognition

    11/72

    Surveillance

    http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf

  • 7/24/2019 Lecture on Action Recognition

    12/72

    2011

    Interfaces

  • 7/24/2019 Lecture on Action Recognition

    13/72

  • 7/24/2019 Lecture on Action Recognition

    14/72

    How can we identify actions?

    Motion

    Pose

    Held objects

    Nearby objects

  • 7/24/2019 Lecture on Action Recognition

    15/72

    Today's lecture

    Introduction, applications

    Features

    Template classification methods

    Recognition using pose estimation and objects

    Part-based action localisation

  • 7/24/2019 Lecture on Action Recognition

    16/72

    Representing Motion

    Bobick & Davis, 2001

    Optical Flow with Motion History

    http://www.cse.ohio-state.edu/~jwdavis/Publications/pami01.pdf
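
    A minimal sketch of the motion-history idea, assuming the usual temporal-template update rule: a pixel where motion is detected is set to a maximum value tau, and every other pixel decays by one per frame. The frame-differencing threshold, the function names and the toy clip are illustrative assumptions, not the cited authors' code.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """One update step: moving pixels are set to tau, all others decay by 1."""
    decayed = np.maximum(mhi - 1, 0)
    return np.where(motion_mask, tau, decayed)

def motion_history(frames, tau=5, diff_threshold=10):
    """Build a motion-history image from a (T, H, W) grey-level clip.

    Motion is detected by thresholded frame differencing, which is an
    assumption made for this sketch.
    """
    frames = np.asarray(frames, dtype=np.int32)
    mhi = np.zeros(frames.shape[1:], dtype=np.int32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        motion_mask = np.abs(curr - prev) > diff_threshold
        mhi = update_mhi(mhi, motion_mask, tau)
    return mhi

# Toy clip: one bright pixel moving left to right along a 1x4 strip.
frames = np.zeros((4, 1, 4), dtype=np.uint8)
for t in range(4):
    frames[t, 0, t] = 255
mhi = motion_history(frames, tau=5)
```

    Larger values in `mhi` mark more recent motion, so the values along the strip increase in the direction of movement.
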
  • 7/24/2019 Lecture on Action Recognition

    17/72

    Representing Motion

    Efros et al. 2003

    Optical Flow with Split Channels

    http://graphics.cs.cmu.edu/people/efros/research/action/efros-iccv03.pdf
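
    A short sketch of what splitting optical flow into channels can look like, assuming the half-wave rectification scheme: the horizontal and vertical flow components are each separated into a positive and a negative part, giving four non-negative channels. The function name and toy flow values are assumptions; any smoothing applied to the channels afterwards is left out.

```python
import numpy as np

def split_flow_channels(flow_x, flow_y):
    """Half-wave rectify a flow field into four non-negative channels."""
    flow_x = np.asarray(flow_x, dtype=float)
    flow_y = np.asarray(flow_y, dtype=float)
    return np.stack([
        np.maximum(flow_x, 0.0),   # Fx+: rightward motion
        np.maximum(-flow_x, 0.0),  # Fx-: leftward motion
        np.maximum(flow_y, 0.0),   # Fy+: positive vertical component
        np.maximum(-flow_y, 0.0),  # Fy-: negative vertical component
    ])

# Toy 1x2 flow field (values are made up for illustration).
flow_x = np.array([[1.0, -2.0]])
flow_y = np.array([[0.0, 3.0]])
channels = split_flow_channels(flow_x, flow_y)
```

    The original signed components can be recovered as `channels[0] - channels[1]` and `channels[2] - channels[3]`, so no information is lost by the split.
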
  • 7/24/2019 Lecture on Action Recognition

    18/72

    Representing Motion: Tracked Points

    Matikainen et al. 2009

    http://www.cs.cmu.edu/~rahuls/pub/voec2009-rahuls.pdf
  • 7/24/2019 Lecture on Action Recognition

    19/72

    Representing Motion: Space-Time Interest Points

    Corner detectors in space-time

    Laptev 2005

    http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf
  • 7/24/2019 Lecture on Action Recognition

    20/72

    Representing Motion: Space-Time Interest Points

    Laptev 2005

    http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf
  • 7/24/2019 Lecture on Action Recognition

    21/72

    Representing Motion: Space-Time Volumes

    Blank et al. 2005

    http://www.wisdom.weizmann.ac.il/~vision/VideoAnalysis/Demos/SpaceTimeActions/SpaceTimeActions_iccv05.pdf
  • 7/24/2019 Lecture on Action Recognition

    22/72

    Examples of Action Recognition Systems

    Feature-based classification

    Recognition using pose and objects

    Part-based recognition and localisation

  • 7/24/2019 Lecture on Action Recognition

    23/72

    Today's lecture

    Introduction, applications

    Features

    Feature classification methods

    Recognition using pose estimation and objects

    Part-based action localisation

  • 7/24/2019 Lecture on Action Recognition

    24/72

    Action recognition as classification

    Retrieving actions in movies, Laptev and Perez, 2007

    http://www.irisa.fr/vista/Papers/2007_iccv_laptev.pdf
  • 7/24/2019 Lecture on Action Recognition

    25/72

    Remember image categorization

    [Diagram: training pipeline. Training images -> image features; image features + training labels -> classifier training -> trained classifier]

  • 7/24/2019 Lecture on Action Recognition

    26/72

    Remember image categorization

    [Diagram: training and testing pipelines. Training: training images -> image features; image features + training labels -> classifier training -> trained classifier. Testing: test image -> image features -> trained classifier -> prediction (e.g., Outdoor)]

  • 7/24/2019 Lecture on Action Recognition

    27/72

    Spatial pyramids.

    Compute histogram in each spatial bin
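
    As a rough illustration of computing a histogram in each spatial bin, the sketch below assumes the image has already been converted to a map of visual-word indices (one integer per pixel or patch) and concatenates word histograms over a 1x1 grid, a 2x2 grid, and so on. The function name, the word map and the number of levels are assumptions, and the per-level weighting used in some pyramid-matching schemes is omitted.

```python
import numpy as np

def spatial_pyramid_histogram(word_map, n_words, levels=2):
    """Concatenate visual-word histograms over 1x1, 2x2, ... spatial grids."""
    height, width = word_map.shape
    histograms = []
    for level in range(levels):
        cells = 2 ** level  # cells per side at this pyramid level
        row_edges = np.linspace(0, height, cells + 1).astype(int)
        col_edges = np.linspace(0, width, cells + 1).astype(int)
        for r in range(cells):
            for c in range(cells):
                cell = word_map[row_edges[r]:row_edges[r + 1],
                                col_edges[c]:col_edges[c + 1]]
                histograms.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(histograms)

# Toy 4x4 map with three visual words (hypothetical data).
word_map = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [2, 2, 1, 1],
                     [2, 2, 1, 1]])
pyramid = spatial_pyramid_histogram(word_map, n_words=3, levels=2)
```

    The first three entries of `pyramid` are the global histogram; the remaining twelve are the four 2x2 cells in row-major order, which is what lets the descriptor keep coarse layout information.
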

  • 7/24/2019 Lecture on Action Recognition

    28/72

    Features for Classifying Actions

    1. Spatio-temporal pyramids (14x14x8 bins)

    Image Gradients

    Optical Flow
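
    The same binning idea extends to video by adding a temporal axis. The sketch below is an assumption-laden simplification: it splits a (T, H, W) clip into a small space-time grid and accumulates a magnitude-weighted histogram of unsigned image-gradient orientation in each cell. The 2x2x2 grid, the four orientation bins and the function name are chosen for the demo only; an optical-flow field could be binned in the same way.

```python
import numpy as np

def st_grid_histograms(video, grid=(2, 2, 2), n_orient=4):
    """Gradient-orientation histograms over a (time, rows, cols) grid of cells."""
    video = np.asarray(video, dtype=float)
    grad_y, grad_x = np.gradient(video, axis=(1, 2))  # spatial gradients per frame
    magnitude = np.hypot(grad_x, grad_y)
    orientation = np.mod(np.arctan2(grad_y, grad_x), np.pi)  # unsigned, [0, pi)
    orient_bin = np.minimum((orientation / np.pi * n_orient).astype(int),
                            n_orient - 1)
    edges = [np.linspace(0, size, cells + 1).astype(int)
             for size, cells in zip(video.shape, grid)]
    features = []
    for t in range(grid[0]):
        for y in range(grid[1]):
            for x in range(grid[2]):
                cell = (slice(edges[0][t], edges[0][t + 1]),
                        slice(edges[1][y], edges[1][y + 1]),
                        slice(edges[2][x], edges[2][x + 1]))
                features.append(np.bincount(orient_bin[cell].ravel(),
                                            weights=magnitude[cell].ravel(),
                                            minlength=n_orient))
    return np.concatenate(features)

# Toy clip: a static vertical edge, so all gradient energy is horizontal.
video = np.zeros((4, 8, 8))
video[:, :, 4:] = 1.0
features = st_grid_histograms(video)
```
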

  • 7/24/2019 Lecture on Action Recognition

    29/72

    Features for Classifying Actions

    2. Spatio-temporal interest points

    Corner detectors in

    space-time

    Descriptors based on Gaussian derivative filters over x, y, time

  • 7/24/2019 Lecture on Action Recognition

    30/72

    Classification

    Boosted stubs for pyramids of optical flow and gradient

    Nearest neighbor for STIP

  • 7/24/2019 Lecture on Action Recognition

    31/72

    Searching the video for an action

    1. Detect keyframes using a trained HOG detector in each frame

    2. Classify detected keyframes as positive (e.g., drinking) or negative (other)

  • 7/24/2019 Lecture on Action Recognition

    32/72

    Accuracy in searching video

    Without keyframe detection

    With keyframe detection

  • 7/24/2019 Lecture on Action Recognition

    33/72

    Learning realistic human actions from movies, Laptev et al. 2008

    Talk on phone

    Get out of car

    http://www.irisa.fr/vista/Papers/2008_cvpr_laptev.pdf
  • 7/24/2019 Lecture on Action Recognition

    34/72

    Approach

    Space-time interest point detectors

    Descriptors

    HOG, HOF

    Pyramid histograms (3x3x2)

    SVMs with Chi-Squared Kernel

    Interest Points

    Spatio-Temporal Binning
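
    To make the last step of the approach concrete, here is a hedged sketch of a chi-squared kernel over histograms, in a single-channel form: K = exp(-D / A), where D is the chi-squared distance and A is taken to be the mean pairwise distance between the samples. The normalisation by the mean distance, the function names and the toy histograms are assumptions for illustration.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms (eps avoids division by zero)."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi2_kernel_matrix(histograms, scale=None):
    """Kernel matrix K[i, j] = exp(-D(h_i, h_j) / scale)."""
    n = len(histograms)
    dist = np.array([[chi2_distance(a, b) for b in histograms]
                     for a in histograms])
    if scale is None:
        # Assumed normalisation: mean distance over distinct pairs
        # (requires at least two non-identical histograms).
        scale = dist[np.triu_indices(n, k=1)].mean()
    return np.exp(-dist / scale)

# Three toy L1-normalised histograms; the first two are identical.
hists = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.0, 1.0]])
K = chi2_kernel_matrix(hists)
```

    A matrix like `K` can be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.
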

  • 7/24/2019 Lecture on Action Recognition

    35/72

    Results

  • 7/24/2019 Lecture on Action Recognition

    36/72

    Today's lecture

    Introduction, applications

    Features

    Template classification methods

    Recognition using pose estimation and objects

    Part-based action localisation

  • 7/24/2019 Lecture on Action Recognition

    37/72

  • 7/24/2019 Lecture on Action Recognition

    38/72

    Human-Object Interaction

    Torso

    Head

    Human pose estimation

    Holistic image based classification

    Integrated reasoning

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    39/72

    Human-Object Interaction

    Tennis

    racket

    Human pose estimation

    Holistic image based classification

    Integrated reasoning

    Object detection

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    40/72

    Human-Object Interaction

    Human pose estimation

    Holistic image based classification

    Integrated reasoning

    Object detection

    Torso

    Head

    Tennis

    racket

    HOI activity: Tennis Forehand

    Slide Credit: Yao/Fei-Fei

    Action categorization

  • 7/24/2019 Lecture on Action Recognition

    41/72

    Human pose estimation & Object detection

    Human pose estimation is challenging:

    Difficult part appearance

    Self-occlusion

    Image region looks like a body part

    Felzenszwalb & Huttenlocher, 2005

    Ren et al, 2005

    Ramanan, 2006

    Ferrari et al, 2008

    Yang & Mori, 2008

    Andriluka et al, 2009

    Eichner & Ferrari, 2009

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    42/72

    Human pose estimation & Object detection

    Human pose

    estimation is

    challenging.

    Felzenszwalb & Huttenlocher, 2005

    Ren et al, 2005

    Ramanan, 2006

    Ferrari et al, 2008

    Yang & Mori, 2008

    Andriluka et al, 2009

    Eichner & Ferrari, 2009

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    43/72

    Human pose estimation & Object detection

    Facilitate: given the object is detected.

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    44/72

    Viola & Jones, 2001

    Lampert et al, 2008

    Divvala et al, 2009

    Vedaldi et al, 2009

    Small, low-resolution,

    partially occluded

    Image region similar

    to detection target

    Human pose estimation & Object detection

    Object

    detection is

    challenging

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    45/72

    Human pose estimation & Object detection

    Object

    detection is

    challenging

    Viola & Jones, 2001

    Lampert et al, 2008

    Divvala et al, 2009

    Vedaldi et al, 2009

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    46/72

    Human pose estimation & Object detection

    Facilitate: given the pose is estimated.

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    47/72

    Human pose estimation & Object detection

    Mutual Context

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    48/72

    Mutual Context Model Representation

    A: Activity (e.g., croquet shot, volleyball smash, tennis forehand)

    O: Object (e.g., croquet mallet, volleyball, tennis racket)

    H: Human pose. More than one H for each A (intra-class variations); unobserved during training.

    P: Body parts P1, P2, ..., PN. lP: location; θP: orientation; sP: scale.

    f: Image evidence (fO, f1, f2, ..., fN). Shape context. [Belongie et al, 2002]

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    49/72

    Activity Classification Results

    [Bar chart: classification accuracy. Our model: 83.3%; Gupta et al, 2009: 78.9%; Bag-of-Words (SIFT+SVM): 52.5%. Example classes shown: cricket shot, tennis forehand]

    Slide Credit: Yao/Fei-Fei

  • 7/24/2019 Lecture on Action Recognition

    50/72

    Today's lecture

    Introduction, applications

    Features

    Template classification methods

    Recognition using pose estimation and objects

    Part-based action localisation

  • 7/24/2019 Lecture on Action Recognition

    51/72

    Part-based Recognition & Localisation

    Implicit shape model

    Goal:

    Recognize categories of actions

    Localize them in terms of their bounding box (space + time)

    Challenges: occlusions, clutter, variations

    Hypothesis: analysis can be restricted to a set of spatiotemporally interesting/salient events

  • 7/24/2019 Lecture on Action Recognition

    52/72

    Information theoretical spatial saliency

    Proposal: use signal unpredictability as an indicator of saliency

    [Figure: two image regions with entropies HD = 3.866 and HD = 7.201]

    Spatial saliency: unpredictability in a single frame
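
    One simple way to turn "unpredictability" into a number is the Shannon entropy of the local grey-level histogram. The sketch below assumes 8-bit intensities and a 16-bin histogram; the bin count, the patches and the function name are illustrative choices, and the entropy values it produces are not meant to match the HD values on the slide.

```python
import numpy as np

def patch_entropy(patch, n_bins=16):
    """Shannon entropy (in bits) of the grey-level histogram of a patch."""
    counts, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
    p = counts[counts > 0] / counts.sum()  # empty bins contribute nothing
    return float(-np.sum(p * np.log2(p)))

# A flat patch is fully predictable; a noisy patch is not.
flat = np.full((8, 8), 128)
rng = np.random.default_rng(0)
textured = rng.integers(0, 256, size=(8, 8))
h_flat = patch_entropy(flat)
h_textured = patch_entropy(textured)
```

    With 16 bins the entropy is bounded by log2(16) = 4 bits, so `h_textured` lies between 0 and 4 while `h_flat` is zero.
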

  • 7/24/2019 Lecture on Action Recognition

    53/72

    Towards scale invariance

    [Plot: entropy (vertical axis, -0.2 to 1) against scale (circle radius, 0 to 80), with scales 29 and 59 marked]

    The entropy maxima reveal the spatial scale(s) of a salient region

    [Figure: detected salient points in a single frame]
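
    A hedged sketch of scale selection by entropy maxima: the grey-level entropy is measured inside discs of growing radius around one point, and radii at which the entropy peaks are reported as candidate scales. The histogram settings, the peak test and the synthetic image are all assumptions made for illustration.

```python
import numpy as np

def disc_entropy(image, center, radius, n_bins=16):
    """Grey-level entropy (bits) inside a disc of the given radius."""
    rows, cols = np.ogrid[:image.shape[0], :image.shape[1]]
    mask = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    counts, _ = np.histogram(image[mask], bins=n_bins, range=(0, 256))
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def entropy_peaks(image, center, radii):
    """Return the entropy at every radius and the radii that are local maxima."""
    entropies = np.array([disc_entropy(image, center, r) for r in radii])
    is_peak = ((entropies[1:-1] > entropies[:-2]) &
               (entropies[1:-1] >= entropies[2:]))
    return entropies, [radii[i + 1] for i in np.flatnonzero(is_peak)]

# Synthetic image: a bright disc of radius 5 on a dark background.
image = np.full((41, 41), 50)
rows, cols = np.ogrid[:41, :41]
image[(rows - 20) ** 2 + (cols - 20) ** 2 <= 25] = 200

radii = list(range(1, 16))
entropies, peaks = entropy_peaks(image, (20, 20), radii)
```

    For this toy image the entropy is zero while the disc stays inside the bright blob and is highest when the window holds roughly equal amounts of blob and background, which happens a little beyond the blob's own radius.
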

  • 7/24/2019 Lecture on Action Recognition

    54/72

    Spatial and spatiotemporal saliency

    Spatiotemporal saliency:

    Driven by signal unpredictability in a spatiotemporal volume (cylinder / sphere)

    Examine entropy: $Y(v_k) = H(v_k)\, w(v_k)$, where H is the entropy's height and w is the entropy's peakedness

    [Equations defining H and w in terms of the local probability density p over the volume D]

  • 7/24/2019 Lecture on Action Recognition

    55/72

    Descriptor extraction and codebook creation

    Input sequence

    Optical flow

    Optical flow after median subtraction

    Spatiotemporal salient point detection

    Descriptors: optical flow + spatial gradient, binned in histograms and concatenated

    Feature ensembles (O. Boiman & M. Irani [ICCV05])

    Feature selection

    Class-specific codebook of ensemble codewords c1, c2, ..., cN

  • 7/24/2019 Lecture on Action Recognition

    56/72

    Class-dependent spatio-temporal probabilistic voting

    [Figure: timeline around the current frame, labelled t, -t, T and T-t]

    Parameters stored for each ensemble in the training set:

    Average spatial position of the ensemble with respect to the subject center and lower bound

    Distance in frames of the activated ensemble from the start/end of the action

    Average spatiotemporal scale of the ensemble

    Localisation model learned for each codeword/cluster $c_i$, combined for an ensemble $e_d$:

    $p(\theta \mid e_d) = \sum_i w_i \, p(\theta \mid c_i) \, p(c_i \mid e_d)$

    [Figure labels: X, T, S]
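
    A sketch of probabilistic voting in the style of an implicit shape model, under the assumption that each detected ensemble e_d votes as p(theta | e_d) = sum_i w_i p(theta | c_i) p(c_i | e_d), where c_i are codewords, w_i are codeword weights and theta is discretised into a small grid of candidate action locations. The grid, the numbers and the function names are hypothetical.

```python
import numpy as np

def ensemble_vote(match_probs, codeword_votes, weights):
    """Vote of one ensemble: sum_i w_i * p(theta | c_i) * p(c_i | e_d).

    match_probs:    shape (n_codewords,), p(c_i | e_d)
    codeword_votes: shape (n_codewords, n_theta), rows are p(theta | c_i)
    weights:        shape (n_codewords,), w_i
    """
    match_probs = np.asarray(match_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    codeword_votes = np.asarray(codeword_votes, dtype=float)
    return (weights * match_probs) @ codeword_votes

def accumulate_votes(all_match_probs, codeword_votes, weights):
    """Sum the votes of every detected ensemble into one voting space."""
    return sum(ensemble_vote(m, codeword_votes, weights)
               for m in all_match_probs)

# Two codewords voting over four candidate locations (made-up values).
codeword_votes = [[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.5, 0.5, 0.0]]
weights = [1.0, 0.5]
all_match_probs = [[1.0, 0.0],   # ensemble 1 matches codeword 0 only
                   [0.2, 0.8]]   # ensemble 2 mostly matches codeword 1
voting_space = accumulate_votes(all_match_probs, codeword_votes, weights)
best_location = int(np.argmax(voting_space))
```

    In a full system the maxima of the voting space would be found with a mode-seeking step such as mean shift rather than a plain argmax.
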

  • 7/24/2019 Lecture on Action Recognition

    57/72

    Discriminative learning

    Higher weights for pdfs with higher localisation accuracy:

    $w_i = \exp\left( \int p(\theta \mid c_i) \log p(\theta \mid c_i) \, d\theta \right)$

    Class dictionary comprises discriminative codewords

    AdaBoost on the codeword similarities
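
    One way to give higher weights to pdfs with higher localisation accuracy is to weight each codeword by the exponential of the negative entropy of its vote distribution, w_i = exp(sum p log p), so that sharply peaked distributions score close to 1. The sketch below assumes this form on a discretised distribution; the function name and example distributions are illustrative.

```python
import numpy as np

def localisation_weight(vote_pdf):
    """Weight exp(-entropy) of a discrete vote distribution p(theta | c_i)."""
    p = np.asarray(vote_pdf, dtype=float)
    p = p / p.sum()           # normalise defensively
    nonzero = p[p > 0]        # 0 * log(0) is treated as 0
    negative_entropy = np.sum(nonzero * np.log(nonzero))
    return float(np.exp(negative_entropy))

# A codeword that always votes for the same place vs. one that votes everywhere.
w_peaked = localisation_weight([0.0, 1.0, 0.0, 0.0])
w_uniform = localisation_weight([0.25, 0.25, 0.25, 0.25])
```

    For a uniform distribution over n bins the weight is 1/n, so spreading votes over more locations lowers the weight.
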

  • 7/24/2019 Lecture on Action Recognition

    58/72

    Spatio-temporal probabilistic voting

  • 7/24/2019 Lecture on Action Recognition

    59/72

    Hypothesis verification with RVM-based classification

    Mean-shift responses $F = (f_1, \ldots, f_i, \ldots)$ used as features in RVM-based classification

    Two-class classification problem for each class l:

    $c_l(F; w) = \sum_{j=1}^{N} w_{lj} \, K(F, F_j) + w_{l0}$

    $K(F, F') = e^{-D_C(F, F')^2 / (2\sigma^2)}$

    $p(l \mid F) = \frac{1}{1 + e^{-c_l(F; w)}}$

    Select the class l that maximizes the posterior probability

    Relevance Vector Machine (RVM): a sparse Bayesian kernel classifier with the same functional form as the Support Vector Machine
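
    The prediction side of such a kernel classifier can be sketched as follows, assuming a Gaussian kernel on a precomputed distance, a linear combination of kernel values plus a bias as the class score, and a logistic sigmoid to turn the score into a posterior. Training (the sparse Bayesian estimation of the weights) is not shown, and the distances, weights and class names below are made up.

```python
import numpy as np

def gaussian_kernel(distance, sigma):
    """K = exp(-distance^2 / (2 sigma^2)) for a precomputed distance."""
    return np.exp(-np.asarray(distance, dtype=float) ** 2 / (2.0 * sigma ** 2))

def class_score(distances, weights, bias, sigma):
    """Score c_l = sum_j w_lj K(F, F_j) + w_l0 for one class."""
    return float(np.dot(weights, gaussian_kernel(distances, sigma)) + bias)

def class_posterior(score):
    """Logistic sigmoid: p(l | F) = 1 / (1 + exp(-c_l))."""
    return 1.0 / (1.0 + np.exp(-score))

def classify(distances_per_class, weights_per_class, bias_per_class, sigma=1.0):
    """Run one two-class problem per label and keep the largest posterior."""
    posteriors = {
        label: class_posterior(class_score(distances_per_class[label],
                                           weights_per_class[label],
                                           bias_per_class[label], sigma))
        for label in distances_per_class
    }
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical distances from a test sample to each class's stored examples.
distances = {"walking": [0.0, 2.0], "boxing": [3.0, 3.0]}
weights = {"walking": [2.0, -1.0], "boxing": [1.0, 1.0]}
biases = {"walking": -0.5, "boxing": -0.5}
best, posteriors = classify(distances, weights, biases)
```
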

  • 7/24/2019 Lecture on Action Recognition

    60/72


    Localisation of single actions

  • 7/24/2019 Lecture on Action Recognition

    61/72

    Localisation accuracy (KTH)

  • 7/24/2019 Lecture on Action Recognition

    62/72

    Localisation accuracy (KTH)


  • 7/24/2019 Lecture on Action Recognition

    63/72

    Action recognition

    KTH dataset average: 88%

    HoHA dataset average: 37%

  • 7/24/2019 Lecture on Action Recognition

    64/72

    Localisation under artificial occlusions (KTH)

  • 7/24/2019 Lecture on Action Recognition

    65/72

    Localisation under clutter (KTH)


  • 7/24/2019 Lecture on Action Recognition

    66/72

    Summary

    Advantages

    Highly flexible structure model

    Each part casts votes independently

    Only few training examples are needed

    Fast recognition

    Robustness to occlusions

    Disadvantages

    Loose spatial model that does not model co-occurrence of parts

    False positives in background (clutter)

  • 7/24/2019 Lecture on Action Recognition

    67/72

    Take-home messages

    Action recognition is an open problem.

    How to define actions? How to infer them?

    What are good visual cues?

    How do we incorporate higher level reasoning?


  • 7/24/2019 Lecture on Action Recognition

    68/72

    Take-home messages

    Some work done, but it is just the beginning of exploring the problem. So far:

    Actions are mainly categorical

    Most approaches are classification using simple features (spatial-temporal histograms of gradients or flow, s-t interest points, SIFT in images)

    Just a couple of works on how to incorporate pose and objects

    Not much idea of how to reason about long-term activities or to describe video sequences

  • 7/24/2019 Lecture on Action Recognition

    69/72

    To come

    Recognition in image sequences

    Action recognition (body gestures / facial expressions)

    Tracking

    Structure from motion

    Surveillance


  • 7/24/2019 Lecture on Action Recognition

    70/72

    References

    C. Sminchisescu. Learning and Inference Algorithms for Monocular Perception - Applications to Visual Object Detection, Localization and Time Series Models for 3D Human Motion Understanding. Habilitation Thesis, University of Bonn, Faculty of Mathematics and Natural Sciences, 2007.

    A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, vol. 23, pp. 257-267, Mar. 2001.

    A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In Proceedings of IEEE ICCV '03, Volume 2, 2003.

    P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV 2009.

    I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3), 2005.

    M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, Beijing, China, Oct. 15-21, 2005.

  • 7/24/2019 Lecture on Action Recognition

    71/72

    References

    I. Laptev and P. Pérez. Retrieving actions in movies. In Proceedings of ICCV '07, Rio de Janeiro, Brazil, October 2007, pp. 1-8.

    I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proceedings of CVPR '08, Anchorage, AK, June 2008, pp. 1-8.

    B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. IEEE CVPR '10, San Francisco, CA, USA, June 13-18, 2010.

    O. Boiman and M. Irani. Detecting irregularities in images and in video. In ICCV, Beijing, China, Oct. 15-21, 2005.

    P. Felzenszwalb and D. Huttenlocher. Pictorial Structures for Object Recognition. IJCV, vol. 61, no. 1, January 2005.

    D. Ramanan. Learning to parse images of articulated bodies. In NIPS 19, Vancouver, Canada, December 2006.

    V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive Search Space Reduction for Human Pose Estimation. IEEE CVPR, Alaska, June 2008.

  • 7/24/2019 Lecture on Action Recognition

    72/72

    References

    Y. Wang and G. Mori. Multiple Tree Models for Occlusion and Spatial Constraints in Human Pose Estimation. ECCV, 2008.

    M. Andriluka, S. Roth, and B. Schiele. Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In IEEE CVPR, 2009.

    M. Eichner and V. Ferrari. Better Appearance Models for Pictorial Structures. BMVC, London, September 2009.

    P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), 2001.

    M. B. Blaschko and C. H. Lampert. Learning to Localize Objects with Structured Output Regression. ECCV, Marseilles, France, 2008.

    A. Oikonomopoulos, I. Patras, and M. Pantic. Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences. IEEE Trans. Image Processing, vol. 20, no. 4, pp. 1126-1140, Mar. 2011.

    A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple Kernels for Object Detection. In Proceedings of the ICCV, 2009.