Toward Learning Mixture-of-Parts Pictorial Structures

Robin Hess and Alan Fern

School of Electrical Engineering and Computer Science

Oregon State University

Talk Objectives

• Overview of the OSU Digital Scout Project
• Describe the problem of initial formation labeling
  • Representational and inference challenges
• Mixture-of-Parts Pictorial Structures
  • Model definition
  • Inference
• Opportunities for learning
  • Parameters and structure
  • Speedup learning
  • Active learning
  • Transfer learning


The OSU Digital Scout Project

Objective: compute semantic interpretations of football video

Raw video → High-level interpretation of play

• Professional and college teams spend many hours attaching semantic tags to video for database access
• We want to make this process much more automatic
• Support computer-assisted strategic analysis of opponents

Previous Work: S. Intille. Visual Recognition of Multi-Agent Action. PhD Thesis, MIT, 1999.


Raw Video Data

• Obtained several games' worth of home-field video from the OSU football team
• One video file per play
• Exact same video used by coaches
• Video shot from a single fixed location at the top of Reser Stadium
• Camera is constantly panning and zooming


Registered Video Data

• Semantic interpretation requires registration of video data to football field coordinates
• Developed a robust registration approach [Hess & Fern, CVPR’07]

[Figure: video frame mapped to the field model by a planar homography]
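A planar homography maps image pixels to field coordinates through a 3x3 matrix applied in homogeneous coordinates. The sketch below only illustrates that mapping; the matrix `H` is a made-up example, not a registration result from the project, and the function name `apply_homography` is hypothetical.

```python
import numpy as np

def apply_homography(H, points):
    """Map an (N, 2) array of image points to field coordinates using H (3x3)."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 3)
    mapped = homogeneous @ np.asarray(H, dtype=float).T     # still homogeneous
    return mapped[:, :2] / mapped[:, 2:3]                   # divide by w

# Hypothetical homography (assumed values, for illustration only).
H = np.array([[0.1, 0.0, -5.0],
              [0.0, 0.1, 2.0],
              [0.001, 0.0, 1.0]])
field_xy = apply_homography(H, [[100.0, 50.0], [0.0, 0.0]])
```

Because the last row of `H` is not (0, 0, 1), the division by the third coordinate matters; this is what lets a single matrix absorb the camera's pan and zoom for a planar field.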


Problem: Formation Labelling

We consider a subproblem of full play interpretation

• Given: initial registered video frame of a play
• Output: offensive formation (types and locations of the 11 offensive players)
• Thousands of possible formations

[Figure: player locations & types]


Challenges in Formation Labelling

• Player appearances are nearly identical
  • Appearance is not useful for inferring player type
• Difficult to robustly segment individual players
  • “Part detector” style approaches are difficult to apply


Challenges in Formation Labelling

• Different formations can differ in subtle ways


Problem Constraints

A number of hard constraints are imposed by the rule book

• Exactly 11 players
• Exactly 7 players on the line and 4 players behind the line
• Exactly 1 quarterback and 1 center
• Location of the center is at midfield or on a “hash line”
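The counting constraints above are easy to state as a predicate over a candidate set of player types. This is a minimal sketch: only QBS, QB, C, LG, RG and LTE appear in the talk; every other label, and the grouping of types into "line" types, is an assumption made for illustration. The constraint on the center's location is not modelled here.

```python
from collections import Counter

# Assumed grouping of types; only some of these labels come from the slides.
LINE_TYPES = {"C", "LG", "RG", "LT", "RT", "LTE", "RTE"}
QUARTERBACK_TYPES = {"QB", "QBS"}

def is_legal_part_set(player_types):
    """Check the rule-book counting constraints for a list of player types."""
    counts = Counter(player_types)
    on_line = sum(1 for t in player_types if t in LINE_TYPES)
    quarterbacks = sum(counts[t] for t in QUARTERBACK_TYPES)
    return (
        len(player_types) == 11
        and on_line == 7
        and len(player_types) - on_line == 4
        and quarterbacks == 1
        and counts["C"] == 1
    )

# Hypothetical formation: 7 line players, 1 quarterback, 3 other backs.
formation = ["C", "LG", "RG", "LT", "RT", "LTE", "RTE", "QB", "RB", "FB", "WR"]
legal = is_legal_part_set(formation)
too_few = is_legal_part_set(formation[:-1])
```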


Problem Constraints

• Soft constraints on relative spatial locations of players
• Constraints strongly depend on the set of player types


Previous Attempt

• Intille used a KB of hard constraints to cast the task as a SAT-like problem
  • Constraints: “near”, “to the left of”, “bit of vertical space between”, etc.
• Simplified the problem by hand-labelling the field locations of the 11 players
  • Only tried to infer player types
• Intille could not get the approach to work well, and it was abandoned

S. Intille. Visual Recognition of Multi-Agent Action. PhD Thesis, MIT, 1999.


Structured Output Representations

• Infer type & location for each of the 11 players (22 output variables)
  • t_i ∈ {QBS, QB, C, LG, RG, LTE, ...}, 34 types
  • l_i ∈ {(0,0), (0,1), …, (n,m)}, pixel location
• Our representation must capture
  • Hard joint constraints among types
  • Soft joint constraints among locations, conditioned on types and image data
• Possible to encode constraints via standard discrete factor-graph models (e.g. CRFs, weighted CSPs, ILP, etc.)
• Such encodings appear problematic with respect to off-the-shelf inference techniques (?)
  • Domains of variables are huge (many values)
  • Large factors (e.g. exactly 7 “line type” players)
  • Location constraints are inherently numeric


Pictorial Structures

• Offensive formations can be viewed as multi-part articulated objects (parts correspond to players)
• Pictorial structure models have been successful for multi-part objects in computer vision
  • Local part appearance models
  • Deformable connections
  • Joint estimation of part locations
• Pictorial structures are simply pairwise graphical models whose node values are part locations

[Figure courtesy of Fischler & Elschlager]


• When the edge structure forms a tree, can use DP to compute the MAP estimate in O(nh^2) time
  • n: number of parts, h: number of pixels
  • h^2 is often impractical
• If in addition d_ij(·,·) is a Mahalanobis distance, then the computation can be done in O(nh) time!
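The O(nh^2) dynamic program can be sketched as a leaves-to-root min-sum pass followed by a root-to-leaves backtrack. The example below is a toy version under stated assumptions: locations are four positions on a line rather than image pixels, costs are arbitrary numbers, and the names `tree_map`, `unary` and `pairwise` are hypothetical. The O(nh) variant, which replaces the inner minimisation with a distance transform, is not shown.

```python
import numpy as np

def tree_map(unary, parent, pairwise):
    """Minimum-cost part locations for a tree-structured model.

    unary[i]    : (h,) match cost of part i at each location
    parent[i]   : parent index of part i (-1 for the root, part 0);
                  parents must precede their children
    pairwise[i] : (h, h) deformation cost, rows = parent location,
                  columns = location of part i (unused for the root)
    """
    n = len(unary)
    subtree = [np.array(u, dtype=float) for u in unary]  # cost of subtree at i
    best_child = [None] * n
    # Leaves-to-root pass: each edge costs O(h^2).
    for i in range(n - 1, 0, -1):
        table = pairwise[i] + subtree[i][None, :]
        best_child[i] = table.argmin(axis=1)
        subtree[parent[i]] += table.min(axis=1)
    # Root-to-leaves pass: read off the optimal locations.
    locs = [0] * n
    locs[0] = int(subtree[0].argmin())
    for i in range(1, n):
        locs[i] = int(best_child[i][locs[parent[i]]])
    return locs, float(subtree[0].min())

# Toy chain 0 -> 1 -> 2; each child ideally sits one step right of its parent.
positions = np.arange(4)
P = (positions[None, :] - positions[:, None] - 1) ** 2
unary = [[0, 5, 5, 5], [5, 5, 0, 5], [5, 5, 5, 0]]
locs, cost = tree_map(unary, [-1, 0, 1], [None, P, P])
```

In this toy instance the best configuration places the parts at positions 0, 2 and 3: every part sits where its own match cost is zero, and the only penalty is one unit of deformation between the first two parts.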


Pictorial Structures for Football

• For a fixed set of player types, locations can be well approximated by a pictorial structure
• But part sets (i.e. player types) vary across plays
  • Can’t use standard pictorial structures for our problem
• Can we still leverage the benefits of pictorial structures?


Mixture of Parts Pictorial Structures (MoPPS)

Captures constraints on legal part sets via pv

Captures spatial constraints among parts via f


MoPPS Inference

Find MAP estimate of most likely set of parts and their locations:

• Worst case: evaluate the pictorial structure of each legal part set
  • Requires over an hour of processing for our problem
• Need a structured MoPPS representation that can be exploited for fast inference
  • We use a “MoPPS Tree”


MoPPS Tree Representation

The pictorial structure for a legal part set is the projection of the global tree onto that part set
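One plausible reading of "projection" is sketched below: keep only the parts in the part set and reconnect each kept part to its nearest kept ancestor in the global tree. That reading is an assumption, not something the slides spell out, and the tree, the labels RB and FB, and the function name `project_tree` are all hypothetical.

```python
def project_tree(parent, part_set):
    """Project a global tree onto a subset of its parts.

    parent   : dict mapping each part to its parent (root maps to None)
    part_set : set of parts to keep
    Returns a dict mapping each kept part to its nearest kept ancestor.
    """
    projected = {}
    for part in part_set:
        ancestor = parent[part]
        # Skip over ancestors that are absent from this part set.
        while ancestor is not None and ancestor not in part_set:
            ancestor = parent[ancestor]
        projected[part] = ancestor
    return projected

# Hypothetical global tree rooted at the center.
global_parent = {"C": None, "QB": "C", "LG": "C", "RB": "QB", "FB": "RB"}
projected = project_tree(global_parent, {"C", "LG", "QB", "FB"})
```

Here FB loses its parent RB and is attached to QB instead. The result is again a tree as long as the root belongs to the part set, which fits the rule that every formation has exactly one center.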


MoPPS Tree for Football

34 parts in model (one for each possible player type)

Includes local observation models

Includes pairwise spatial constraints

Also provide constraints for evaluating legal part sets


MoPPS Tree Inference

Becomes combinatorial optimization over legal part sets

We use Branch-and-Bound Search (BBS)


Branch-and-Bound Search

• Search nodes are part sets
  • Internal nodes represent sets of legal part sets
  • Leaves are legal part sets
• While solution not found
  • Expand the least node according to the ordering relation
  • Compute upper and lower bounds
  • Prune any dominated node


Lower Bound Computations

• Monotonicity: adding to a set of parts will never result in reduced cost
• Simply compute the pictorial structure match of the tree projected onto the parts in the search node
• Can improve on this by adding a cost for “missing parts”


Upper Bound Computations

• Match the entire MoPPS tree to the image data
• Use as a heuristic for quickly finding a legal completion of the current part set
• Cost of the completion is an upper bound
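How the two bounds interact can be shown with a small best-first branch-and-bound. This is a toy sketch, not the talk's implementation: a "legal part set" is assumed to be any set of exactly k parts, the cost is a made-up monotone function rather than a pictorial structure match, and the upper bound comes from a simple greedy completion. All names are hypothetical.

```python
import heapq

def branch_and_bound(n_parts, k, cost):
    """Find the k-subset of parts 0..n_parts-1 with minimum cost.

    cost(parts) must be monotone: adding a part never reduces it, so the
    cost of a partial set is a lower bound on any of its completions.
    """
    def greedy_completion(chosen, next_idx):
        parts, remaining = chosen, list(range(next_idx, n_parts))
        while len(parts) < k:
            if not remaining:
                return None  # cannot reach a legal part set from here
            best = min(remaining, key=lambda i: cost(parts + (i,)))
            parts += (best,)
            remaining.remove(best)
        return parts

    best_cost, best_set = float("inf"), None
    frontier = [(cost(()), (), 0)]  # (lower bound, chosen parts, next index)
    while frontier:
        bound, chosen, next_idx = heapq.heappop(frontier)
        if bound >= best_cost:
            break  # every remaining node is dominated
        completion = greedy_completion(chosen, next_idx)
        if completion is None:
            continue
        upper = cost(completion)  # cost of a legal completion
        if upper < best_cost:
            best_cost, best_set = upper, completion
        if len(chosen) == k:
            continue
        # Branch: include or exclude the next undecided part.
        for child in (chosen + (next_idx,), chosen):
            lower = cost(child)
            if lower < best_cost:
                heapq.heappush(frontier, (lower, child, next_idx + 1))
    return best_set, best_cost

# Toy monotone cost: per-part costs plus a penalty when parts 0 and 1 co-occur.
UNARY = [1, 2, 3, 4]
def toy_cost(parts):
    penalty = 10 if 0 in parts and 1 in parts else 0
    return sum(UNARY[i] for i in parts) + penalty

best_set, best_cost = branch_and_bound(4, 2, toy_cost)
```

The search stops as soon as the smallest lower bound on the queue is no better than the best completion found so far, which is also what gives the procedure its anytime character.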


MoPPS Tree Parameters for Football

• 34 parts, 3200+ legal formations
  • 16 basic player types plus subtypes
• Connections modeled as a Gaussian over the ideal location relative to the “parent” player
  • Parameters manually set using training images
• Observation model uses two independent components
  • One based on a background model
  • One based on color histogramming
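A Gaussian over the ideal offset from the parent corresponds, up to an additive constant and a factor of two in negative log-likelihood, to a squared Mahalanobis distance. The sketch below computes that distance; the offset and covariance values are invented for illustration and are not the hand-set parameters from the talk.

```python
import numpy as np

def connection_cost(child_xy, parent_xy, ideal_offset, cov):
    """Squared Mahalanobis distance of a child from its ideal location."""
    residual = (np.asarray(child_xy, dtype=float)
                - np.asarray(parent_xy, dtype=float)
                - np.asarray(ideal_offset, dtype=float))
    # Solve cov * x = residual instead of forming the inverse explicitly.
    return float(residual @ np.linalg.solve(np.asarray(cov, dtype=float), residual))

# Assumed parameters: child ideally 10 units right and 5 units up of its parent.
cov = [[4.0, 0.0], [0.0, 1.0]]
at_ideal = connection_cost([10.0, 5.0], [0.0, 0.0], [10.0, 5.0], cov)
displaced = connection_cost([12.0, 5.0], [0.0, 0.0], [10.0, 5.0], cov)
```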


Background Model

• Register lots of video to the field model
• Learn a kernel density estimate of color at each pixel
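A per-pixel kernel density estimate can be sketched as an average of kernels centred on the colors previously seen at that field location. The kernel shape (isotropic Gaussian), the bandwidth, and the sample colors below are assumptions; the slides do not specify them.

```python
import numpy as np

def background_likelihood(samples, color, bandwidth=10.0):
    """Gaussian KDE of `color` given colors observed at one field location.

    samples : (N, 3) array of RGB values seen at this location
    color   : (3,) RGB value to score
    """
    samples = np.asarray(samples, dtype=float)
    diff = samples - np.asarray(color, dtype=float)
    sq_dist = (diff ** 2).sum(axis=1)
    dims = samples.shape[1]
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dims / 2.0)
    return float(np.mean(np.exp(-0.5 * sq_dist / bandwidth ** 2)) / norm)

# Hypothetical grass-colored samples for one field location.
grass = [[40, 120, 40], [45, 125, 38], [38, 118, 42]]
p_grass = background_likelihood(grass, [42, 121, 40])
p_player = background_likelihood(grass, [250, 250, 250])
```

A low background likelihood at a registered pixel suggests that something other than the field, such as a player, is covering it.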


Results


Anytime Behavior: % Correct

• Exhaustive search requires close to an hour

• Greedy search is fast but achieves only 80% accuracy

• Mean-squared location error less than a yard


Directions: Learning MoPPS Models

• Successfully hand-coded a MoPPS model
  • Was quite time-consuming to get the parameters right
  • Motivates supervised structure and parameter learning
• MoPPS model takes an average of 4 minutes per play
  • Still too slow for the weekly volume of game video
  • Motivates speedup learning
• MoPPS model will sometimes need to be relearned/adapted to different sets of video
  • Want to reduce labelling effort
  • Motivates active and transfer learning


Structure and Parameter Learning

• Goal: learn the structure and parameters of the MoPPS tree from labelled data
  • Assume hard constraints on legal part sets are provided
• There are algorithms for learning the structure of pictorial structures
  • Can easily be modified to learn a MoPPS tree
  • Easy to combine with generative parameter learning


Structure and Parameter Learning

• Issue: pure generative parameter learning will likely not be sufficient
  • Hand-coded model incorporates “reward terms” to make up for deficiencies in the generative observation model
  • Suggests augmenting the generative model with discriminatively trained components
• Issue: inference time of 4 minutes makes most generative training methods quite expensive
  • Suggests using approaches that do not perform full joint inference for each parameter update


Speedup Learning

How can we speed up branch-and-bound search? There are a number of interesting settings

• Setting 1: Given a MoPPS model & upper/lower bound functions
  • Learn effective search space operators
• Setting 2: Given a MoPPS model & search space
  • Learn more accurate upper/lower bound functions
• Setting 3: Given a MoPPS model & search space & possibly bounds
  • Learn an effective priority queue ranking function


Active Model Calibration

• Want to minimize labelling effort for a new video set
  • Active learning and/or semi-supervised learning
• Want to leverage experience with previous videos
  • Transfer learning
• How can we combine these two paradigms for label-efficient active model calibration?
  • User interface is also critical
• Very rough idea:
  • Assume fixed model structure
  • Learn a prior on parameters from previous data sets
  • Use the prior for regularization and example selection


Summary and Future Work

• New structured output challenge problem
  • We will provide a labelled data set
  • Can off-the-shelf structured learning approaches work?
• Suggests investigating lesser-studied directions
  • Speedup learning
  • Active calibration
• On the horizon
  • Applying to defensive formations
  • Full temporal play interpretation
  • Mining strategic knowledge
  • Strategic planning


The Digital Scout Project

http://eecs.oregonstate.edu/football
