Beyond Nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers
by Abhinav Gupta & Larry S. Davis, presented by Arvie Frydenlund
Paper information
- ECCV 2008
- Slides at http://www.cs.cmu.edu/~abhinavg/
- http://www.cs.cmu.edu/%7Eabhinavg/eccv2008.ppt
Objectives of the paper
Task:
- Auto-annotation of image regions with labels

Methods:
- Two models are learned
- Training model
  - Learns classifiers for nouns and relationships at the same time
  - Learns priors on possible relationships for pairs of nouns
- Inference model, given the above classifiers and priors

Issues:
- Dataset is weakly labeled
- Not all labels are used all the time in the dataset
Weakly labeled data
President Obama debates Mitt Romney, while the audience sits in the background. (while the audience sits behind the debaters)
Co-occurrence Ambiguities

Only have images of cars that include a street:
A man beside a car on the street in front of a fence.
Noun relationships
[Figure: two candidate labelings of the same image regions, Car/Street (red) vs. Street/Car (blue)]
- On(Car, Street)
- P(red labeling) > P(blue labeling)
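A minimal sketch of how such a relationship prior breaks the tie between the two labelings. The dictionary, the probabilities, and the `labeling_score` helper are all illustrative, not from the paper:

```python
# Hypothetical relationship prior P(r | n1, n2): cars are usually
# on streets, streets are essentially never on cars.
relation_prior = {
    ("car", "street", "on"): 0.9,
    ("street", "car", "on"): 0.1,
}

def labeling_score(n_top, n_bottom):
    """Score a labeling in which n_top is 'on' n_bottom."""
    return relation_prior.get((n_top, n_bottom, "on"), 0.0)

red = labeling_score("car", "street")    # red labeling: On(Car, Street)
blue = labeling_score("street", "car")   # blue labeling: On(Street, Car)
assert red > blue                        # P(red labeling) > P(blue labeling)
```

With only co-occurrence statistics the two labelings would be symmetric; the asymmetric prior on the preposition is what disambiguates them.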
Prepositions and comparative adjectives
Most common prepositions:
- above, across, after, against, along, at, behind, below, beneath, beside, between, beyond, by, down, during, in, inside, into, near, off, on, onto, out, outside, over
- since, till, after, before, from, past, to, around, through, throughout
- for, except, about, like, of

Comparative adjectives:
I larger, smaller, taller, heavier, faster
Relationships Actually Used
Used 19 in total:
- above, behind, beside, more textured, brighter, in, greener, larger, left, near, far, from, on-top-of, more blue, right, similar, smaller, taller, shorter
Images and regions
- Each image is pre-segmented and (weakly) annotated by a set of nouns and relations between the nouns
- Regions are represented by a feature vector based on:
  - Appearance (RGB, intensity)
  - Shape (convexity, moments)
- Models for nouns are based on features of the regions
- Relationship models are based on differential features:
  - Difference of average intensity
  - Difference of location
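The per-region and differential features above can be sketched as follows. The exact feature set (mean RGB, mean intensity, centroid) and the function names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def region_features(pixels, centroid):
    """Per-region features. pixels: (N, 3) RGB array; centroid: (x, y)."""
    rgb_mean = pixels.mean(axis=0)          # appearance: average RGB
    intensity = rgb_mean.mean()             # appearance: average intensity
    return {"rgb": rgb_mean, "intensity": intensity,
            "centroid": np.asarray(centroid, dtype=float)}

def differential_features(fj, fk):
    """Differential features I_jk for a region pair (j, k)."""
    return {
        "d_intensity": fj["intensity"] - fk["intensity"],
        "d_location": fj["centroid"] - fk["centroid"],
    }
```

The relationship models never see the raw regions, only these pairwise differences, which is why a relationship like 'brighter' reduces to a threshold on `d_intensity`.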
Egg-Chicken
- Learning models for the nouns and relationships requires assigning labels
- Assigning labels requires some model for nouns and relationships
- Solution is to use EM:
  - E-step: compute noun annotation assignments to labels given the old parameters
  - M-step: compute new parameters given the E-step assignments
- Classifiers are initialized by a previous automatic-annotation method, i.e. Duygulu et al., "Object recognition as machine translation", ECCV (2002)
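The chicken-and-egg loop can be illustrated with a deliberately tiny EM: each noun's model is reduced to a single 1-D feature mean, relationships are omitted, and initialisation is crude rather than bootstrapped from Duygulu et al. All of these simplifications are assumptions for the sketch:

```python
import math

def em(images, nouns, iters=20, var=1.0):
    """Toy EM over weakly labeled images.
    images: list of (label_set, region_feature_list) pairs."""
    means = {n: float(i) for i, n in enumerate(nouns)}  # crude initialisation
    for _ in range(iters):
        num = {n: 0.0 for n in nouns}
        den = {n: 0.0 for n in nouns}
        for labels, feats in images:
            for x in feats:
                # E-step: responsibility of each weak label for this region,
                # under a fixed-variance Gaussian likelihood.
                w = {n: math.exp(-(x - means[n]) ** 2 / (2 * var)) for n in labels}
                z = sum(w.values()) or 1.0
                for n in labels:
                    num[n] += w[n] / z * x
                    den[n] += w[n] / z
        # M-step: re-estimate each noun's mean from the soft assignments.
        means = {n: num[n] / den[n] if den[n] else means[n] for n in nouns}
    return means
```

Even with both labels present on the first image, the region near 0 drifts to 'sky' and the region near 5 to 'grass', which is exactly the assignment/parameter loop the slide describes.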
Generative training model
- C_A and C_R are classifiers (models) for the noun assignments and relationships
- I_j and I_k are region features for regions j and k; I_jk are the differential features
- n_s and n_p are two nouns
- r is a relationship
- θ = (C_A, C_R) are the parameters of the likelihood L(θ)
Fig. 2 from A. Gupta & L.S. Davis
Training
- Too expensive to evaluate L(θ) directly
- Use EM to estimate L(θ), with assignments as hidden values
- Assume predicates are independent given the image and assignment
  - Obviously wrong, since most predicates preclude others
  - e.g. two regions can't be both 'on top of' and 'beside' each other
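The independence assumption just says the joint likelihood of a caption's predicates factorises into a product of per-predicate terms (a sum in log space), even for mutually exclusive predicates. A minimal illustration, with made-up probabilities:

```python
import math

def caption_log_likelihood(predicate_probs):
    """predicate_probs: P(predicate_i | image, assignment) for each
    predicate in the caption. Under the independence assumption the
    joint is just the product, i.e. the sum of logs."""
    return sum(math.log(p) for p in predicate_probs)

# Treating On(A, B) and Beside(A, B) as independent multiplies their
# probabilities, even though they cannot both hold.
score = caption_log_likelihood([0.5, 0.5])  # = log(0.25)
```

This is what makes L(θ) cheap to evaluate inside EM; the cost is exactly the modelling error the slide flags.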
Training relationships modelled
- C_A, the noun model, is implemented as a nearest-neighbour-based likelihood model
- C_R, the relationship model, is implemented as a decision-stump-based likelihood model
- Most relationships are modelled correctly
- A few were not:
  - in: 'not captured by colour, shape, and location'(?)
  - on-top-of
  - taller, due to the poor segmentation algorithm
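A decision-stump likelihood is just a threshold on one differential feature. A sketch for a relationship like 'above', with an assumed image coordinate convention (y grows downward) and illustrative probabilities:

```python
def stump_likelihood(d_y, threshold=0.0, p_true=0.9, p_false=0.1):
    """P('above' | I_jk) from the vertical-location difference
    d_y = y_j - y_k. Region j above region k means a smaller y,
    i.e. d_y below the threshold (assuming y grows downward)."""
    return p_true if d_y < threshold else p_false
```

A single thresholded feature explains why relationships such as 'in' fail: no one differential feature of colour, shape, or location separates the positive and negative pairs.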
Inference model
- Given trained C_A and C_R from the above model
- Find P(n1, n2, ... | I1, I2, ..., C_A, C_R)
- Each region is represented by a noun node
- Edges between nodes are weighted by the likelihood obtained from the differential features
Fig. 3 from A. Gupta & L.S. Davis
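The inference step above can be sketched as brute-force search over noun assignments, scoring each by its noun likelihoods (standing in for C_A) times its pairwise relationship likelihoods (standing in for C_R). The exhaustive enumeration and the callback signatures are assumptions; the paper's actual inference over the graph would not enumerate all permutations:

```python
from itertools import permutations

def best_labeling(regions, nouns, noun_lik, pair_lik):
    """regions: region ids; nouns: candidate nouns;
    noun_lik(n, r): likelihood of region r under noun n;
    pair_lik(n1, n2, r1, r2): edge weight from differential features."""
    best, best_score = None, float("-inf")
    for assign in permutations(nouns, len(regions)):
        score = 1.0
        for r, n in zip(regions, assign):      # node (noun) terms
            score *= noun_lik(n, r)
        for i in range(len(regions)):          # edge (relationship) terms
            for j in range(i + 1, len(regions)):
                score *= pair_lik(assign[i], assign[j], regions[i], regions[j])
        if score > best_score:
            best, best_score = assign, score
    return best
```

With uniform noun likelihoods, the edge terms alone pick the labeling that satisfies the learned relationship priors, mirroring the Car/Street example earlier.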
Experimental setup
- Corel5K dataset
- 850 training images, tagged with nouns and manually labeled relationships
- Vocabulary size: 173 nouns, 19 relationships
- Same segmentation and feature vectors as Duygulu et al., "Object recognition as machine translation", ECCV (2002)
- Training model test set: 150 images (from the training set)
- Inference model test set: 100 images (given that those images have the same vocabulary)
Training model evaluation

Two metrics are used:
- Range semantics: counts the number of correctly labeled words, treating each label with the same weight
- Frequency counts: counts the number of correctly labeled regions, which weights more frequent words higher
- Compared to the simple IBM Model 1 (MT model, 1993) and the Duygulu et al. MT model
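One plausible reading of the two metrics, sketched in code; the exact formulations (per-word averaging vs. per-region counting) are assumptions, not taken from the paper:

```python
def range_semantics(pred_by_word, gold_by_word):
    """Each vocabulary word counts once: average, over words, of the
    fraction of that word's gold regions recovered.
    pred_by_word / gold_by_word: word -> set of region ids."""
    per_word = [
        len(pred_by_word.get(w, set()) & regions) / len(regions)
        for w, regions in gold_by_word.items()
    ]
    return sum(per_word) / len(per_word)

def frequency_counts(pred, gold):
    """Each region counts once, so frequent words dominate.
    pred / gold: region id -> word."""
    return sum(pred.get(r) == w for r, w in gold.items()) / len(gold)
```

A rare word labeled wrong barely moves `frequency_counts` but costs a full share of `range_semantics`, which is why the slide reports both.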
Inference model evaluation

- Annotating unseen images
- Doesn't use the Corel annotations due to missing labels
- 24% and 17% reduction in missed labels
- 63% and 59% reduction in false labels
Inference model examples
Duygulu et al. is the top row and the paper's results are the bottom row
Inference model Precision-Recall
Duygulu et al. is [1]
Novelties and limitations
Achievements:
- Novel use of prepositions and comparative adjectives for automatic annotation
- Uses previous annotation models for bootstrapping
- Good results

Limitations:
- Only uses two-argument predicates, which results in relationships like 'greener'
- Can't do the pink flower example
- Assumes a one-to-one relationship between nouns and image segments
Questions?
- One of the motivations was the co-occurrence problem. Wouldn't a simpler model with better training data solve this problem?
- Image caption generation to annotation stack?
- Model simplification: is assuming independence of predicates justified?
- Does it scale with the vocabulary size and the number of relationships used? 'Bluer' and 'greener' only work for outdoor scenes.