Beyond Nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers
by Abhinav Gupta & Larry S. Davis, presented by Arvie Frydenlund
Paper information
- ECCV 2008
- Slides at http://www.cs.cmu.edu/~abhinavg/
- http://www.cs.cmu.edu/%7Eabhinavg/eccv2008.ppt
Objectives of the paper
Task:
- Auto-annotation of image regions with labels

Methods:
- Two models are learned
- Training model
  - Learns classifiers for nouns and relationships at the same time
  - Learns priors on possible relationships for pairs of nouns
- Inference model, given the above classifiers and priors

Issues:
- Dataset is weakly labeled
- Not all labels are used all the time in the dataset
Weakly labeled data
President Obama debates Mitt Romney, while the audience sits in the background. (while the audience sits behind the debaters)
Co-occurrence Ambiguities

Only have images of cars that include a street:
A man beside a car on the street in front of a fence.
Noun relationships
[Figure: two candidate labelings of the same image regions, Car/Street (red) vs. Street/Car (blue)]
- On(Car, Street)
- P(red labeling) > P(blue labeling)
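A minimal sketch of how such a relationship prior breaks the tie between the two labelings. The dictionary, the probabilities, and the `labeling_score` helper are all illustrative, not from the paper:

```python
# Hypothetical relationship prior P(r | n1, n2): cars are usually
# on streets, streets are essentially never on cars.
relation_prior = {
    ("car", "street", "on"): 0.9,
    ("street", "car", "on"): 0.1,
}

def labeling_score(n_top, n_bottom):
    """Score a labeling in which n_top is 'on' n_bottom."""
    return relation_prior.get((n_top, n_bottom, "on"), 0.0)

red = labeling_score("car", "street")    # red labeling: On(Car, Street)
blue = labeling_score("street", "car")   # blue labeling: On(Street, Car)
assert red > blue                        # P(red labeling) > P(blue labeling)
```

With only co-occurrence statistics the two labelings would be symmetric; the asymmetric prior on the preposition is what disambiguates them.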
Prepositions and comparative adjectives
Most common prepositions:
- above, across, after, against, along, at, behind, below, beneath, beside, between, beyond, by, down, during, in, inside, into, near, off, on, onto, out, outside, over
- since, till, after, before, from, past, to, around, through, throughout
- for, except, about, like, of

Comparative adjectives:
I larger, smaller, taller, heavier, faster
Relationships Actually Used
Used 19 in total:
- above, behind, beside, more textured, brighter, in, greener, larger, left, near, far, from, on-top-of, more blue, right, similar, smaller, taller, shorter
Images and regions
- Each image is pre-segmented and (weakly) annotated by a set of nouns and relations between the nouns
- Regions are represented by a feature vector based on:
  - Appearance (RGB, intensity)
  - Shape (convexity, moments)
- Models for nouns are based on features of the regions
- Relationship models are based on differential features:
  - Difference of average intensity
  - Difference of location
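The per-region and differential features above can be sketched as follows. The exact feature set (mean RGB, mean intensity, centroid) and the function names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def region_features(pixels, centroid):
    """Per-region features. pixels: (N, 3) RGB array; centroid: (x, y)."""
    rgb_mean = pixels.mean(axis=0)          # appearance: average RGB
    intensity = rgb_mean.mean()             # appearance: average intensity
    return {"rgb": rgb_mean, "intensity": intensity,
            "centroid": np.asarray(centroid, dtype=float)}

def differential_features(fj, fk):
    """Differential features I_jk for a region pair (j, k)."""
    return {
        "d_intensity": fj["intensity"] - fk["intensity"],
        "d_location": fj["centroid"] - fk["centroid"],
    }
```

The relationship models never see the raw regions, only these pairwise differences, which is why a relationship like 'brighter' reduces to a threshold on `d_intensity`.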
Egg-Chicken
- Learning models for the nouns and relationships requires assigning labels
- Assigning labels requires some model for nouns and relationships
- Solution is to use EM:
  - E-step: compute noun annotation assignments to labels given the old parameters
  - M-step: compute new parameters given the E-step assignments
- Classifiers are initialized by a previous automatic-annotation method, i.e. Duygulu et al., "Object recognition as machine translation", ECCV (2002)
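The chicken-and-egg loop can be illustrated with a deliberately tiny EM: each noun's model is reduced to a single 1-D feature mean, relationships are omitted, and initialisation is crude rather than bootstrapped from Duygulu et al. All of these simplifications are assumptions for the sketch:

```python
import math

def em(images, nouns, iters=20, var=1.0):
    """Toy EM over weakly labeled images.
    images: list of (label_set, region_feature_list) pairs."""
    means = {n: float(i) for i, n in enumerate(nouns)}  # crude initialisation
    for _ in range(iters):
        num = {n: 0.0 for n in nouns}
        den = {n: 0.0 for n in nouns}
        for labels, feats in images:
            for x in feats:
                # E-step: responsibility of each weak label for this region,
                # under a fixed-variance Gaussian likelihood.
                w = {n: math.exp(-(x - means[n]) ** 2 / (2 * var)) for n in labels}
                z = sum(w.values()) or 1.0
                for n in labels:
                    num[n] += w[n] / z * x
                    den[n] += w[n] / z
        # M-step: re-estimate each noun's mean from the soft assignments.
        means = {n: num[n] / den[n] if den[n] else means[n] for n in nouns}
    return means
```

Even with both labels present on the first image, the region near 0 drifts to 'sky' and the region near 5 to 'grass', which is exactly the assignment/parameter loop the slide describes.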
Generative training model
- C_A and C_R are classifiers (models) for the noun assignments and relationships
- I_j and I_k are region features for regions j and k; I_jk are the differential features
- n_s and n_p are two nouns
- r is a relationship
- θ = (C_A, C_R) are the parameters of the likelihood L(θ)
Fig. 2 from A. Gupta & L.S. Davis
Training
- Too expensive to evaluate L(θ) directly
- Use EM to estimate L(θ), with assignments as hidden values
- Assume predicates are independent given the image and assignment
  - Obviously wrong, since most predicates preclude others
  - e.g. two regions can't be both 'on top of' and 'beside' each other
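The independence assumption just says the joint likelihood of a caption's predicates factorises into a product of per-predicate terms (a sum in log space), even for mutually exclusive predicates. A minimal illustration, with made-up probabilities:

```python
import math

def caption_log_likelihood(predicate_probs):
    """predicate_probs: P(predicate_i | image, assignment) for each
    predicate in the caption. Under the independence assumption the
    joint is just the product, i.e. the sum of logs."""
    return sum(math.log(p) for p in predicate_probs)

# Treating On(A, B) and Beside(A, B) as independent multiplies their
# probabilities, even though they cannot both hold.
score = caption_log_likelihood([0.5, 0.5])  # = log(0.25)
```

This is what makes L(θ) cheap to evaluate inside EM; the cost is exactly the modelling error the slide flags.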
Training relationships modelled
- C_A, the noun model, is implemented as a nearest-neighbour-based likelihood model
- C_R, the relationship model, is implemented as a decision-stump-based likelihood model
- Most relationships are modelled correctly
- A few were not:
  - in: 'not captured by colour, shape, and location'(?)
  - on-top-of
  - taller, due to the poor segmentation algorithm
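A decision-stump likelihood is just a threshold on one differential feature. A sketch for a relationship like 'above', with an assumed image coordinate convention (y grows downward) and illustrative probabilities:

```python
def stump_likelihood(d_y, threshold=0.0, p_true=0.9, p_false=0.1):
    """P('above' | I_jk) from the vertical-location difference
    d_y = y_j - y_k. Region j above region k means a smaller y,
    i.e. d_y below the threshold (assuming y grows downward)."""
    return p_true if d_y < threshold else p_false
```

A single thresholded feature explains why relationships such as 'in' fail: no one differential feature of colour, shape, or location separates the positive and negative pairs.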
Inference model
- Given trained C_A and C_R from the above model
- Find P(n1, n2, ... | I1, I2, ..., C_A, C_R)
- Each region is represented by a noun node
- Edges between nodes are weighted by the likelihood obtained from the differential features
Fig. 3 from A. Gupta & L.S. Davis
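The inference step above can be sketched as brute-force search over noun assignments, scoring each by its noun likelihoods (standing in for C_A) times its pairwise relationship likelihoods (standing in for C_R). The exhaustive enumeration and the callback signatures are assumptions; the paper's actual inference over the graph would not enumerate all permutations:

```python
from itertools import permutations

def best_labeling(regions, nouns, noun_lik, pair_lik):
    """regions: region ids; nouns: candidate nouns;
    noun_lik(n, r): likelihood of region r under noun n;
    pair_lik(n1, n2, r1, r2): edge weight from differential features."""
    best, best_score = None, float("-inf")
    for assign in permutations(nouns, len(regions)):
        score = 1.0
        for r, n in zip(regions, assign):      # node (noun) terms
            score *= noun_lik(n, r)
        for i in range(len(regions)):          # edge (relationship) terms
            for j in range(i + 1, len(regions)):
                score *= pair_lik(assign[i], assign[j], regions[i], regions[j])
        if score > best_score:
            best, best_score = assign, score
    return best
```

With uniform noun likelihoods, the edge terms alone pick the labeling that satisfies the learned relationship priors, mirroring the Car/Street example earlier.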
Experimental setup
- Corel5K dataset
- 850 training images, tagged with nouns and manually labeled relationships
- Vocabulary size: 173 nouns, 19 relationships
- Same segmentation and feature vectors as Duygulu et al., "Object recognition as machine translation", ECCV (2002)
- Training model test set: 150 images (from the training set)
- Inference model test set: 100 images (given that those images have the same vocabulary)
Training model evaluation

Two metrics are used:
- Range semantics: counts the number of correctly labeled words, treating each label with the same weight
- Frequency counts: counts the number of correctly labeled regions, which weights more frequent words higher
- Compared to the simple IBM Model 1 (MT model, 1993) and the Duygulu et al. MT model
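One plausible reading of the two metrics, sketched in code; the exact formulations (per-word averaging vs. per-region counting) are assumptions, not taken from the paper:

```python
def range_semantics(pred_by_word, gold_by_word):
    """Each vocabulary word counts once: average, over words, of the
    fraction of that word's gold regions recovered.
    pred_by_word / gold_by_word: word -> set of region ids."""
    per_word = [
        len(pred_by_word.get(w, set()) & regions) / len(regions)
        for w, regions in gold_by_word.items()
    ]
    return sum(per_word) / len(per_word)

def frequency_counts(pred, gold):
    """Each region counts once, so frequent words dominate.
    pred / gold: region id -> word."""
    return sum(pred.get(r) == w for r, w in gold.items()) / len(gold)
```

A rare word labeled wrong barely moves `frequency_counts` but costs a full share of `range_semantics`, which is why the slide reports both.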
Inference model evaluation

- Annotating unseen images
- Doesn't use the Corel annotations due to missing labels
- 24% and 17% reduction in missed labels
- 63% and 59% reduction in false labels
Inference model examples
Duygulu et al. is the top row and the paper's results are the bottom row
Inference model Precision-Recall
Duygulu et al. is [1]
Novelties and limitations
Achievements:
- Novel use of prepositions and comparative adjectives for automatic annotation
- Uses previous annotation models for bootstrapping
- Good results

Limitations:
- Only uses two-argument predicates, which results in relationships like 'greener'
- Can't do the pink flower example
- Assumes a one-to-one relationship between nouns and image segments
Questions?
- One of the motivations was the co-occurrence problem. Wouldn't a simpler model with better training data solve this problem?
- Image caption generation to annotation stack?
- Model simplification: is assuming independence of predicates justified?
- Does it scale with the vocabulary size and the number of relationships used? 'Bluer' and 'greener' only work for outdoor scenes.