
Towards recognition of everyday objects

Prof. Trevor Darrell

Where we are…

• Great progress in categorical object recognition in the computer vision community
– advances on Caltech-101, PASCAL, etc.

• Great progress in robotic sensing, esp. mapping, navigation, etc.
– SLAM, the Grand Challenge, etc.

• Yet, broad category-level robotic object recognition in general environments is still nearly nonexistent!

CV: faces and instances

Aachen Cathedral

snaptell.com

like.com

Robotics: Mapping, Navigation…

A modest proposal…

• A robot that can recognize / find every object in my office / kitchen / kid’s playroom?
– without requiring a grad student to collect multi-view training data of each object?

• I don’t think we are on the path to solve this with conventional SIFT + fancy kernels + sup. learning…
– even with LabelMe, ImageNet, MechTurk…

• It will likely involve:
– multiple sensing modalities (“views”) and semi-supervised learning (both manifold learning and co-training flavors)
– local features that respect the physics of image formation
– active learning at training, and attentive learning at test…
– limited “natural” interaction with a user

Computer Vision vs. Robotic Vision: Divergent Paradigms?

Computer Vision:
– Machine Learning paradigm
– Whoever has the largest dataset wins.
– Leads to a least common denominator in terms of features to use

category recognition focus; weak features…

Robotic Vision:
– Sensing paradigm; sensors are cheap; add them! …3-D sensing, multi-spectral, ultra-high-res…
– Whoever has the most sensors wins…
– But then we can only get training data from our environment (in situ)!

strong features; instance recognition focus…

Which one is right?

• Both!

• Neither!

• Key technical problems for next generation robotic visual recognition systems: how to…– bridge category and instance level learning

– fully leverage scene and task context

– simultaneously exploit labeled online data and unlabeled or weakly labeled in-situ data

• Robotics will drive next generation of object recognition challenges….

Rough evolution of visual object recognition research

1970s/80s  1990s  2000s  2010+

Get my bag…

?
• hierarchical labels
• scene/task context
• multimodal
• interactive
• robotic…

• Fusing multiple cues and discovering shared representations across categories…

• Visual Sense Disambiguation…

• Transparent local features…

Recent Progress: Combining Features, Overcoming Ambiguity


Multiple Cues and Context

Many Local Representations…

Wide variety of proposed local feature representations:

• Superpixels [Ren et al.]
• Shape context [Belongie et al.]
• Maximally Stable Extremal Regions [Matas et al.]
• Geometric Blur [Berg et al.]
• SIFT [Lowe]
• Salient regions [Kadir et al.]
• Harris-Affine [Schmid et al.]
• Spin images [Johnson and Hebert]

How to Compare Sets of Features?

• Each feature type yields a set of vectors per image
• Unordered, varying number of vectors per instance

Pyramid Match

• Optimal partial matching for sets with features of dimension
• Optimal matching vs. greedy matching vs. pyramid match

[Grauman and Darrell, ICCV 2005, JMLR 2007]
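The pyramid match idea can be sketched in a few lines: intersect multi-resolution histograms of the two feature sets, count the matches that are new at each level, and weight finer (more precise) matches more heavily. The sketch below uses 1-D features for clarity; the actual kernel of Grauman and Darrell operates on higher-dimensional feature sets.

```python
import numpy as np

def pyramid_match(X, Y, levels=4, bins0=16, lo=0.0, hi=1.0):
    """Illustrative pyramid match for 1-D feature sets X, Y with values
    in [lo, hi]. Histogram intersection at each resolution counts matches;
    matches new at a level are weighted by 1/2^level (finer = better)."""
    score, prev = 0.0, 0.0
    for level in range(levels):
        nbins = max(1, bins0 >> level)              # bins coarsen each level
        hx, _ = np.histogram(X, bins=nbins, range=(lo, hi))
        hy, _ = np.histogram(Y, bins=nbins, range=(lo, hi))
        inter = np.minimum(hx, hy).sum()            # histogram intersection
        new = inter - prev                          # matches new at this level
        score += new / (1 << level)
        prev = inter
    return score

a = np.array([0.1, 0.2, 0.9])
b = np.array([0.15, 0.5])
# self-match recovers the set size; the kernel is symmetric and nonnegative
assert abs(pyramid_match(a, a) - 3.0) < 1e-9
```

Because intersections can only grow as bins coarsen, each level's "new" match count is nonnegative, and the whole computation is linear in the number of features, which is the point of the method.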

Cue Combination

• Feature / kernel combination is a very active topic
– SVM MKL schemes (Varma et al.)
– Cross-validation approaches (recent ICCV papers…)
– Naïve Bayes Nearest Neighbor schemes (Boiman, Shechtman, Irani)

• Our Gaussian Process formulation has a significant efficiency advantage when using a 1-vs-all formulation
– most of the computation is inherently shared across categories

• Good news: significant accuracy improvement!

• Bad news: combination methods all perform about the same, at least on Caltech datasets…
– but combining does help when there are non-informative kernels…
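The simplest of these combination methods, uniform kernel averaging, is worth seeing concretely: a convex combination of positive semi-definite kernels is itself a valid kernel, so multiple feature channels can be fused before training a single classifier. The toy data and RBF kernels below are illustrative stand-ins for two feature channels, not the talk's actual features.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF kernel matrix for rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def average_kernels(kernels, weights=None):
    """Uniform (or weighted) kernel averaging: the baseline that MKL-style
    schemes are compared against. A convex combination of PSD kernels is PSD."""
    K = np.stack(kernels)
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    return np.tensordot(np.asarray(weights), K, axes=1)

# hypothetical toy data standing in for two feature channels
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = average_kernels([rbf_kernel(X, 0.5), rbf_kernel(X, 2.0)])
assert np.allclose(K, K.T)                      # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-9      # still positive semi-definite
```

The resulting matrix K can be handed to any kernel classifier (e.g. an SVM with a precomputed kernel); learned, non-uniform weights are what MKL adds on top.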

[See http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-96.html for details]

[Kapoor, Urtasun, Grauman, Darrell, IJCV, to appear]

Combining 2D and 3D

• How to best exploit 3-D sensors?
– very hard to get good local shape estimates

• Exploit 3-D sensing as context for 2-D recognition

• 3-D scene context
– 3-D can provide a summary of the overall environment
– find support surfaces

• Estimate and exploit absolute size constraints
– model size variation of the overall object and local patches based on training data or external knowledge

Indoor Support surfaces

Table surface extraction from 3-D scene data:
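A common way to pull support surfaces such as table tops out of 3-D scene data is RANSAC plane fitting; the sketch below is a minimal, generic version of that idea (illustrative, not the talk's exact extraction pipeline).

```python
import numpy as np

def ransac_plane(points, n_iter=200, thresh=0.01, rng=None):
    """Minimal RANSAC plane fit for an (N, 3) point cloud.
    Returns (normal, d, inlier_mask) for the plane n.x + d = 0 with the
    most points within `thresh` of it."""
    rng = np.random.default_rng(rng)
    best = (None, None, np.zeros(len(points), dtype=bool))
    for _ in range(n_iter):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-12:
            continue                       # degenerate (collinear) sample
        n = n / norm
        d = -n @ p0
        inliers = np.abs(points @ n + d) < thresh
        if inliers.sum() > best[2].sum():
            best = (n, d, inliers)
    return best

# toy scene: a horizontal "table" plane plus scattered outliers
rng = np.random.default_rng(1)
plane = np.c_[rng.uniform(-1, 1, (100, 2)), np.zeros((100, 1))]
pts = np.vstack([plane, rng.uniform(-1, 1, (20, 3))])
normal, d, mask = ransac_plane(pts, rng=0)
assert mask[:100].all()    # all table points recovered as inliers
```

Once the dominant plane is found, points just above it are candidate objects resting on the support surface.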

Indoor search constraints

• Use surface and size constraints…

Notion of Absolute Size

• Absolute feature size

• Absolute object size

Vs.

Vs.

Branch & Bound for Fast Detection

[Lampert]

• Feature computation
• Codebook matching
• Discriminative weights
• Sliding window
• Detections

Object Detection

• Exhaustive search is costly – in particular for 3-D

• Previous work on branch & bound provides a speed-up for 2-D

• Use upper bounds on bounding box intervals

[Figure: “all possible” bounding-box coordinate intervals (e.g. 2–4 × 3–5) are split and refined (e.g. 2–2.5 × 3–4), pruning branches whose upper bound is too low]
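The interval idea above can be made concrete with a compact sketch of Lampert et al.'s Efficient Subwindow Search: maintain sets of boxes as intervals over the four coordinates, bound each set by summing positive weights over its largest box and negative weights over its smallest, and split the widest interval best-first.

```python
import heapq
import numpy as np

def ess(score_map):
    """Branch & bound box search in the spirit of Efficient Subwindow Search.
    score_map[y, x] holds per-pixel classifier weights; returns the inclusive
    box (t, b, l, r) maximizing their sum, plus that score. Sketch only."""
    H, W = score_map.shape
    ipos = np.maximum(score_map, 0).cumsum(0).cumsum(1)   # integral images
    ineg = np.minimum(score_map, 0).cumsum(0).cumsum(1)

    def box_sum(ii, t, b, l, r):                          # inclusive box sum
        s = ii[b, r]
        if t > 0: s -= ii[t - 1, r]
        if l > 0: s -= ii[b, l - 1]
        if t > 0 and l > 0: s += ii[t - 1, l - 1]
        return s

    def upper(state):      # positives over largest box, negatives over smallest
        (t1, t2), (b1, b2), (l1, l2), (r1, r2) = state
        u = box_sum(ipos, t1, b2, l1, r2)
        if t2 <= b1 and l2 <= r1:                         # smallest box non-empty
            u += box_sum(ineg, t2, b1, l2, r1)
        return u

    start = ((0, H - 1), (0, H - 1), (0, W - 1), (0, W - 1))
    heap = [(-upper(start), start)]
    while heap:
        _, state = heapq.heappop(heap)
        i, iv = max(enumerate(state), key=lambda e: e[1][1] - e[1][0])
        if iv[1] == iv[0]:                 # all intervals collapsed: bound is exact
            t, b, l, r = (s[0] for s in state)
            return (t, b, l, r), box_sum(ipos, t, b, l, r) + box_sum(ineg, t, b, l, r)
        mid = (iv[0] + iv[1]) // 2
        for half in ((iv[0], mid), (mid + 1, iv[1])):
            child = state[:i] + (half,) + state[i + 1:]
            (t1, _), (_, b2), (l1, _), (_, r2) = child
            if t1 <= b2 and l1 <= r2:      # some valid box remains in this set
                heapq.heappush(heap, (-upper(child), child))

m = np.full((5, 5), -1.0)
m[1:3, 2:5] = 2.0
box, score = ess(m)
assert box == (1, 2, 2, 4) and score == 12.0
```

Because the bound is admissible, the first fully collapsed state popped from the heap is the globally best box, typically after examining far fewer states than a sliding window would.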

Current Detection Demo Results

Joint Category Learning

Standard “1 vs. all” paradigm…

SVM/GPC – Category 1

SVM/GPC – Category 2

SVM/GPC – Category 3

SVM/GPC – Category 4

SVM/GPC – Category 256

SVM/GPC – Category 10,000?

How to exploit shared structure?

Consider ensemble of classifiers

SVM/GPC – Category 1 w1

SVM/GPC – Category 2 w2

SVM/GPC – Category 3 w3

SVM/GPC – Category 4 w4

SVM/GPC – Category 256

SVM/GPC – Category 10,000 w10,000

classifier weights

Consider ensemble of classifiers

SVM/GPC – Category 1 w1

SVM/GPC – Category 2 w2

SVM/GPC – Category 256

SVM/GPC – Category n wn

W = [ w1 w2 … wn ]

Related tasks and/or object part structure will lead to correlated patterns in W… [Quattoni, Collins, Darrell, CVPR 2007] explore Ando+Zhang-style structure learning for scene recognition tasks.

Learn W jointly? [Quattoni, Collins, Darrell, CVPR 2008] explore joint sparse optimization via a matrix norm penalty.

[Quattoni, Carreras, Collins, Darrell, ICML 2009] report an efficient learning scheme for this approach…

Joint Sparse Approximation

• Consider learning a single sparse linear classifier of the form:

f(x) = w · x

That is, we want only a few features with non-zero coefficients.

• L1 regularization is well known to yield sparse solutions:

min_w  Σ_{(x,y)∈D} l(f(x), y) + C Σ_{j=1}^{d} |w_j|

The first term is the classification error; the L1 term penalizes non-sparse solutions.
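The sparsity-inducing effect of the L1 penalty is easy to demonstrate with proximal gradient descent (ISTA), where the penalty turns into a soft-thresholding step. This is a toy sketch on synthetic data, not the talk's solver.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 norm."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l1_logistic(X, y, C=0.1, lr=0.1, iters=2000):
    """ISTA for  min_w  mean log-loss + C * ||w||_1.
    Illustrates how the L1 penalty zeroes out irrelevant features."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # sigmoid predictions
        grad = X.T @ (p - y) / n             # logistic loss gradient
        w = soft_threshold(w - lr * grad, lr * C)
    return w

# synthetic data: only the first 2 of 10 features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
w = l1_logistic(X, y)
assert np.count_nonzero(np.abs(w) > 1e-6) <= 4   # sparse solution
assert abs(w[0]) > 1e-3 and abs(w[1]) > 1e-3     # informative features kept
```

The noise features never accumulate enough gradient to clear the threshold, so their coefficients stay exactly zero, which is the behavior the slide describes.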

Joint Sparse Approximation

Optimization over several tasks jointly, with a linear classifier f_k(x) = w_k · x for each task k = 1, …, m trained on dataset D_k:

min_{w_1,…,w_m}  Σ_{k=1}^{m} (1/|D_k|) Σ_{(x,y)∈D_k} l(f_k(x), y) + C R(w_1, …, w_m)

The first term is the average loss on training set k; the penalty R penalizes solutions that utilize too many features.

Key idea: use a matrix norm… [Obozinski et al. 2006, Argyriou et al. 2006, Amit et al. 2007]

Joint Regularization Penalty

How do we penalize solutions that use too many features? Collect the task weight vectors as the columns of a d × m matrix W = [ W_{i,k} ], where row i holds the coefficients for feature i across classifiers and column k holds the coefficients for classifier k. The ideal penalty

R(W) = # non-zero rows of W

would lead to a hard combinatorial problem.

Joint Regularization Penalty

We use an L1-∞ norm [Tropp 2006]:

R(W) = Σ_{i=1}^{d} max_k |W_{i,k}|

The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.

This norm combines:

• An L1 norm on the maximum absolute values of the coefficients across tasks, which promotes sparsity: use few features

• An L∞ norm on each row, which promotes non-sparsity within rows: share features
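The contrast between plain L1 and the L1-∞ norm is easiest to see numerically: two weight matrices with identical L1 mass get different L1-∞ penalties depending on whether the mass sits in shared rows.

```python
import numpy as np

def l1_inf(W):
    """L1-inf joint regularizer: sum over feature rows of the maximum
    absolute coefficient across tasks (columns)."""
    return np.abs(W).max(axis=1).sum()

# toy 2-feature, 3-task weight matrices (illustrative)
shared = np.array([[1.0, 1.0, 1.0],    # one feature used by all 3 tasks
                   [0.0, 0.0, 0.0]])
spread = np.array([[1.0, 0.0, 0.0],    # mass spread over both feature rows
                   [0.0, 1.0, 1.0]])

assert np.abs(shared).sum() == np.abs(spread).sum() == 3.0  # same plain L1
assert l1_inf(shared) == 1.0                                # sharing is cheap
assert l1_inf(spread) == 2.0                                # spreading costs more
```

Plain L1 cannot tell the two solutions apart, while L1-∞ charges once per active row, so the optimizer is pushed toward a few features shared across all tasks.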

Joint Sparse Approximation

Using the L1-∞ norm we can rewrite our objective function as:

min_W  Σ_{k=1}^{m} (1/|D_k|) Σ_{(x,y)∈D_k} l(f_k(x), y) + C Σ_{i=1}^{d} max_k |W_{i,k}|

For any convex loss this is a convex objective.

For the hinge loss the optimization problem can be expressed as a linear program. [Quattoni et al., CVPR 2008]

See also [Quattoni et al., ICML 2009] for efficient large-scale solutions.

News Image Classification Experiments

[Figure: Reuters dataset results – mean EER vs. number of training examples per task (15–240), comparing L2, L1, and L1-∞ regularization]

[Figure: heat maps of absolute weights across features and tasks for L1 vs. L1-∞; tasks include SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped coal miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq]

Visual Sense Disambiguation


Goal: Object recognition in situated environments

Imagine using natural dialogue to instantiate object models in a robot

That’s a cat over there…

This is one of my purses.

There’s a lamp…


Speech, image can be complementary…

a pan...

That’s a pen!

Copy machine..

ASR confusions: ant → fan, face → bass, piano → cannon

Using image to aid speech recognition

Object recognition

Experiments on Caltech101

• Asked users to speak the object name, with added noise

• Showed benefit from fusion at all noise levels

ASR confusions: ant → fan, face → bass, piano → cannon
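A standard way to realize this kind of speech/image fusion is a weighted log-linear combination of the two classifiers' posteriors over the same object vocabulary. The sketch below uses made-up toy posteriors; the talk's exact fusion rule may differ.

```python
import numpy as np

def fuse(p_speech, p_image, alpha=0.5):
    """Late fusion of two posterior distributions over the same label set
    via a weighted log-linear combination (alpha=0.5 is a geometric mean)."""
    logp = alpha * np.log(p_speech) + (1 - alpha) * np.log(p_image)
    p = np.exp(logp - logp.max())        # subtract max for stability
    return p / p.sum()

# toy vocabulary: ["ant", "fan", "pan", "pen"] (illustrative numbers)
p_speech = np.array([0.05, 0.45, 0.40, 0.10])   # ASR confuses fan/pan
p_image  = np.array([0.40, 0.05, 0.45, 0.10])   # vision confuses ant/pan
p = fuse(p_speech, p_image)
assert p.argmax() == 2    # fused posterior resolves to "pan"
```

Each modality is individually ambiguous, but their confusions differ, so the product of posteriors concentrates on the label both channels weakly support, which is the complementarity the slide points to.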

Large Object Vocabulary?

Object recognition

Large Object Vocabulary?

Category Discovery: “Watch”…


Sources of visual polysemy

• Would rather watch…
• Suicide watch
• Hurricane, tornado watch
• Watch out!
• Celebrity watch

Taking advantage of text contexts

Example web-page text for “watch”: “icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel s wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website”

Dictionary model

• Use entry text to learn a probability distribution over words for that sense

• Problem: entries contain very little text
– Expand by adding synonyms, hyponyms, 1st-level hypernyms
– Still, very few words are covered!

• S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)

• direct hyponym / full hyponym
– S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
– S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
– S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
– S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
– S: (n) wood mouse (any of various New World woodland mice)

• direct hypernym / inherited hypernym / sister term
– S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)

Topic space

Idea: use a large collection of unlabeled text to learn hidden topics which align with different senses/uses of the word.

Example topic words: live, mice, pet, rodent, ear, price, pest, house, bait, old, animal, human, tube, need, tail, species, gene, head, breed, body, love, color, care, friend, wood, cat, weight, white, water, …

Learning visual senses: overview

Example search results for “watch”:

Search Engine Watch: “Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search …” (searchenginewatch.com)

watch - MDC: “Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in …” (developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch)

Pipeline: dictionary definitions and unlabeled text feed a latent topic space; the resulting dictionary model gives P(sense | page), which, together with unlabeled images+text and training images, trains the visual sense classifier.

Notes: Given a word, learn a probabilistic model of each sense as defined by the dictionary; use the text model to construct sense-specific image classifiers. Assumptions: only noun entries, one sense per image.


Clustering + Word Sense Model

Search word: “cup”. Online dictionary senses:

• cup (a small open container usually used for drinking; usually has a handle) “he put the cup back in the saucer”; “the handle of the cup was missing” [object sense: drinking container]

• cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) “the school kept the cups in a special glass case” [object sense: loving cup (trophy)]

• a major sporting event or competition: “the world cup”, “the Stanley cup” [abstract sense: sporting event]

Filtering Abstract Senses


Concrete vs. abstract senses

Mouse: Noun
• <noun.animal> S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
• <noun.state> S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
• <noun.person> S: (n) mouse (person who is quiet or timid)
• <noun.artifact> S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad) “a mouse takes much more room than a trackball”

• How can we determine if a sense is concrete or abstract?
– Use a natural language processing method to learn a classifier
– Use existing dictionary information: e.g. WordNet’s lexical file tags

Filtering visual senses

Yahoo Search: “fork”. DICTIONARY:

1: (n) fork (cutlery used for serving and eating food)
2: (n) branching, ramification, fork, forking (the act of branching out or dividing into branches)
3: (n) fork, crotch (the region of the angle formed by the junction of two branches) “they took the south fork”; “he climbed into the crotch of a tree”
4: (n) fork (an agricultural tool used for lifting or digging; has a handle and metal prongs)
5: (n) crotch, fork (the angle formed by the inner sides of the legs where they join the human trunk)

Filtering visual senses

Artifact senses of “fork”: entries 1 (cutlery) and 4 (agricultural tool) from the dictionary entries above are retained.

Filtering visual senses

Yahoo Search: “telephone”. DICTIONARY:

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)

Filtering visual senses

Artifact sense of “telephone”: entry 1 (telephone set) from the dictionary entries above is retained.

Topic adaptation

• Original LDA topics are learned on text-only unlabeled data

• Adapt to image+text data via semi-supervised Gibbs sampling

E.g., one of the “fork” topics before adaptation:
product, bike, null, tool, tube, seal, set, price, oil, knife, spoon, spring, ship, use, item, accessory, handle, shop, order, remove, store, custom, home, weight, steel, supply, cap, clamp, fit, false, …

After adaptation:
cutlery, knife, spoon, product, set, price, handle, steel, tool, item, stainless, null, bike, tube, seal, oil, knive, kitchen, utensil, ship, order, use, table, spring, supply, design, piece, carve, weight, shop, …

“fork”: using original topics

fork lift, road fork, bike fork, etc.

“fork”: using adapted topics

knife, spoon, cutlery, bike fork, etc.

Work in progress…

• Showed that combining speech and image classifiers improves object reference resolution

• Proposed an unsupervised method to learn sense-specific object models from web text and image data

• Integrated large-scale demo forthcoming

See Kate’s NIPS 2009 & 2008 papers…

Transparent Local Features

Dealing with Transparency

Motivation

• Transparent objects made of glass or plastic are ubiquitous in domestic environments

• The traditional local feature approach is inappropriate

• A full physical model is intractable

Local Additive Feature Model

• Significant variation in patch appearance

• … but common latent structure

This motivates the new LDA-SIFT model.

LDA-SIFT

Transparent Visual Words

• For each patch we infer the latent mixture activations that characterize the additive structure

• We model the glass by learning a spatial layout of discrete “transparent local feature” activations
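The additive intuition, namely that a transparent patch is a superposition of glass structure and background structure, can be illustrated with a simple non-negative factorization. NMF here is a stand-in for the probabilistic topic model (LDA-SIFT) actually used in the talk; the histograms are toy data.

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Multiplicative-update NMF: V ≈ W @ H with non-negative factors.
    An additive model: patch histograms (columns of V) are explained as
    sums of shared latent components rather than single codebook words."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0.1, 1.0, (n, k))
    H = rng.uniform(0.1, 1.0, (k, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# two toy "component" histograms and patches that superimpose them
A = np.array([1.0, 0.0, 1.0, 0.0])
B = np.array([0.0, 1.0, 0.0, 1.0])
V = np.column_stack([A, B, A + B, 2 * A + B])
W, H = nmf(V, k=2)
assert np.linalg.norm(V - W @ H) < 0.5   # additive model explains the mixes
```

A hard vector-quantization codebook would have to assign the mixed patches to a single word and lose the superposition; the additive factorization recovers both contributing components, which is the property the transparent-feature model relies on.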

Transparent Visual Words

[Figure: latent components with their average occurrence on train and occurrences on test]

Recognition Architecture

[Diagram: LDA topic model over the (X, Y, T) feature layout, followed by a glass vs. background classifier]

Results: general vocabulary

• Training on 4 different glasses in front of a screen

• Testing on 49 glass instances in a home environment

• Sliding-window linear SVM detection

Recognition Architecture

[Diagram: sLDA (supervised LDA) over the (X, Y, T) feature layout, with the glass vs. background label incorporated into the topic model]

Results: sLDA

• Training on 4 different glasses in front of a screen

• Testing on 49 glass instances in a home environment

• Sliding-window linear SVM detection

Conclusion

• Traditional local feature models (VQ, NN) are poorly suited for transparent object recognition

• Proposed additive local feature models can detect superimposed image structures

• Developed a statistical approach to learn such representations using probabilistic topic models

• Sparse factorization of local gradient statistics

• Encouraging results on real-world data

Future Work

• Different feature representations; extend the model in a hierarchical fashion

• Investigate addition of material property cues; discriminative inverse local light transport models

• Explore benefits for opaque object recognition; understand the relationship to sparse image coding as well as to biologically motivated models

• Fusing multiple cues and discovering shared representations across categories…

• Visual Sense Disambiguation…

• Transparent local features…

Recent Progress: Combining Features, Overcoming Ambiguity


For more information…

• Probabilistic multi-kernel fusion – Christoudias, Urtasun, Darrell, CVPR 2009

• Joint regularization across categories – Quattoni, Carreras, Collins, Darrell, ICML 2009

• Machine learning for multimodal sense grounding – Saenko and Darrell, NIPS 2008, NIPS 2009

• Local feature models for transparent objects – Fritz, Bradski, Black, and Darrell, NIPS 2009
