look-ahead before you leap: end-to-end active recognition by...

1
Active recognition setting Difficulty: unconstrained visual input Opportunity: Not restricted to a single snapshot. Strategically acquiring new views. passive 1-view setting active setting mug? bowl? pan? mug pan mug? bowl? pan? mug Perception Perception Action selection Evidence fusion Perception Action selection Evidence fusion JOINT TRAINING Look-ahead FORECASTING THE EFFECTS OF ACTIONS Our idea: Multi-task joint training of components for active recognition + auxiliary internally supervised “look-ahead” task. Components of the active recognition pipeline Our idea [Wilkes 1992, Dickinson 1997, Borotschnig 1998, Schiele 1998, Denzler 2002, Soatto 2009, Ramanathan 2011, Aloimonos 2011, Borotschnig 2011, Wu 2015, Malmir 2015, Johns 2016, …] Prior art: independent, often heuristic components Closely intertwined perception, action and fusion components High-level active RNN system architecture SENSOR AGGREGATOR ACTOR CLASSIFIER LOOK-AHEAD SENSOR AGGREGATOR ACTOR CLASSIFIER LOOK-AHEAD Time Time +1 SUN 360 panoramas GERMS toy manipulation [Xiao 2012] [Malmir 2015] ModelNet-10 3-D models [Wu 2015] SCENE CATEGORIES HAND-HELD OBJECTS 3-D CAD MODELS Towards real-world active recognition Complex real-world categories + easily benchmarkable setups. Component module architectures SENSOR AGGREGATOR ACTOR CLASSIFIER LOOK-AHEAD “ Elev: 30°pose image SENSOR + DELAY 1 step History AGGREGATOR Look-ahead 2 error Time −1 Time action aggregate feature LOOK-AHEAD sample action ACTOR 1 2 3 4 5 1 2 3 4 predicted class likelihoods CLASSIFIER 5 End-to-end joint training with gradient descent + REINFORCE 35 40 45 50 55 60 65 1 2 3 4 5 Accuracy #views SUN 360 Passive neural net Transinformation [Schiele98] SeqDP [Denzler03] Transinformation+SeqDP Ours 75 80 85 90 95 1 3 6 #views ModelNet-10 Passive neural net ShapeNets [Wu15] Pairwise [Johns 16] Ours 25 30 35 40 45 50 1 2 3 #views GERMS Passive neural net Transinformation [Schiele98] SeqDP[Denzler03] Transinformation+SeqDP Ours Our method strongly outperforms representative traditional active recognition approaches on all tasks. Selected view sequence examples Top guess Top guess Top guess GERMS Toy instances SUN 360 Scene categories Quantitative evaluation results Ablation studies 40 45 50 55 60 65 1 2 3 4 5 #views SUN 360 40 42 44 46 48 1 2 3 #views GERMS Random views (avg) Random views (recurrent) Training all 3 components jointly is most critical to performance. Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion Dinesh Jayaraman and Kristen Grauman UT Austin

Upload: others

Post on 19-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Look-ahead before you leap: End-to-end active recognition by …vision.cs.utexas.edu/projects/lookahead_active/eccv16... · 2016. 10. 17. · 2011, Aloimonos 2011, Borotschnig 2011,

Active recognition setting

Difficulty: unconstrained visual input

Opportunity:

• Not restricted to a single snapshot.

• Strategically acquiring new views.

passive 1-view setting active setting

mug?bowl?pan?

mug

pan

mug?bowl?pan?

mug

Perception Perception

Action selection

Evidence fusion

Perception Action selection Evidence fusion

JOIN

T TR

AIN

ING

Look-ahead

FORECASTING THE EFFECTS OF ACTIONS

Our idea: Multi-task joint training of components for active recognition + auxiliary internally supervised “look-ahead” task.

Components of the active recognition pipeline

Our idea

[Wilkes 1992, Dickinson 1997, Borotschnig 1998, Schiele 1998, Denzler 2002, Soatto 2009, Ramanathan2011, Aloimonos 2011, Borotschnig 2011, Wu 2015, Malmir 2015, Johns 2016, …]

Prior art: independent, often heuristic components

Closely intertwined perception, action and fusion components

High-level active RNN system architecture

SENSOR

AGGREGATOR

ACTOR

CLASSIFIER

LOOK-AHEAD

SENSOR

AGGREGATOR

ACTOR

CLASSIFIER

LOOK-AHEAD

Time 𝑡 Time 𝑡+1

SUN 360 panoramas

GERMS toy manipulation

[Xia

o 2

01

2]

[Mal

mir

20

15

]

ModelNet-10 3-D models [Wu

20

15

]

SCENE CATEGORIES

HAND-HELDOBJECTS

3-D CADMODELS

Towards real-world active recognition

Complex real-world categories + easily benchmarkable setups.

Component module architectures

SENSOR

AGGREGATOR

ACTOR

CLASSIFIER

LOOK-AHEAD

“ El

ev: 30°”

pose image SEN

SOR

+ DELAY 1 step

History AG

GR

EGA

TOR

Look-ahead ℓ2 error

Time 𝑡 − 1 Time 𝑡

action aggregate feature

LOO

K-A

HEA

D sample action

AC

TOR

1

2

3

4

5

12

3

4

predicted class likelihoods

CLA

SSIF

IER5

End-to-end joint training with gradient descent + REINFORCE

35404550556065

1 2 3 4 5

Acc

ura

cy

#views

SUN 360

Passive neural netTransinformation [Schiele98]SeqDP [Denzler03]Transinformation+SeqDPOurs

75

80

85

90

95

1 3 6

#views

ModelNet-10

Passive neural net

ShapeNets [Wu15]

Pairwise [Johns 16]

Ours

25

30

35

40

45

50

1 2 3

#views

GERMS

Passive neural netTransinformation [Schiele98]SeqDP[Denzler03]Transinformation+SeqDPOurs

Our method strongly outperforms representative traditional active recognition approaches on all tasks.

Selected view sequence examples

Top guess Top guess Top guess

GER

MS

Toy

inst

ance

s

SUN

36

0Sc

ene

cat

ego

ries

Quantitative evaluation results

Ablation studies

40

45

50

55

60

65

1 2 3 4 5

#views

SUN 360

40

42

44

46

48

1 2 3

#views

GERMS

Random views(avg)

Random views(recurrent)

Training all 3 components jointly is most critical to performance.

Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion

Dinesh Jayaraman and Kristen GraumanUT Austin