
Sidekick Policy Learning for Active Visual Exploration

Santhosh K. Ramakrishnan¹ and Kristen Grauman²

¹ The University of Texas at Austin, Austin, TX 78712
² Facebook AI Research, 300 W. Sixth St., Austin, TX 78701

[email protected], [email protected]

Abstract. We consider an active visual exploration scenario, where an agent must intelligently select its camera motions to efficiently reconstruct the full environment from only a limited set of narrow field-of-view glimpses. While the agent has full observability of the environment during training, it has only partial observability once deployed, being constrained by what portions it has seen and what camera motions are permissible. We introduce sidekick policy learning to capitalize on this imbalance of observability. The main idea is a preparatory learning phase that attempts simplified versions of the eventual exploration task, then guides the agent via reward shaping or initial policy supervision. To support interpretation of the resulting policies, we also develop a novel policy visualization technique. Results on active visual exploration tasks with 360° scenes and 3D objects show that sidekicks consistently improve performance and convergence rates over existing methods. Code, data and demos are available.³

Keywords: Visual Exploration · Reinforcement Learning

1 Introduction

Visual recognition has witnessed dramatic successes in recent years. Fueled by benchmarks composed of Web photos, the focus has been inferring semantic labels from human-captured images—whether classifying scenes, detecting objects, or recognizing activities [51,41,57]. By relying on human-taken images, the common assumption is that an intelligent agent will have already decided where and how to capture the input views. While sufficient for handling static repositories of photos (e.g., auto-tagging Web photos and videos), assuming informative observations glosses over a very real hurdle for embodied vision systems.

A resurgence of interest in perception tied to action takes aim at that hurdle. In particular, recent work explores agents that optimize their physical movements to achieve a specific perception goal, e.g., for active recognition [43,29,31,2,28], visual exploration [30], object manipulation [40,49,46], or navigation [70,21,2]. In any such setting, deep reinforcement learning (RL) is a promising approach.

⋆ On leave from University of Texas at Austin ([email protected])
³ Project website: http://vision.cs.utexas.edu/projects/sidekicks/



Fig. 1: Embodied agents that actively explore novel objects (left) or 360° environments (right) intelligently select camera motions to gain as much information as possible with very few glimpses. While they naturally face limited observability of the environment, during learning fuller observability may be available. We propose sidekicks to guide policy learning for active visual exploration.

The goal is to learn a policy that dictates the best action for the given state, thereby integrating sequential control decisions with visual perception.

However, costly exploration stages and partial state observability are well-known impediments to RL. In particular, an active visual agent [70,21,71,30] has to take a long series of actions purely based on the limited information available from its first person view. Due to poor action selection based on limited information, the most effective viewpoint trajectories are buried among many mediocre ones, impeding the agent's exploration in complex state-action spaces.

We observe that agents lacking full observability when deployed may nonetheless possess full observability during training, in some cases. Overall, the imbalance occurs naturally when an agent is trained with a broader array of sensors than available at test-time, or trained free of the hard time pressures that limit test-time exploration. In particular, as we will examine in this work, once deployed, an active exploration agent can only move the camera to "look-around" nearby [30], yet if trained with omnidirectional panoramas, could access any possible viewpoint while learning. Similarly, an active object recognition system [29,31,2,65,28] can only see its previously selected views of the object; yet if trained with CAD models, it could observe all possible views while learning. Additionally, agents can have access to multiple sensors during training in simulation environments [13,48,10], yet operate on first-person observations during test-time. However, existing methods restrict the agent to the same partial observability during training [65,31,29,30,70,28].

We propose to leverage the imbalance of observability. To this end, we introduce sidekick policy learning. We use the name "sidekick" to signify how a sidekick to a hero (e.g., in a comic or movie) provides alternate points of view, knowledge, and skills that the hero does not have. In contrast to an expert [19,61], a sidekick complements the hero (agent), yet cannot solve the main task at hand.

We propose two sidekick variants. Both use access to the full state during a preparatory training period to facilitate the agent's ultimate learning task. The first sidekick previews individual states, estimates their value, and shapes rewards to the agent for visiting valuable states during training. The second sidekick provides initial supervision via trajectory selections to accelerate the agent's training, while gradually permitting the agent to act on its own. In both cases, the sidekicks learn to solve simplified versions of the main task with full observability, and use insights from those solutions to aid the training of the agent. At test time, the agent has to act without the sidekick.

We validate sidekick policy learning for active visual exploration [30]. The agent enters a novel environment and must select a sequence of camera motions to rapidly understand its entire surroundings. For example, an agent that has explored various grocery stores should enter a new one and, with a couple glimpses, 1) conjure a belief state for where different objects are located, then 2) direct its camera to flesh out the harder-to-predict objects and contexts. The task is like active recognition [65,31,29,2], except that the training signal is pixelwise reconstruction error for the full environment rather than labeling error. Our sidekicks can look at any part of the environment in any sequence during training, whereas the actual agent is limited to physically feasible camera motions and sees only those views it has selected. On two standard datasets [66,65], we show how sidekicks accelerate training and promote better look around policies.

As a secondary contribution, we present a novel policy visualization technique. Our approach takes the learned policy as input, and displays a sequence of heatmaps showing regions of the environment most responsible for the agent's selected actions. The resulting visualizations help illustrate how sidekick policy learning differs from traditional training.

2 Related Work

Active vision and attention: Linking intelligent control strategies to perception has early foundations in the field [1,6,5,63]. Recent work explores new strategies for active object recognition [65,31,29,2,28], object localization [9,20,71], and visual SLAM [32,58], in order to minimize the number of sampled views required to perform accurate recognition or reconstruction. Our work is complementary to any of the above: sidekick policy learning is a means to accelerate and improve active perception when observability is greater during training.

Models of saliency and attention allow a system to prioritize portions of its observation to reduce clutter or save computation [42,4,45,68,67]. However, unlike both our work and the active methods above, they assume full observability at test time, selecting among already-observed regions. Work in active sensor placement aims to place sensors in an environment to maximize coverage [11,36,62]. We introduce a model for coverage in our policy learning solution (Sec. 3.3). However, rather than place and fix N static sensors, the visual exploration tasks entail selecting new observations dynamically and in sequence.

Supervised learning with observability imbalance: Prior work in supervised learning investigates ways to leverage greater observability during training, despite more limited observability during test time. Methods for depth estimation [22,16,60] and/or semantic segmentation [56,25,26] use RGBD depth data, multiple views, and/or auxiliary annotations during training, then proceed with single image observations at test time. Similarly, self-supervised losses [44,27] based on auxiliary prediction tasks at training time have been used to aid representation learning for control tasks. Knowledge distillation [24] lets a "teacher" network guide a "student" with the motivation of network compression. In learning with privileged information, an "expert" provides the student with training data having extra information (unavailable during testing) [61,53,37]. At a high level, all the above methods relate to ours in that a simpler learning task facilitates a harder one. However, in strong contrast, they tackle supervised classification/regression/representation learning, whereas our goal is to learn a policy for selecting actions. Accordingly, we develop a very different strategy—introducing rewards and trajectory suggestions—rather than auxiliary labels/modalities.

Guiding policy learning: There is a wide body of work aimed at addressing sparse rewards and partial observability. Several works explore reward shaping motivated by different factors. The intrinsic motivation literature develops parallel reward mechanisms, e.g., based on surprise [47,7], to direct exploration. The TAMER framework [33,34,35] utilizes expert human rewards about the end-task. Potential-based reward shaping [23] incorporates expert knowledge grounded in potential functions to ensure policy invariance. Others convert control tasks into a supervised measurement prediction task by defining goals and rewards as functions of measurements [12]. In contrast to all these approaches, our sidekicks exploit the observability difference between training and testing to transfer knowledge from a simpler version of the task. This external knowledge directly impacts the final policy learned by augmenting task related knowledge via reward shaping.

Behavior cloning provides expert-generated trajectories as supervised (state, action) pairs [8,17,14,50]. Offline planning, e.g., with tree search, is another way to prepare good training episodes by investing substantial computation offline [19,3,54], but observability is assumed to be the same between training and testing. Guided policy search uses importance sampling to optimize trajectories within high-reward regions [39] and can utilize full observability [38], yet transfers from an expert in a purely supervised fashion. Our second sidekick also demonstrates good action sequences, but we specifically account for the observability imbalance by annealing supervision over time.

More closely related to our goal is the asymmetric actor critic, which leverages synthetic images to train a robot to pick/push an object [48]. Full state information from the graphics engine is exploited to better train the critic. While this approach modifies the advantage expected for a state like our first sidekick, this is only done at the task level. Our sidekick injects a different perspective by solving simpler versions of the task, leading to better performance (Sec. 4.2).

Policy visualization: Methods for post-hoc explanation of deep networks are gaining attention due to their complexity and limited interpretability. In supervised learning, heatmaps indicating regions of an image most responsible for a decision are generated via backprop of the gradient for a class label [55,15,52]. In reinforcement learning, policies for visual tasks (like Atari) are visualized using t-SNE maps [69] or heatmaps highlighting the parts of a current observation that are important for selecting an action [18]. We introduce a policy visualization method that reflects the influence of an agent's cumulative observations on its action choices, and use it to illuminate the role of sidekicks.

3 Approach

Our goal is to learn a policy for controlling an agent's camera motions such that it can explore novel environments and objects efficiently. Our key insight is to facilitate policy learning via sidekicks that exploit 1) full observability and 2) unlimited time steps to solve a simpler problem in a preparatory training phase.

We first formalize the problem setup in Sec. 3.1. After overviewing observation completion as a means of active exploration in Sec. 3.2, we introduce our sidekick learning framework in Sec. 3.3. We tie together the observation completion and sidekick components with the overall learning objective in Sec. 3.4. Finally, we present our policy visualization technique in Sec. 3.5.

3.1 Problem setup: active visual exploration

The problem setting builds on the "learning to look around" challenge introduced in [30]. Formally, the task is as follows. The agent starts by looking at a novel environment (or object) X from some unknown viewpoint.⁴ It has a budget T of time to explore the environment. The learning objective is to minimize the error in the agent's pixelwise reconstruction of the full—mostly unobserved—environment using only the sequence of views selected within that budget.

Following [30], we discretize the environment into a set of candidate viewpoints. In particular, the space of viewpoints is a viewgrid indexed by N elevations and M azimuths, denoted by V(X) = {x(X, θ(i)) | 1 ≤ i ≤ MN}, where x(X, θ(i)) is the 2D view of X from viewpoint θ(i), which is comprised of two angles. More generally, θ(i) could capture both camera angle and position; however, to best exploit existing datasets, we limit camera motions to rotations.

The agent expends the budget in discrete increments, called "glimpses", by selecting T−1 camera motions in sequence. At each time step, the agent gets observation x_t from the current viewpoint. The agent makes an exploratory rotation (δ_t) based on its policy π. When the agent executes action δ_t ∈ A, the viewpoint changes according to θ_{t+1} = θ_t + δ_t. For each camera motion δ_t executed by the agent, a reward r_t is provided by the environment (Sec. 3.3 and 3.4). Using the view x_t, the agent updates its internal representation of the environment, denoted V̂(X). Because camera motions are restricted to have proximity to the current camera angle (Sec. 4.1) and candidate viewpoints partially overlap, the discretization promotes efficiency without neglecting the physical realities of the problem (following [43,29,30,31]).

⁴ For simplicity of presentation, we represent an environment as X where the agent explores a novel scene, looking outward in new viewing directions. However, experiments will also use X as an object where the agent moves around an object, looking inward at it from new viewing angles.
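To make the discretized action space concrete, the following is a minimal sketch (not taken from the paper; the grid size, the neighborhood radius, and the clamping/wrapping behavior at the grid borders are illustrative assumptions) of a viewgrid index space with unit-cost camera motions restricted to a local neighborhood.

```python
# Hypothetical sketch of the viewgrid index space and the restricted action set.
# N elevations x M azimuths; actions are (delta_elev, delta_azim) offsets within a
# small neighborhood window (no "teleporting"). All values below are assumptions.
from itertools import product

N, M = 4, 8                      # elevations x azimuths (SUN360-like, assumed)
DELTA_ELEV, DELTA_AZIM = 1, 2    # neighborhood radius (3 x 5 window, assumed)

# Candidate camera motions A: all offsets inside the neighborhood window.
ACTIONS = list(product(range(-DELTA_ELEV, DELTA_ELEV + 1),
                       range(-DELTA_AZIM, DELTA_AZIM + 1)))

def step(theta, action):
    """Apply a camera motion: elevation is clamped, azimuth wraps around."""
    elev, azim = theta
    d_elev, d_azim = action
    new_elev = min(max(elev + d_elev, 0), N - 1)   # cannot look past the poles
    new_azim = (azim + d_azim) % M                 # panorama wraps horizontally
    return (new_elev, new_azim)

# Example: starting at an arbitrary view, one glimpse per time step.
theta = (2, 5)
for action in [(1, 2), (0, -2), (-1, 1)]:          # T - 1 = 3 motions for T = 4
    theta = step(theta, action)
    print("now looking at (elev, azim) =", theta)
```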


Fig. 2: Active observation completion. The agent receives one view (shown in red), updates its belief and reconstructs the viewgrid at each time step (t = 1, 2, 3). It executes an action (red arrows) according to its policy to obtain the next view. The active agent must rapidly refine its belief with well-chosen views.

3.2 Recurrent observation completion network

We start with the deep RL neural network architecture proposed in [30] to represent the agent's recurrent observation completion. The process is deemed "completion" because the agent strives to hallucinate portions of the environment it has not yet seen. It consists of five modules: Sense, Fuse, Aggregate, Decode, and Act with parameters W_s, W_f, W_r, W_d and W_a respectively.

– Sense: Independently encodes the view (x_t) and proprioception (p_t) consisting of elevation at time t and relative motion from time t−1 to t, and returns the encoded tuple s_t = Sense(x_t, p_t).

– Fuse: Consists of fully connected layers that jointly encode the tuple s_t and output a fused representation f_t = Fuse(s_t).

– Aggregate: An LSTM that aggregates fused inputs over time to build the agent's internal representation a_t = Aggregate(f_1, f_2, ..., f_t) of X.

– Decode: A convolutional decoder which reconstructs the viewgrid V̂_t = Decode(a_t) as a set of MN feature maps (3MN for 3-channeled images) corresponding to each view of the viewgrid.

– Act: Given the aggregated state a_t and proprioception p_t, the Act module outputs a probability distribution π(δ|a_t) over the candidate camera motions δ ∈ A. An action sampled from this distribution δ_t = Act(a_t, p_t) is executed.

At each time step, the agent receives and encodes a new view x_t, then updates its internal representation a_t by sensing, fusing, and aggregating. It decodes the viewgrid V̂_t and executes δ_t to change the viewpoint. It repeats the above steps until the time budget T is reached (see Fig. 2). See Supp. for implementation details and architecture diagram.
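For concreteness, a schematic PyTorch sketch of how the five modules compose into one time step is given below. All layer sizes, the glimpse resolution, and the proprioception encoding are placeholder assumptions, not the authors' architecture (their Supp. has the actual design).

```python
# Schematic sketch of the Sense/Fuse/Aggregate/Decode/Act loop (sizes are placeholders).
import torch
import torch.nn as nn

class ObservationCompletionNet(nn.Module):
    def __init__(self, n_views=32, n_actions=15, view_feat=256, hidden=256):
        super().__init__()
        # Sense: encode the 32x32 glimpse and the proprioception vector separately.
        self.sense_view = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 8 * 8, view_feat))
        self.sense_pose = nn.Linear(2, 16)      # (elevation, relative motion)
        # Fuse: jointly encode the (view, pose) tuple.
        self.fuse = nn.Sequential(nn.Linear(view_feat + 16, hidden), nn.ReLU())
        # Aggregate: LSTM over fused inputs -> internal belief a_t.
        self.aggregate = nn.LSTM(hidden, hidden, batch_first=True)
        # Decode: belief -> full viewgrid (MN maps; 3MN channels for RGB).
        self.decode = nn.Sequential(nn.Linear(hidden, 3 * n_views * 32 * 32),
                                    nn.Unflatten(1, (3 * n_views, 32, 32)))
        # Act: belief (+ pose) -> distribution over candidate camera motions.
        self.act = nn.Linear(hidden + 16, n_actions)

    def step(self, x_t, p_t, state=None):
        s_view, s_pose = self.sense_view(x_t), self.sense_pose(p_t)
        f_t = self.fuse(torch.cat([s_view, s_pose], dim=1))
        a_t, state = self.aggregate(f_t.unsqueeze(1), state)
        a_t = a_t.squeeze(1)
        viewgrid = self.decode(a_t)                       # reconstructed viewgrid
        logits = self.act(torch.cat([a_t, s_pose], dim=1))
        return viewgrid, torch.distributions.Categorical(logits=logits), state

# One rollout step on a dummy glimpse:
net = ObservationCompletionNet()
x = torch.randn(1, 3, 32, 32)       # current view x_t
p = torch.zeros(1, 2)               # proprioception p_t
viewgrid, policy, state = net.step(x, p)
action = policy.sample()            # camera motion delta_t
```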

3.3 Sidekick definitions

Sidekicks provide a preparatory learning phase that informs policy learning. Sidekicks have full observability during training: in particular, they can observe the results of arbitrary camera motions in arbitrary sequence. This is impossible for the actual look-around agent—who must enter novel environments and respect physical camera motion and budget constraints—but it is practical for the sidekick with fully observed training samples (e.g., a 360° panoramic image or 3D object model, cf. Sec. 4.1). Sidekicks are trained to solve a simpler problem with relevance to the ultimate look-around agent, serving to accelerate training and help the agent converge to better policies. In the following, we define two sidekick variants: a reward-based sidekick and a demonstration-based sidekick.


Reward-based sidekick: The reward-based sidekick aims to identify a set of K views {x(X, θ_1), ..., x(X, θ_K)} which can provide maximal information about the environment X. The sidekick is allowed to access X and select views without any restrictions. Hence, it addresses a simplified completion problem.

A candidate view is scored based on how informative it is, i.e., how well the entire environment can be reconstructed given only that view. We train a completion model (cf. Sec. 3.2) that can reconstruct V(X) from any single view (i.e., we set T = 1). Let V̂(X|y) denote the decoded reconstruction for X given only view y as input. The sidekick scores the information in observation x(X, θ) as:

Info(x(X, θ), X) ∝ −d( V̂(X | x(X, θ)), V(X) ),    (1)

where d denotes the reconstruction error and V(X) is the fully observed environment. We use a simple ℓ2 loss on pixels for d to quantify information. Higher-level losses, e.g., for detected objects, could be employed when available. The scores are normalized to lie in [0, 1] across the different views of X. The sidekick scores each candidate view. Then, in order to sharpen the effects of the scoring function and avoid favoring redundant observations, the sidekick selects the top K most informative views with greedy non-maximal suppression. It iteratively selects the view with the highest score and suppresses all views in the neighborhood of that view until K views are selected (see Supp. for details). This yields a map of favored views for each training environment. See Fig. 3, top row.

The sidekick conveys the results to the agent during policy learning in the form of an augmented reward (to be defined in Sec. 3.4). Thus, the reward-based sidekick previews observations and encourages the selection of those individually valuable for reconstruction. Note that while the sidekick indexes views in absolute angles, the agent will not; all its observations are relative to its initial (random) glimpse direction. This works because the sidekick becomes a part of the environment, i.e., it attaches rewards to the true views of the environment. In short, the reward-based sidekick shapes rewards based on its exploration with full observability.
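A minimal NumPy sketch of this selection step is shown below, assuming the per-view scores of Eq. 1 have already been computed with the T = 1 completion model and normalized to [0, 1]; the suppression radius is an assumption (the authors' exact setting is in their Supp.).

```python
# Hypothetical sketch: greedy non-maximal suppression over normalized info scores.
import numpy as np

def select_sidekick_views(info_scores, K=4, nms_radius=1):
    """info_scores: (N, M) array of Eq. 1 scores in [0, 1] for one environment X.
    Returns (elev, azim) indices of K non-redundant, highly informative views."""
    scores = info_scores.copy()
    N, M = scores.shape
    selected = []
    for _ in range(K):
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        selected.append((i, j))
        # Suppress the neighborhood of the chosen view (azimuth wraps around).
        for di in range(-nms_radius, nms_radius + 1):
            for dj in range(-nms_radius, nms_radius + 1):
                ii, jj = i + di, (j + dj) % M
                if 0 <= ii < N:
                    scores[ii, jj] = -np.inf
    return selected

# Example with random scores standing in for Eq. 1 outputs:
rng = np.random.default_rng(0)
fake_scores = rng.random((4, 8))
print(select_sidekick_views(fake_scores, K=4))
```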

Demonstration-based sidekick: Our second sidekick generates trajectories of informative views. Given a starting view in X, the demonstration sidekick selects a trajectory of T views that are deemed to be most informative about X. Unlike the reward-based sidekick above, this sidekick offers guidance with respect to a starting state, and it is subject to the same camera motion restrictions placed on the main agent. Such restrictions model how an agent cannot teleport its camera using one unit of effort.

To identify informative trajectories, we first define a scoring function that captures coverage. Coverage reflects how much information x(X, θ) contains about each view in X. The coverage score for view θ(j) upon selecting view θ(i) is:

Coverage_X(θ(j) | θ(i)) ∝ −d( x̂(X, θ(j)), x(X, θ(j)) ),    (2)

where x̂(X, θ(j)) denotes an inferred view within V̂(X | x(X, θ(i))), as estimated using the same T = 1 completion network used by the reward-based sidekick.


Fig. 3: Top left shows the 360° environment's viewgrid, indexed by viewing elevation and azimuth. Top: Reward sidekick scores individual views based on how well they alone permit inference of the viewgrid X (Eq 1). The grid of scores (center) is post-processed with non-max suppression to prioritize K non-redundant views (right), then is used to shape the agent's rewards. Bottom: Demonstration sidekick. Left "grid-of-grids" displays example coverage score maps (Eq 2) for all θ(i), θ(j) view pairs. The outer N × M grid considers each θ(i), and each inner N × M grid considers each θ(j) for the given θ(i) (bottom left). A pixel in that grid is bright if coverage is high for θ(j) given θ(i), and dark otherwise. Each θ(i) denotes an (elevation, azimuth) pair. While observed views and their neighbors are naturally recoverable (brighter), the sidekick uses broader environment context to also anticipate distant and/or different-looking parts of the environment, as seen by the non-uniform spread of scores in the left grid. Given the coverage function and a starting position, this sidekick selects actions to greedily optimize the coverage objective (Eq 3). The bottom right strip shows the cumulative coverage maps as each of the T = 4 glimpses is selected.

Coverage scores are normalized to lie in [0, 1] for 1 ≤ i, j ≤ MN. The total coverage of a selected set of views Θ is:

C(Θ, X) = Σ_{j=1}^{MN} Σ_{θ∈Θ} Coverage_X(θ(j) | θ).    (3)

The goal of the demonstration sidekick is to maximize the coverage objective (Eqn. 3), where Θ = {θ_1, ..., θ_t} denotes the sequence of selected views, and C(Θ, X) saturates at 1. In other words, it seeks a sequence of reachable views such that all views are "explained" as well as possible. See Fig. 3, bottom panel.

The policy of the sidekick (π_s) is to greedily select actions based on the coverage objective. The objective encourages the sidekick to select views such that the overall information obtained about each view in X is maximized.

π_s(Θ) = argmax_δ C(Θ ∪ {θ_t + δ}, X).    (4)
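The greedy selection of Eq. 4 can be sketched as follows. The sketch assumes the pairwise coverage scores of Eq. 2 are precomputed as a matrix, that per-view coverage accumulates and is clipped at 1, and that the objective is averaged over views so it saturates at 1; the paper's exact accounting may differ.

```python
# Hypothetical sketch of the demonstration sidekick's greedy view selection (Eq. 4).
# Assumes coverage[i, j] = Coverage_X(theta(j) | theta(i)) from Eq. 2, in [0, 1].
import numpy as np

def greedy_coverage_trajectory(coverage, neighbors, start, T=4):
    """coverage: (V, V) matrix of pairwise coverage scores for one environment.
    neighbors: dict mapping a view index to the view indices reachable in one motion.
    Returns the sequence of T selected view indices starting from `start`."""
    trajectory = [start]
    covered = coverage[start].copy()           # running per-view coverage
    for _ in range(T - 1):
        best, best_val = None, -1.0
        for cand in neighbors[trajectory[-1]]:
            # Per-view coverage accumulates but is clipped at 1 (assumed), so the
            # objective favors explaining views that are still poorly covered.
            val = np.minimum(covered + coverage[cand], 1.0).mean()
            if val > best_val:
                best, best_val = cand, val
        trajectory.append(best)
        covered = np.minimum(covered + coverage[best], 1.0)
    return trajectory

# Toy example: 12 views on a ring, each reachable from its 2 neighbors.
V = 12
rng = np.random.default_rng(1)
cov = rng.random((V, V))
nbrs = {i: [(i - 1) % V, (i + 1) % V] for i in range(V)}
print(greedy_coverage_trajectory(cov, nbrs, start=0))
```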

We use these sidekick-generated trajectories as supervision to the agent for a short preparatory period. The goal is to initialize the agent with useful insights learned by the sidekick to accelerate training of better policies. We achieve this through a hybrid training procedure that combines imitation and reinforcement. In particular, for the first t_sup time steps, we let the sidekick drive the action selection and train the policy based on a supervised objective. For steps t_sup to T, we let the agent's policy drive the action selection and use REINFORCE [64] or Actor-Critic [59] to update the agent's policy (see Sec. 4). We start with t_sup = T and gradually reduce it to 0 in the preparatory sidekick phase (see Supp.). This step relates to behavior cloning [8,17,14], which formulates policy learning as supervised action classification given states. However, unlike typical behavior cloning, the sidekick is not an expert. It solves a simpler version of the task, then backs away as the agent takes over to train with partial observability.
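The hand-off from sidekick to agent can be sketched as a simple per-episode schedule; the annealing rate below is an assumption (the actual schedule is described in the Supp.).

```python
# Hypothetical sketch of the annealed hand-off: the sidekick drives the first
# t_sup steps of each episode, the agent drives the rest; t_sup shrinks to 0.
def supervision_schedule(epoch, T=4, anneal_every=2):
    """Return t_sup for this epoch, starting at T and decaying to 0 (assumed rate)."""
    return max(T - epoch // anneal_every, 0)

def choose_action(t, t_sup, sidekick_action, agent_action):
    # Sidekick supervises early steps (imitation loss); agent acts afterwards (RL loss).
    return sidekick_action if t < t_sup else agent_action

for epoch in range(0, 10, 2):
    print("epoch", epoch, "-> t_sup =", supervision_schedule(epoch))
```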

3.4 Policy learning with sidekicks

Having defined the two sidekick variants, we now explain how they influence policy learning. The goal is to learn the policy π(δ|a_t) which returns a distribution over actions for the aggregated internal representation a_t at time t. Let A = {δ_i} denote the set of camera motions available to the agent.

Our agent seeks the policy that minimizes reconstruction error for the environment given a budget of T camera motions (views). If we denote the set of weights of the network [W_s, W_f, W_r, W_d, W_a] by W, W excluding W_a by W_{/a}, and W excluding W_d by W_{/d}, then the overall weight update is:

ΔW = (1/n) Σ_{j=1}^{n} ( λ_r ΔW^{rec}_{/a} + λ_p ΔW^{pol}_{/d} ),    (5)

where n is the number of training samples, j indexes over the training samples, λ_r and λ_p are constants, and ΔW^{rec}_{/a} and ΔW^{pol}_{/d} update all parameters except W_a and W_d, respectively. The pixel-wise MSE reconstruction loss (L^{rec}_t) and corresponding weight update at time t are given in Eqn. 6, where x̂_t(X, θ(i)) denotes the reconstructed view at viewpoint θ(i) and time t, and Δ_0 denotes the offset to account for the unknown starting azimuth (see [30]).

L^{rec}_t(X) = Σ_{i=1}^{MN} d( x̂_t(X, θ(i) + Δ_0), x(X, θ(i)) ),
ΔW^{rec}_{/a} = − Σ_{t=1}^{T} ∇_{W_{/a}} L^{rec}_t(X).    (6)
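A sketch of the per-timestep loss of Eq. 6 follows, where the unknown starting azimuth is handled by shifting the decoded viewgrid by Δ_0 along the azimuth axis before the pixelwise comparison. The tensor layout and the sign convention of the shift are illustrative assumptions.

```python
# Hypothetical sketch of L^rec_t: pixelwise MSE between the decoded viewgrid,
# re-aligned by the azimuth offset Delta_0, and the ground-truth viewgrid.
import torch
import torch.nn.functional as F

def reconstruction_loss(decoded, target, delta0_azim):
    """decoded, target: (B, N, M, C, H, W) viewgrids; delta0_azim: offset in azimuth bins.
    The roll makes aligned[:, :, j] correspond to decoded[:, :, j + Delta_0]
    (sign convention assumed for illustration)."""
    aligned = torch.roll(decoded, shifts=-delta0_azim, dims=2)
    per_pixel = F.mse_loss(aligned, target, reduction='none')
    return per_pixel.sum(dim=(1, 2, 3, 4, 5)).mean()  # sum over views/pixels, mean over batch

B, N, M, C, H, W = 2, 4, 8, 3, 32, 32
decoded, target = torch.randn(B, N, M, C, H, W), torch.randn(B, N, M, C, H, W)
print(reconstruction_loss(decoded, target, delta0_azim=3))
```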

The agent's reward at time t (see Eqn. 7) consists of the intrinsic reward from the sidekick r^s_t = Info(x(X, θ_t), X) (see Sec. 3.3) and the negated final reconstruction loss (−L^{rec}_T(X)).

r_t = r^s_t                       for 1 ≤ t ≤ T − 2,
r_t = −L^{rec}_T(X) + r^s_t       for t = T − 1.    (7)

The update from the policy (see Eqn. 8) consists of the REINFORCE update, with a baseline b to reduce variance, and supervision from the demonstration sidekick (see Eqn. 9). We consider both REINFORCE [64] and Actor-Critic [59] methods to update the Act module. For the latter, the policy term additionally includes a loss to update a learned Value Network (see Supp.). For both, we include a standard entropy term to promote diversity in action selection and avoid converging too quickly to a suboptimal policy.

ΔW^{pol}_{/d} = Σ_{t=1}^{T−1} ∇_{W_{/d}} log π(δ_t | a_t) ( Σ_{t′=t}^{T−1} r_{t′} − b(a_t) ) + ΔW^{demo}_{/d},    (8)

The demonstration sidekick influences policy learning via a cross entropy loss between the sidekick's policy π_s (cf. Sec. 3.3) and the agent's policy π:

ΔW^{demo}_{/d} = Σ_{t=1}^{T−1} Σ_{δ∈A} ∇_{W_{/d}} ( π_s(δ | a_t) log π(δ | a_t) ).    (9)

We pretrain the Sense, Fuse, and Decode modules with T = 1. The full network is then trained end-to-end (with Sense and Fuse frozen). For training with sidekicks, the agent is augmented either with additional rewards from the reward sidekick (Eqn. 7) or an additional supervised loss from the demonstration sidekick (Eqn. 9). As we will show empirically, training with sidekicks helps overcome uncertainty due to partial observability and learn better policies.
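A schematic sketch of the combined update of Eqs. 8 and 9 is given below: a REINFORCE-style policy-gradient term with a baseline, plus a cross-entropy term toward the demonstration sidekick's action distribution. The tensor shapes and the reliance on autograd are illustrative; this is not the authors' implementation, and the entropy and Actor-Critic terms are omitted.

```python
# Hypothetical sketch of the policy loss combining Eq. 8 (REINFORCE with baseline)
# and Eq. 9 (cross entropy toward the demonstration sidekick's policy).
import torch

def policy_loss(log_probs, rewards, baselines, sidekick_probs=None, action_probs=None):
    """log_probs: (T-1,) log pi(delta_t | a_t) of the taken actions.
    rewards: (T-1,) shaped rewards; baselines: (T-1,) baseline b(a_t).
    sidekick_probs, action_probs: (T-1, |A|) distributions for the imitation term."""
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # sum_{t'>=t} r_t'
    pg_loss = -(log_probs * (returns - baselines).detach()).sum()         # minimizing = Eq. 8 ascent
    demo_loss = torch.tensor(0.0)
    if sidekick_probs is not None:
        # Cross entropy between pi_s and pi, summed over time steps (cf. Eq. 9).
        demo_loss = -(sidekick_probs * torch.log(action_probs + 1e-8)).sum()
    return pg_loss + demo_loss

T = 4
log_probs = torch.log(torch.rand(T - 1, requires_grad=True))
loss = policy_loss(log_probs, rewards=torch.tensor([0.7, 0.2, -0.8]),
                   baselines=torch.zeros(T - 1))
loss.backward()   # gradients flow back through log pi, as in the REINFORCE update
```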

3.5 Visualizing the learned motion policies

Finally, we propose a visualization technique to qualitatively understand the policy that has been learned. The aggregated state a_t is used by the policy network to determine the action probabilities. To analyze which part of the agent's belief (a_t) is important for the current selected action δ_t, we solve for the change in the aggregated state (Δa_t) which maximizes the change in the predicted action distribution (π(·|a_t)):

Δa* = argmax_{Δa_t} Σ_{δ∈A} ( π(δ | a_t) − π(δ | a_t + Δa_t) )²    s.t. ||Δa_t|| ≤ C ||a_t||,    (10)

where C is a constant that limits the deviation in norm from the true belief. Eqn. 10 is maximized using gradient ascent (see Supp.). This change in belief is visualized in the viewgrid space by forward propagating through the Decode module. The visualized heatmap intensities (H_t) are defined as follows:

H_t ∝ ||Decode(a_t + Δa*) − Decode(a_t)||²_2.    (11)

The heatmap indicates which parts of the agent's belief would have to change to affect its action selection. The views with high intensity are those that affect the agent's action selection the most.
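The procedure of Eqs. 10 and 11 can be sketched as projected gradient ascent on a belief perturbation, followed by decoding the perturbed and original beliefs. The step size, iteration count, and projection step are assumptions, and the act/dec modules below are toy stand-ins for the trained Act and Decode modules.

```python
# Hypothetical sketch of the policy visualization of Eqs. 10-11.
import torch

def visualize_policy(a_t, act_module, decode_module, C=0.1, steps=50, lr=0.01):
    """a_t: (1, D) belief; act_module: a_t -> action probabilities; decode_module: a_t -> viewgrid."""
    with torch.no_grad():
        p_ref = act_module(a_t)                       # pi(. | a_t), held fixed
    delta = torch.zeros_like(a_t, requires_grad=True)
    max_norm = C * a_t.norm()
    for _ in range(steps):
        obj = ((p_ref - act_module(a_t + delta)) ** 2).sum()   # Eq. 10 objective
        obj.backward()
        with torch.no_grad():
            delta += lr * delta.grad                  # gradient ascent step
            if delta.norm() > max_norm:               # project onto ||delta|| <= C ||a_t||
                delta *= max_norm / delta.norm()
            delta.grad.zero_()
    with torch.no_grad():
        # Eq. 11: heatmap from the change induced in the decoded viewgrid.
        heat = (decode_module(a_t + delta) - decode_module(a_t)) ** 2
    return heat

# Toy stand-ins for the trained Act and Decode modules:
act = torch.nn.Sequential(torch.nn.Linear(256, 15), torch.nn.Softmax(dim=1))
dec = torch.nn.Linear(256, 4 * 8)        # one intensity per viewgrid cell (toy)
H = visualize_policy(torch.randn(1, 256), act, dec)
print(H.shape)
```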


4 Experiments

In Sec. 4.1, 4.2, we describe our experimental setup and analyze the learning efficiency and test-time performance of different methods. In Sec. 4.3, we visualize learned policies and demonstrate the superiority of our policies over a baseline.

4.1 Experimental Setup

Datasets: We use two popular datasets to benchmark our models.

– SUN360: SUN360 [66] consists of high resolution spherical panoramas from multiple scene categories. We restrict our experiments to the 26 category subset used in [66,30]. The viewgrid consists of 32×32 views captured across 4 elevations (-45° to 45°) and 8 azimuths (0° to 180°). At each step, the agent sees a 60° field-of-view. This dataset represents an agent looking out at a scene in a series of narrow field-of-view glimpses.

– ModelNet Hard: ModelNet [65] provides a collection of 3D CAD models for different categories of objects. ModelNet-40 and ModelNet-10 are provided subsets consisting of 40 and 10 object categories respectively, the latter being a subset of the former. We train on objects from the 30 categories not present in ModelNet-10 and test on objects from the unseen 10 categories. We increase completion difficulty in "ModelNet Hard" by rendering with more challenging lighting conditions, textures and viewing angles than [30]; see Supp. It consists of 32×32 views sampled from 5 elevations and 9 azimuths. This dataset represents an agent looking in at a 3D object and moving it to a series of selected poses.

For both datasets, the candidate motions A are restricted to a 3 elevations × 5 azimuths neighborhood, representing the set of unit-cost actions. Neighborhood actions mimic real-world scenarios where the agent's physical motions are constrained (i.e., no teleporting), consistent with recent active vision work [30,43,29,28,2]. The budget for number of steps is fixed to T = 4.

Baselines: We benchmark our methods against several baselines:

– one-view: the agent trained to reconstruct from one view (T = 1).
– rnd-actions: samples actions uniformly at random.
– ltla [30]: our implementation of the "learning to look around" approach [30]. We verified our code reproduces results from [30].
– rnd-rewards: naive sidekick where rewards are assigned uniformly at random on the viewgrid.
– asymm-ac [48]: approach from [48] adapted for discrete actions. Critic sees the entire panorama/object and true camera poses (no experience replay).
– demo-actions: actions selected by the demo-sidekick while training / testing.
– expert-clone: imitation from an expert policy that uses full observability (similar to critic in Fig. 2 of Supp.).

Evaluation: We evaluate reconstruction error averaged over uniformly sampled elevations, azimuths and all test samples (avg). To provide a worst case analysis, we also report an adversarial metric (adv), which evaluates each agent on its hardest starting positions in each test sample and averages over the test data.
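Concretely, the two metrics can be sketched as follows, assuming a matrix of per-episode reconstruction errors with one row per test sample and one column per candidate starting viewpoint (a hypothetical layout).

```python
# Hypothetical sketch of the avg and adv metrics over a matrix of MSE errors,
# errors[s, v] = reconstruction error for test sample s started at viewpoint v.
import numpy as np

def avg_metric(errors):
    return errors.mean()                    # uniform over samples and start views

def adv_metric(errors):
    return errors.max(axis=1).mean()        # hardest start per sample, then average

rng = np.random.default_rng(2)
errors = rng.random((100, 32)) * 0.05       # 100 test samples x 32 starting views (toy)
print(avg_metric(errors), adv_metric(errors))
```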


                     SUN360                          ModelNet Hard
                     avg (×1000)     adv (×1000)     avg (×1000)     adv (×1000)
Method               mean ↓   % ↑    mean ↓   % ↑    mean ↓   % ↑    mean ↓   % ↑
one-view             38.31    -      55.12    -      9.63     -      17.10    -
rnd-actions          30.99    19.09  44.85    18.63  7.32     23.93  12.38    27.56
rnd-rewards          25.55    33.30  30.20    45.21  7.04     26.89  9.66     43.50
ltla [30]            24.94    34.89  31.86    42.19  6.30     34.57  8.78     48.65
asymm-ac [48]        23.74    38.01  29.92    45.72  6.24     35.20  8.55     50.00
expert-clone         23.98    37.38  28.50    48.28  6.41     33.44  8.52     50.13
ours(rew)            23.44    38.82  28.54    48.22  5.80     39.79  7.17     58.04
ours(demo)           24.24    36.73  29.01    47.36  6.32     34.37  8.64     49.47
ours(rew)+ac         23.36    39.01  28.26    48.72  5.75     40.26  7.10     58.44
ours(demo)+ac        24.05    37.22  28.52    48.26  6.13     36.31  8.26     51.64
demo-actions*        26.12    31.82  31.53    42.76  5.82     39.50  7.46     56.40

Table 1: Avg/Adv MSE errors ×1000 (↓ lower is better) and corresponding improvements (%) over the one-view model (↑ higher is better), for the two datasets. The best and second best performing models are highlighted in green and blue respectively. Standard errors range from 0.2 to 0.3 on SUN360 and 0.1 to 0.2 on ModelNet Hard. (* - requires full observability at test time)

4.2 Active Exploration Results

Table 1 shows the results on both datasets. For each metric, we report the mean error along with the percentage improvement over the one-view baseline. Our methods are abbreviated ours(rew) and ours(demo) referring to the use of our reward- and demonstration-based sidekicks, respectively. We denote the use of Actor-Critic instead of REINFORCE with +ac.

We observe that ours(rew) and ours(demo) with REINFORCE generally perform better than ltla with REINFORCE [30]. In particular, ours(rew) performs significantly better than ltla on both datasets on all metrics. ours(demo) performs better on SUN360, but is only slightly better on ModelNet Hard. Figure 4 shows the validation loss plots; using the sidekicks leads to significant improvement in the convergence rate over ltla.

Figure 5 compares example decoded reconstructions. We stress that the vast majority of pixels are unobserved when decoding the belief state, i.e., only 4 views out of the entire viewing sphere are observed. Accordingly, they are blurry. Regardless, their differences indicate the differences in belief states between the two methods. A better policy more quickly fleshes out the general shape of the scene or object.

Next, we compare our model to asymm-ac, which is an alternate paradigm for exploiting full observability during training. First, we note that asymm-ac performs better than ltla across all datasets and metrics, making it a strong baseline. Comparing asymm-ac with ours(rew)+ac and ours(demo)+ac, we see our methods still perform considerably better on all metrics and datasets. As we show in the Supp, our methods also lead to faster convergence.

Fig. 4: Validation errors (×1000) vs. epochs on SUN360 (left) and ModelNet Hard (right). All models shown here use REINFORCE (see Supp. for more curves). Our approach accelerates convergence.

Fig. 5: Qualitative comparison of ours(rew) vs. ltla [30] on SUN360 (first 2 rows) and ModelNet Hard (last 2 rows). The first column shows the groundtruth viewgrid and a randomly selected starting point (marked in red). The 2nd and 3rd columns contain the decoded viewgrids from ltla and ours(rew) after T = 4 time steps. The reconstructions from ours(rew) are visibly better. For example, in the 3rd row, our model reconstructs the protrusion more clearly; in the 2nd row, our model reconstructs the sky and central hills more effectively. Best viewed on pdf with zoom.

In order to contrast learning from sidekicks with learning from experts, we additionally compare our models to behavior cloning an expert that exploits full observability at training time. As shown in Tab. 1, ours(rew) outperforms expert-clone on both the datasets, validating the strength of our approach. It is particularly interesting because training an expert takes a lot longer (17×) than training sidekicks (see Supp.). When compared with demo-actions, an ablated version of ours(demo) that requires full observability at test time, our performance is still significantly better on SUN360 and slightly better on ModelNet Hard. ours(rew) and ours(demo) also beat the remaining baselines by a significant margin. These results verify our hypothesis that sidekick policy learning can improve over strong baselines by exploiting full observability during training.

4.3 Policy Visualization

We present our policy visualizations for ltla and ours(rew) on SUN360 in Figure 6; see Supp. for examples with ours(demo). The heatmap from Eq 10 is shown in pink and overlaid on the reconstructed viewgrids. For both models, the policies tend to take actions that move them towards views which have low heatmap density, as witnessed by the arrows / actions pointing to lower density regions. Intuitively, the agents move towards the views that are not contributing effectively to their action selection to increase their understanding of the scene. It can be observed in many cases that the ours(rew) model has a much denser heat map across time when compared to ltla. Therefore, ours(rew) takes more views into account for selecting its actions earlier in the trajectory, suggesting that a better policy and history aggregation leads to more informed action selection.

Fig. 6: Policy visualization: The viewgrid reconstructions of ours(rew) and ltla [30] are shown on two examples from SUN360. The first column shows the viewgrid with a randomly selected view (in red). Subsequent columns (t = 1 to t = 4) show the view received (in red), viewgrid reconstructed, action selected (red arrow), and the parts of the belief space our method deems responsible for the action selection (pink heatmap). Both the agents tend to move towards sparser regions of the heatmap, attempting to improve their beliefs about views that do not contribute to their action selection. ours(rew) improves its beliefs much more rapidly and as a result, performs more informed action selection.

5 Conclusion

We propose sidekick policy learning, a framework to leverage extra observability or fewer restrictions on an agent's motion during training to learn better policies. We demonstrate the superiority of policies learned with sidekicks on two challenging datasets, improving over existing methods and accelerating training. Further, we utilize a novel policy visualization technique to illuminate the different reasoning behind policies trained with and without sidekicks. In future work, we plan to investigate the effectiveness of our framework on other active vision tasks such as recognition and navigation.

Acknowledgements

The authors thank Dinesh Jayaraman, Thomas Crosley, Yu-Chuan Su, and Ishan Durugkar for helpful discussions. This research is supported in part by DARPA Lifelong Learning Machines, a Sony Research Award, and an IBM Open Collaborative Research Award.


References

1. Aloimonos, J., Weiss, I., Bandyopadhyay, A.: Active vision. International Journal of Computer Vision (1988)
2. Ammirato, P., Poirson, P., Park, E., Kosecka, J., Berg, A.C.: A dataset for developing and benchmarking active vision. In: Robotics and Automation, 2017 IEEE International Conference on (2017)
3. Anthony, T., Tian, Z., Barber, D.: Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems (2017)
4. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)
5. Bajcsy, R.: Active perception. Proceedings of the IEEE (1988)
6. Ballard, D.H.: Animate vision. Artificial intelligence (1991)
7. Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems (2016)
8. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
9. Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. In: Computer Vision, 2015 IEEE International Conference on (2015)
10. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied Question Answering. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on (2018)
11. Dhillon, S.S., Chakrabarty, K.: Sensor placement for effective coverage and surveillance in distributed sensor networks. In: Wireless Communications and Networking, 2003. WCNC 2003. 2003 IEEE (2003)
12. Dosovitskiy, A., Koltun, V.: Learning to act by predicting the future. In: International Conference on Learning Representations (2017)
13. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on Robot Learning (2017)
14. Duan, Y., Andrychowicz, M., Stadie, B., Ho, O.J., Schneider, J., Sutskever, I., Abbeel, P., Zaremba, W.: One-shot imitation learning. In: Advances in Neural Information Processing Systems (2017)
15. Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Computer Vision, 2017 IEEE International Conference on (2017)
16. Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (2016)
17. Giusti, A., Guzzi, J., Ciresan, D.C., He, F.L., Rodríguez, J.P., Fontana, F., Faessler, M., Forster, C., Schmidhuber, J., Di Caro, G., et al.: A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters (2016)
18. Greydanus, S., Koul, A., Dodge, J., Fern, A.: Visualizing and understanding atari agents. CoRR (2017)
19. Guo, X., Singh, S., Lee, H., Lewis, R.L., Wang, X.: Deep learning for real-time atari game play using offline monte-carlo tree search planning. In: Advances in Neural Information Processing Systems (2014)
20. Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: Computer Vision and Pattern Recognition, 2017 IEEE Conference on (2017)

21. Gupta, S., Fouhey, D., Levine, S., Malik, J.: Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125 (2017)
22. Gupta, S., Hoffman, J., Malik, J.: Cross modal distillation for supervision transfer. In: Computer Vision and Pattern Recognition, 2016 IEEE Conference on (2016)
23. Harutyunyan, A., Devlin, S., Vrancx, P., Nowe, A.: Expressing arbitrary reward functions as potential-based advice. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
24. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
25. Hong, S., Noh, H., Han, B.: Decoupled deep neural network for semi-supervised semantic segmentation. In: Advances in neural information processing systems (2015)
26. Hong, S., Oh, J., Lee, H., Han, B.: Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In: Computer Vision and Pattern Recognition, 2016 IEEE Conference on (2016)
27. Jaderberg, M., Mnih, V., Czarnecki, W.M., Schaul, T., Leibo, J.Z., Silver, D., Kavukcuoglu, K.: Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397 (2016)
28. Jayaraman, D., Grauman, K.: End-to-end policy learning for active visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
29. Jayaraman, D., Grauman, K.: Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion. In: European Conference on Computer Vision (2016)
30. Jayaraman, D., Grauman, K.: Learning to look around: Intelligently exploring unseen environments for unknown tasks. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on (2018)
31. Johns, E., Leutenegger, S., Davison, A.J.: Pairwise decomposition of image sequences for active multi-view recognition. In: Computer Vision and Pattern Recognition, 2016 IEEE Conference on (2016)
32. Kim, A., Eustice, R.M.: Perception-driven navigation: Active visual slam for robotic area coverage. In: Robotics and Automation, 2013 IEEE International Conference on (2013)
33. Knox, W.B., Stone, P.: Interactively shaping agents via human reinforcement: The tamer framework. In: Proceedings of the fifth international conference on Knowledge capture (2009)
34. Knox, W.B., Stone, P.: Combining manual feedback with subsequent mdp reward signals for reinforcement learning. In: Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (2010)
35. Knox, W.B., Stone, P.: Reinforcement learning from simultaneous human and mdp reward. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (2012)
36. Krause, A., Guestrin, C.: Near-optimal observation selection using submodular functions. In: AAAI (2007)
37. Lapin, M., Hein, M., Schiele, B.: Learning using privileged information: Svm+ and weighted svm. Neural Networks (2014)
38. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research (2016)
39. Levine, S., Koltun, V.: Guided policy search. In: International Conference on Machine Learning (2013)
40. Levine, S., Pastor, P., Krizhevsky, A., Quillen, D.: Learning hand-eye coordination for robotic grasping with large-scale data collection. In: Kulic, D., Nakamura, Y., Khatib, O., Venture, G. (eds.) 2016 International Symposium on Experimental Robotics (2017)

41. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision (2014)
42. Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine intelligence (2011)
43. Malmir, M., Sikka, K., Forster, D., Movellan, J.R., Cottrell, G.: Deep q-learning for active recognition of germs: Baseline performance on a standardized dataset for active learning. In: British Machine Vision Conference (2015)
44. Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016)
45. Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems (2014)
46. Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J., Levine, S.: Combining self-supervised learning and imitation for vision-based rope manipulation. In: Robotics and Automation, 2017 IEEE International Conference on (2017)
47. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning (2017)
48. Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W., Abbeel, P.: Asymmetric actor critic for image-based robot learning. Robotics: Science and Systems (2018)
49. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: Robotics and Automation, 2016 IEEE International Conference on (2016)
50. Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (2011)
51. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (2015)
52. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Computer Vision, 2017 IEEE International Conference on (2017)
53. Sharmanska, V., Quadrianto, N., Lampert, C.H.: Learning to rank using privileged information. In: Computer Vision, 2013 IEEE International Conference on. IEEE (2013)
54. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature (2017)
55. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
56. Song, S., Zeng, A., Chang, A.X., Savva, M., Savarese, S., Funkhouser, T.: Im2pano3d: Extrapolating 360 structure and semantics beyond the field of view. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on (2018)
57. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)

58. Spica, R., Giordano, P.R., Chaumette, F.: Active structure from motion: Application to point, sphere, and cylinder. IEEE Transactions on Robotics (2014)
59. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction
60. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: Computer Vision and Pattern Recognition, 2017 IEEE Conference on (2017)
61. Vapnik, V., Izmailov, R.: Learning with intelligent teacher. In: Symposium on Conformal and Probabilistic Prediction with Applications (2016)
62. Wang, B.: Coverage problems in sensor networks: A survey. ACM Computing Surveys (CSUR) (2011)
63. Wilkes, D., Tsotsos, J.K.: Active object recognition. In: Computer Vision and Pattern Recognition, 1992. IEEE Computer Society Conference on (1992)
64. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. In: Reinforcement Learning (1992)
65. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Computer Vision and Pattern Recognition, 2015 IEEE Conference on (2015)
66. Xiao, J., Ehinger, K.A., Oliva, A., Torralba, A.: Recognizing scene viewpoint using panoramic place representation. In: Computer Vision and Pattern Recognition, 2012 IEEE Conference on (2012)
67. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning (2015)
68. Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Computer Vision and Pattern Recognition, 2013 IEEE Conference on (2013)
69. Zahavy, T., Ben-Zrihem, N., Mannor, S.: Graying the black box: Understanding dqns. In: International Conference on Machine Learning (2016)
70. Zhu, Y., Gordon, D., Kolve, E., Fox, D., Fei-Fei, L., Gupta, A., Mottaghi, R., Farhadi, A.: Visual Semantic Planning using Deep Successor Representations. In: Computer Vision, 2017 IEEE International Conference on (2017)
71. Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: Robotics and Automation, 2017 IEEE International Conference on (2017)