
Touch Based POMDP Manipulation via Sequential Submodular Optimization

Ngo Anh Vien and Marc Toussaint

Abstract: Exploiting the submodularity of entropy-related objectives has recently led to a series of successes in machine learning and sequential decision making. Its generalized framework, adaptive submodularity, was later introduced to deal with uncertainty and partial observability, achieving near-optimal performance with simple greedy policies. As a consequence, adaptive submodularity is in principle a promising candidate for efficient touch-based localization in robotics. However, applying that method directly at the motion level scales poorly with the dimensionality of the system. Motivated by hierarchical partially observable Markov decision process (POMDP) planning, we integrate an action hierarchy into the existing adaptive submodularity framework. The proposed algorithm is expected to generate uncertainty-reducing actions effectively with the help of an action hierarchy. Experimental results on both a simulated robot and a Willow Garage PR2 platform demonstrate the efficiency of our algorithm.

I. INTRODUCTION

Efficient object manipulation typically requires a plan of actively contact-seeking actions to reduce uncertainty over the true environmental model, e.g., the poses and positions of objects to be grasped or touched, as well as of obstacles. While vision is usually the primary sensor information used to reduce uncertainty, in this paper we focus on haptic feedback. Humans are extremely skilled at object manipulation even when deprived of vision. We therefore consider the scenario of a robot entering a dark room and searching for an object on a table, as shown in Fig. 1. The only sensor information is the force/torque signal at the end-effector. This task is very challenging not only for robots but also for humans, as it involves a lot of uncertainty [1]. To solve this type of task, humans usually seek contacts with objects in order to disambiguate uncertainty. In principle, this task can be mathematically formulated as a partially observable Markov decision process (POMDP), whose state space consists of hidden states that can only be inferred through observations. For instance, the poses and locations of objects are not directly observable, but sensed contact forces are [2]-[4]. The resulting POMDP would have high-dimensional continuous state, observation, and action spaces, and a very long horizon if we consider low-level robot control as the action level. Solving this general POMDP is known to be PSPACE-hard [5]. Therefore approximation and heuristic methods are needed.

In this paper, we propose methods to approximate and efficiently solve the problem of manipulation under uncertainty via tactile feedback. We approximate the problem by using high-level actions, i.e. macro actions, to deal with the long-horizon problem [6], [7], and use a sample-based approach to deal with both the high-dimensional and the long-horizon problems [8]. Even with these approximations, naively applying standard

This work was supported by the EU-ICT Project 3rdHand 610878.
Ngo Anh Vien & Marc Toussaint are with the Machine Learning & Robotics Lab, University of Stuttgart, Germany, {vien.ngo;marc.toussaint}@ipvs.uni-stuttgart.de

[Fig. 1 graphic: (left) photo of the robot at the table; (right) action-hierarchy tree with Root above the subtasks Detect_Z and Detect_E, which expand into primitive actions a_1^Z ... a_N^Z and a_1^E ... a_M^E.]

Fig. 1. The peg-in-hole-like task: (left) a robot is localizing a table in order to press a button at the center of the table; (right) an action hierarchy: Detect_Z is a subtask to detect the table's height, Detect_E is a subtask to detect the table's edges.

POMDP solvers would still incur a relatively high cost to find a good policy. However, by re-designing the objective function to be submodular (i.e., exhibiting diminishing returns and monotonicity) we can efficiently apply the recently introduced adaptive submodularity framework, whose greedy policies are proven to guarantee near-optimal performance [9], [10].
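The greedy near-optimality property referenced above can be illustrated on a toy problem. The sketch below uses set cover as a stand-in monotone submodular objective; the action names and coverage sets are invented for illustration and are not the entropy-related objective used in the paper.

```python
# Minimal sketch: greedy maximization of a monotone submodular objective.
# The coverage function is illustrative only; the paper's objective is an
# entropy-related uncertainty-reduction criterion, not set cover.

def coverage(selected, cover_sets):
    """f(S) = number of elements covered; monotone and submodular."""
    covered = set()
    for a in selected:
        covered |= cover_sets[a]
    return len(covered)

def greedy(cover_sets, budget):
    """Pick `budget` actions, each maximizing the marginal gain.
    For monotone submodular f, greedy achieves >= (1 - 1/e) of optimal."""
    selected = []
    for _ in range(budget):
        gains = {a: coverage(selected + [a], cover_sets) - coverage(selected, cover_sets)
                 for a in cover_sets if a not in selected}
        selected.append(max(gains, key=gains.get))
    return selected

cover_sets = {
    "a1": {1, 2, 3},
    "a2": {3, 4},
    "a3": {4, 5, 6},
    "a4": {1, 6},
}
print(greedy(cover_sets, 2))  # a1 covers 3 new elements, then a3 adds 3 more
```

Each greedy step only evaluates marginal gains, so the policy is cheap to compute; the diminishing-returns property is exactly what makes this myopic choice provably near-optimal.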

Both methods, POMDP planning and adaptive submodularity, can seek contacts with objects for uncertainty disambiguation. We go one step further in combining them: to address more complex tasks in which contacts are harder to make, we propose to decompose the task into smaller sub-tasks, as in hierarchical POMDP planning [11]. Each sub-task then corresponds to exactly one adaptive submodular optimization task.

In summary, our contributions are three-fold:

- We integrate the benefit of hierarchical POMDP planning, using action decomposition, into the existing adaptive submodularity formulation [10]. The integration is expected to make adaptive submodularity able to tackle more complex tasks in which many actions might not return contact information. Such actions do not help in uncertainty disambiguation, as establishing contacts is the key to success in manipulation tasks. Action decomposition as in hierarchical POMDP planning is expected to guide the contact-seeking search better.

- The action set can consist of either contact-seeking actions, as in the previous method [10], or contact-keeping actions. All these actions are defined similarly to standard macro actions in hierarchical POMDPs: they are multiple-step actions that can terminate under certain conditions.

- Action selection at the higher level of subtasks can be effectively optimized via POMDP solvers with approximate models of the subtasks [11], or via adaptive submodularity with suitable cost functions [10].
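The macro-action notion used above (multi-step actions with termination conditions) can be sketched minimally as follows. The class, the toy 1-D world, and the Detect_Z-style termination rule are our own illustrative assumptions, not the paper's implementation.

```python
# Sketch of macro actions as in hierarchical POMDP planning: each macro
# action runs low-level steps until its termination condition fires.
# The toy 1-D descent below mirrors the Detect_Z subtask of Fig. 1 but
# is our own illustration, not the paper's controller.

class MacroAction:
    def __init__(self, name, step, terminated):
        self.name = name
        self.step = step              # low-level control: state -> next state
        self.terminated = terminated  # termination predicate on the state

    def run(self, state, max_steps=100):
        for _ in range(max_steps):
            if self.terminated(state):
                break
            state = self.step(state)
        return state

# Toy world: the end-effector descends until it touches a table at z = 0.3.
TABLE_Z = 0.3
move_down = MacroAction(
    "Detect_Z",
    step=lambda z: round(z - 0.1, 2),
    terminated=lambda z: z <= TABLE_Z,   # "contact" with the surface
)
print(move_down.run(1.0))  # stops at the table surface
```

The higher-level planner then chooses among such macro actions instead of raw motor commands, which is what shortens the effective horizon.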

In section II, we briefly review background knowledge about POMDPs and submodularity. Next, section III describes how to integrate an action hierarchy of a POMDP into an adaptive submodularity framework. Experimental results are described in section IV. Finally, our conclusion with some remarks is given in section V.

II. BACKGROUND

In this section, we briefly give an introduction to the POMDP framework and adaptive submodular optimization.

A. Partially Observable Markov Decision Process

Robot manipulation under uncertainty can in principle be formulated as a partially observable Markov decision process (POMDP). A POMDP is a 7-tuple {S, A, O, T, Z, C, \gamma}, where S, A, O are the state, control action, and observation spaces. The transition function T(s, a, s') = p(s' | s, a) defines the probability of the next state s' when taking action a in state s. The observation function Z(s', a, o) = p(o | s', a) defines the probability of observations. The cost function is C(s, a, s'), and the parameter \gamma is a discount factor. An agent must find an optimal policy, which is a mapping \pi : H -> A from the history space to the action space, that minimizes the expected cumulative discounted cost

J(\pi) = E\{ \sum_t \gamma^t c_t \}    (1)

A history h_t \in H is a sequence of t pairs of actions and observations, {a_0, o_0, a_1, o_1, ..., a_{t-1}, o_{t-1}}.
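To make the bookkeeping behind this formulation concrete, here is a minimal sketch of the standard discrete POMDP belief update, b'(s') proportional to Z(s', a, o) * sum_s T(s, a, s') b(s). The two-state "near"/"far" model and its numbers are illustrative assumptions, not part of the paper.

```python
# Sketch of the standard POMDP belief update over a tiny discrete state
# space. The "near"/"far" states and probabilities are illustrative only.

def belief_update(b, a, o, T, Z):
    states = list(b)
    b_new = {}
    for s2 in states:
        pred = sum(T[(s, a, s2)] * b[s] for s in states)  # prediction step
        b_new[s2] = Z[(s2, a, o)] * pred                  # correction step
    norm = sum(b_new.values())
    return {s: p / norm for s, p in b_new.items()}

states = ["near", "far"]
# Probing does not move the hidden object: the state is static.
T = {("near", "probe", "near"): 1.0, ("near", "probe", "far"): 0.0,
     ("far", "probe", "near"): 0.0, ("far", "probe", "far"): 1.0}
# Contact is likely if the object is near, unlikely if it is far.
Z = {("near", "probe", "contact"): 0.9, ("far", "probe", "contact"): 0.1}

b = {"near": 0.5, "far": 0.5}
b = belief_update(b, "probe", "contact", T, Z)
print(b)  # the contact observation shifts probability mass toward "near"
```

Note that with a static hidden state (identity transition), the update reduces to a plain Bayesian correction, which is exactly the special case exploited by adaptive submodularity in section II-B.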

In our example problem in Fig. 1, the states are s \in R^{n+n_e}, where x \in R^n is the robot's joint configuration (assumed to be observable) and e \in R^{n_e} is the environment state that is unobservable to the robot (i.e. the environment model, e.g. table position and size, object location). The control actions a are motor commands computed by the operational space/force controller. Observations o are the forces sensed by an F/T sensor at the wrist of the PR2 robot's arm. Alternatively, one can model observations as binary feedback, i.e. contacts. Based only on a sequence of contacts, the robot should be able to localize the table and accomplish its task. Assuming the robot arm always starts above the table, an optimal policy might look like this: the robot first moves down from the top until it senses contact with the table, at which point the table's surface is localized. Its next optimal macro action is to move sideways while keeping contact with the table's surface plane, until the contact vanishes. The robot can then infer the table's edges from those contact-losing positions. By finding more such points at the edges, the robot can disambiguate the uncertainty about the table's size, location, and orientation. However, finding such an approximately optimal policy for a POMDP is a non-trivial task [3], [8], [12], which has moreover been proven to be NP-hard [13].
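The descend-until-contact behavior described above can be mimicked by a simple Bayesian elimination over candidate table heights. The height grid, the noise-free binary contact model, and the chosen probe heights below are our simplifications for illustration, not the paper's setup.

```python
# Sketch: localizing the table height from binary contact feedback, as in
# the dark-room example. "No contact at height z" rules out table heights
# at or above z. The grid and noise-free sensing are our simplifications.

heights = [round(0.1 * i, 1) for i in range(10)]      # candidate table heights
belief = {h: 1.0 / len(heights) for h in heights}     # uniform prior

def no_contact_update(belief, z):
    """No contact at height z => the table surface lies below z."""
    post = {h: (p if h < z else 0.0) for h, p in belief.items()}
    total = sum(post.values())
    return {h: p / total for h, p in post.items()}

for z in [0.9, 0.7, 0.5]:     # descend; still no contact at these heights
    belief = no_contact_update(belief, z)

support = [h for h, p in belief.items() if p > 0]
print(support)  # only heights below the lowest contact-free probe remain
```

Each probe shrinks the support of the belief, which is precisely the uncertainty-reduction behavior the submodular objective rewards.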

B. Adaptive Submodularity

In the case of a submodular and monotonic objective function, a greedy strategy is guaranteed to achieve near-optimal performance [14], [15]. Consequently, submodular optimization has recently been widely applied in machine learning [16] because of its efficiency and simplicity. Later, submodularity was generalized to adaptive planning, hence the name adaptive submodularity [9]. In this adaptive setting, the state is unobservable and observations are generated by actions. This framework is a special formulation of a POMDP in which the state is not influenced by actions. In other words, the transition probability is assumed to be p(s' | s, a) = \delta_s(s'). Below, we describe this framework in detail.

Assume that the true underlying state is fixed to be s; in our example these are the unknown parameters of the table. There is an observation function, also called a realization, \phi : A -> O. For instance, after executing an action we observe contacts with a part of the table. After choosing an action a \in A, an observation \phi(a) is observed. As the realization is initially unknown, we denote by \Phi a random realization. Analogous to maintaining the full history h \in H in the POMDP case, in adaptive submodularity we maintain a partial realization \psi \subseteq A x O, where (a, o) \in \psi if \phi(a) = o has previously been observed. Denote by dom(\psi) = {a : \exists o, (a, o) \in \psi} the domain of \psi. If a realization \phi and a partial realization \psi are equal over the whole domain of \psi, then \phi is said to be consistent with \psi, denoted \phi ~ \psi. If two partial realizations \psi_1 and \psi_2 are both consistent with \phi, and dom(\psi_1) \subseteq dom(\psi_2), then \psi_1 is said to be a subrealization of \psi_2. The random variable corresponding to a partial realization is \Psi. Summing up, we can write the posterior over the realization conditional on a partial realization as p(\phi | \psi) = p(\Phi = \phi | \Phi ~ \psi). This is similar to the belief rep