


Scene-level Pose Estimation for Multiple Instances of Densely Packed Objects

Chaitanya Mitash, Bowen Wen, Kostas Bekris, Abdeslam Boularias
Department of Computer Science

Rutgers University, New Jersey, USA

I. INTRODUCTION

Robotic pick-and-place is a critical task in many domains, such as warehouse logistics [1, 2], where many similar-looking, densely packed products must be manipulated. Bin-picking solutions typically integrate perception with planning [3–7], where the perception module is used for estimating 6D object poses. Some systems bypass this step and compute grasps directly from semantic segmentation or learned grasp affordances [3, 5, 8–10]. Such pose-agnostic techniques are appealing for their simplicity, but computing the poses of the observed objects is often necessary for purposeful manipulation or placement, such as in the context of packing [4, 11].

Estimating 6D object poses has been approached in various ways, such as by matching locally defined features [12, 13], matching pre-defined templates of object models [14, 15], or via voting in the local object frame using oriented point-pair features [16, 17]. A recent effort has compared several of these techniques on multiple public datasets [18]. Most of the methods were developed and evaluated for setups where each object appears once. Pose estimation for multiple instances of the same object has received less attention, despite its significance in application domains. The current work aims to bring pose estimation one step closer to real-world bin-picking applications by focusing on challenging cases, such as those illustrated in Figure 1.

The recent benchmarking effort [18] includes a single public dataset [19] with a considerable number of instances of the same object category. Even there, recall is measured as the success of estimating the pose of any single instance in the scene. While this may be sufficient for certain tasks, it is a weaker requirement than the objective of this work, which is to identify the 6D poses of most, if not all, object instances in the scene. Achieving such scene-level pose estimation allows a robot to internally simulate its environment and reason about the order in which objects can be manipulated, as well as their physical interactions. Scene-level reasoning can also fill in missing information by taking occlusions and physical interactions between objects into account.
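
As a minimal illustration of the difference, assuming an ADD-style pose distance and a greedy matching, neither of which is taken from this abstract's evaluation protocol, scene-level recall can be computed over all ground-truth instances rather than declared successful once any single instance is found:

```python
import numpy as np

def pose_distance(T_est, T_gt, model_pts):
    """ADD-style distance: mean displacement of model points between the
    estimated and ground-truth rigid transforms (symmetric objects would
    need an ADI-style closest-point variant instead)."""
    p_est = model_pts @ T_est[:3, :3].T + T_est[:3, 3]
    p_gt = model_pts @ T_gt[:3, :3].T + T_gt[:3, 3]
    return np.mean(np.linalg.norm(p_est - p_gt, axis=1))

def scene_level_recall(est_poses, gt_poses, model_pts, thresh):
    """Greedily match each estimate to the closest unmatched ground-truth
    instance; recall is over ALL instances, unlike any-instance recall,
    which succeeds as soon as one instance is matched."""
    unmatched = list(range(len(gt_poses)))
    hits = 0
    for T_est in est_poses:
        if not unmatched:
            break
        dists = [pose_distance(T_est, gt_poses[j], model_pts) for j in unmatched]
        best = int(np.argmin(dists))
        if dists[best] < thresh:
            hits += 1
            unmatched.pop(best)
    return hits / len(gt_poses)
```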

II. APPROACH

Fig. 1. Scenes with multiple instances of densely packed objects and the joint 6D poses returned by the proposed approach.

The current work presents a novel pipeline that takes an RGB-D image and returns the 6D poses of all object instances in it. The proposed approach is summarized in Figure 2. At a high level, it operates in four stages: a) at the image level, CNNs are used to detect the semantic object classes and the visible boundaries of individual instances; b) given this information, a large set of candidate 6D pose hypotheses is generated for each object class; c) quality scores are computed for each hypothesis; and d) a scene-level pose optimization identifies a consistent subset of poses that maximizes the sum of their scores. The proposed methodology provides a sequence of contributions over existing state-of-the-art methods.
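
At the code level, the four stages could be composed as in the following sketch; the stage functions are injected placeholders for the components detailed below, not the authors' implementation:

```python
def estimate_scene_poses(rgb, depth, object_models,
                         predict_labels,             # stage (a): FCN inference
                         generate_hypotheses,        # stage (b): hypothesis generation
                         score_hypothesis,           # stage (c): learned quality score
                         select_consistent_subset):  # stage (d): ILP selection
    """Sketch of the four-stage pipeline; each stage is passed in as a
    callable so the skeleton stays agnostic to concrete models."""
    # (a) Per-pixel semantic classes and visible instance boundaries.
    semantics, boundaries = predict_labels(rgb)
    # (b) A large, dispersed set of 6D pose hypotheses per object class.
    hypotheses = []
    for cls, model in object_models.items():
        hypotheses += generate_hypotheses(depth, semantics, boundaries, cls, model)
    # (c) A quality score for every hypothesis.
    scores = [score_hypothesis(h, semantics, boundaries, depth) for h in hypotheses]
    # (d) A consistent subset of poses maximizing the summed scores.
    return select_consistent_subset(hypotheses, scores)
```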

A. Adversarial Training to Adapt Synthetic Labeling: Data-driven approaches have become popular in pose estimation, both for end-to-end learning [20, 21] and as a component of a pose estimation pipeline [22, 23]. A limitation of such approaches is the need for large amounts of labeled data. To alleviate this issue, recent approaches aim to solve pose estimation by training entirely in simulation [23–26], even though the focus has not been on the multi-instance case. Nevertheless, it is well understood that CNNs are sensitive to the domain gap between synthetic and real data. Following this line of research, the proposed method uses labeled data generated in simulation together with unlabeled real images to train a CNN that predicts semantic labels of object classes and visible object boundaries. The simulator that generates the training data mimics the placement of objects, surfaces, and cameras appearing in test scenarios.


Fig. 2. Overview of the proposed approach: an FCN predicts semantic labels and instance boundaries from the input image; scene sampling and congruent-set matching generate pose hypotheses; a gradient boosted tree regressor predicts the quality of each hypothesis from model and scene features; and scene-level pose selection maximizes the sum of hypothesis scores subject to collision constraints.

This results in alignment of the distributions of semantic and boundary labels between synthetic and unlabeled real images. A generative adversarial network (GAN) [27, 28] exploits this alignment to regularize the training of a fully convolutional network (FCN) that predicts per-pixel semantic and boundary labels. Thus, by modeling certain physical aspects of the test scenarios in simulation and applying adversarial training for output-space alignment, the proposed training process achieves good predictions on real data.
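
A minimal sketch of one such adversarial update, in the spirit of the output-space alignment of [28] and assuming PyTorch with placeholder `fcn` and `disc` networks (the actual architectures and loss weighting are not specified here), could look as follows:

```python
import torch
import torch.nn.functional as F

def adapt_step(fcn, disc, opt_f, opt_d, syn_img, syn_lbl, real_img, lam=0.001):
    """One output-space adversarial alignment step: supervised loss on
    synthetic data, plus an adversarial loss pushing the FCN's predictions
    on real images toward the synthetic output distribution. `fcn` maps
    images to per-pixel logits; `disc` scores softmax maps as synthetic."""
    # 1) Supervised segmentation loss on labeled synthetic images.
    syn_out = fcn(syn_img)
    loss_seg = F.cross_entropy(syn_out, syn_lbl)

    # 2) Adversarial loss: make real-image outputs look "synthetic" to the
    #    discriminator (label 1 = synthetic domain).
    real_out = fcn(real_img)
    d_real = disc(F.softmax(real_out, dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    opt_f.zero_grad()
    (loss_seg + lam * loss_adv).backward()
    opt_f.step()

    # 3) Discriminator update: distinguish synthetic vs. real output maps.
    d_syn = disc(F.softmax(syn_out.detach(), dim=1))
    d_rl = disc(F.softmax(real_out.detach(), dim=1))
    loss_d = (F.binary_cross_entropy_with_logits(d_syn, torch.ones_like(d_syn))
              + F.binary_cross_entropy_with_logits(d_rl, torch.zeros_like(d_rl)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
```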

B. RGB Boundary Detection for Instance Identification: The method relies on instance boundaries detected in RGB images to guide and constrain the search for 6D poses. This is in contrast to previous methods that rely solely on depth maps to detect boundaries for pose estimation [15, 29]. Current depth sensors cannot detect boundaries when objects of the same type are packed next to each other, as in the setup displayed in Figure 1. Thus, the boundaries are predicted with a network trained in the same way as the semantic network for object classes, including the adversarial training process. These boundaries, combined with depth-based features, provide strong cues for scene-level pose optimization.

C. Pointset Samples from Object+Boundary Predictions: The stochastic output of the semantic and instance-boundary networks is used to sample sets of points in the RGB-D data so that they belong, with high probability, to a single instance. The sampled point sets are then matched to congruent sets on the object model so as to generate a large but dispersed set of pose hypotheses for each object. This matching process is often used for global point registration [23, 30, 31]. The key features of this work are that boundary predictions are used to limit the selection of points to a single object instance, and that the sampling is adaptive, enforcing dispersion so as to cover all instances of the same object.
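
The following simplified sketch (illustrative, not the exact stochastic congruent sets procedure of [23]) shows how boundary predictions can constrain the sampling: a base of k pixels is accepted only if every connecting segment stays clear of predicted instance boundaries, making it likely that all points lie on one instance:

```python
import numpy as np

def crosses_boundary(p, q, boundary_mask, n_steps=50):
    """True if the image segment from pixel p to pixel q crosses a
    predicted instance boundary (sampled along the line)."""
    for t in np.linspace(0.0, 1.0, n_steps):
        r, c = np.round(p + t * (q - p)).astype(int)
        if boundary_mask[r, c]:
            return True
    return False

def sample_instance_base(semantic_mask, boundary_mask, rng, k=4, tries=100):
    """Sample k pixels that, with high probability, lie on a single
    instance: all pixels share the semantic class and no pair is separated
    by a predicted boundary. A dispersion step (not shown) would down-weight
    regions already covered by earlier samples."""
    ys, xs = np.nonzero(semantic_mask)
    pix = np.stack([ys, xs], axis=1)
    for _ in range(tries):
        base = pix[rng.choice(len(pix), size=k, replace=False)]
        if all(not crosses_boundary(base[i], base[j], boundary_mask)
               for i in range(k) for j in range(i + 1, k)):
            return base  # feed to congruent-set matching on the object model
    return None
```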

D. Learning the Quality of Individual Pose Hypotheses: One key issue in pose estimation is evaluating how good a candidate pose hypothesis is given the available data. This is achieved by considering a set of heterogeneous objectives, such as how well a candidate hypothesis explains the predicted object segments, the predicted boundaries, and the observed depth and local surface normals in the input data. Such multi-objective optimization is typically solved by weighing the objectives and summing them into a single one.

TABLE I
COMPARING THE SUCCESS RATES OF DIFFERENT RELATED TECHNIQUES

Approaches                    Soap    Toothpaste  All
OURS                          0.799   0.852       0.821
Mask-RCNN [37] + StoCS [23]   0.428   0.683       0.537
LCHF [19]                     0.162   0.443       0.283
PPF-Voting [16]               0.301   0.576       0.419
Hinterstoisser et al. [14]    0.370   0.656       0.493

Instead of relying on manually tuned weights, this work shows that it is possible to learn the distance of a candidate pose from a ground-truth one by using the above objectives as features. A gradient boosted tree [32] is trained to automatically integrate these objectives and regress the distance to the closest ground-truth pose.
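
As a minimal sketch of this step, assuming scikit-learn's GradientBoostingRegressor and stand-in random data in place of the real per-hypothesis features and ground-truth distances (the actual feature set and hyperparameters are not specified in this abstract):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Stand-in features: one row per hypothesis, one column per objective
# (segment agreement, boundary agreement, depth residual, normal consistency).
X_train = rng.random((5000, 4))
y_train = rng.random(5000)  # stand-in: distance to the closest ground-truth pose

regressor = GradientBoostingRegressor(n_estimators=200, max_depth=3)
regressor.fit(X_train, y_train)

# At test time, a lower predicted distance means a higher-quality hypothesis.
hypothesis_quality = -regressor.predict(rng.random((10, 4)))
```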

E. Scene-level Optimization for Selecting Poses: Finding the best combination of poses among the candidates is formulated as a combinatorial, constrained optimization problem and solved with an integer linear programming (ILP) solver. The objective is to select a subset of hypotheses that maximizes the sum of their individual scores while respecting constraints, such as avoiding perceived collisions between the poses. Scene-level optimization has previously been approached as maximizing a weighted sum of various geometric features [33], where the weights characterizing the objective function were carefully handcrafted. Other works used combinatorial search to reconstruct the scene in simulation by sequentially placing objects in the order of their occlusions [34, 35] or their physical dependencies [36]. These approaches have high computational cost unless the poses are restricted to a 3-DoF space [34]. For images such as those in Figure 1, the scene-level optimization using ILP completes in a few milliseconds.
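
A minimal sketch of the selection step, assuming the PuLP modeling library and representing perceived collisions as pairwise exclusion constraints (additional constraints would enter the same way):

```python
import pulp

def select_poses(scores, collisions):
    """Select a consistent subset of pose hypotheses: maximize the sum of
    individual scores subject to collision constraints. `scores[i]` is
    hypothesis i's quality; `collisions` is a list of (i, j) index pairs
    whose poses interpenetrate and therefore cannot coexist."""
    prob = pulp.LpProblem("scene_level_pose_selection", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(scores))]
    # Objective: total score of the selected hypotheses.
    prob += pulp.lpSum(scores[i] * x[i] for i in range(len(scores)))
    # Colliding hypotheses are mutually exclusive.
    for i, j in collisions:
        prob += x[i] + x[j] <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(len(scores)) if pulp.value(x[i]) > 0.5]
```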

F. Dataset Generation and Evaluation in Difficult Setups: Prior pose estimation datasets focused on objects with different geometries placed on a table top, such as LINEMOD [14] and YCB-Video [20]. Few datasets contain several instances of the same object [19], and even in that case the piles of objects are easy to segment from depth data. Hence, this work created a new benchmark that presents challenging scenarios for multi-instance pose estimation, such as aligned surfaces, textureless objects, and high levels of occlusion. In such cases, standard feature-based techniques fail. The empirical results in Table I show that the proposed pipeline outperforms recent methods on the considered, challenging scenes.

III. DISCUSSION

This work addresses hard cases of multi-instance pose estimation, including highly cluttered and densely packed scenarios. The achieved results show that the type of poses used in simulation for training the semantic and boundary networks is important. While a simulation-to-reality domain gap exists, it can be bridged with appropriate modeling in simulation and adversarial training strategies. Furthermore, the stochastic sampling process and the ILP formulation of scene-level reasoning find combinations of hypotheses that are both consistent and of high quality given the learned global score function.


BIBLIOGRAPHY

[1] N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Osada, A. Rodriguez, J. Romano, and P. Wurman. Analysis and Observations From the First Amazon Picking Challenge. IEEE Trans. on Automation Science and Engineering (T-ASE), 2016.

[2] Clemens Eppner, Sebastian Höfer, Rico Jonschkowski, Roberto Martín-Martín, Arne Sieverling, Vincent Wall, and Oliver Brock. Lessons from the Amazon Picking Challenge: Four aspects of building robotic systems. In Robotics: Science and Systems, 2016.

[3] Douglas Morrison, Adam W. Tow, M. McTaggart, R. Smith, N. Kelly-Boxall, S. Wade-McCue, J. Erskine, R. Grinover, A. Gurman, T. Hunn, et al. Cartman: The low-cost Cartesian manipulator that won the Amazon Robotics Challenge. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7757–7764. IEEE, 2018.

[4] Max Schwarz, Christian Lenz, Germán Martín García, Seongyong Koo, Arul Selvam Periyasamy, Michael Schreiber, and Sven Behnke. Fast object learning and dual-arm coordination for cluttered stowing, picking, and packing. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3347–3354. IEEE, 2018.

[5] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, François R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.

[6] Andy Zeng, Kuan-Ting Yu, Shuran Song, Daniel Suo, Ed Walker, Alberto Rodriguez, and Jianxiong Xiao. Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1383–1386. IEEE, 2017.

[7] Carlos Hernandez, Mukunda Bharatheesha, Wilson Ko, Hans Gaiser, Jethro Tan, Kanter van Deurzen, Maarten de Vries, Bas Van Mil, Jeff van Egmond, Ruben Burger, et al. Team Delft's robot winner of the Amazon Picking Challenge 2016. In Robot World Cup, pages 613–624. Springer, 2016.

[8] M. Gualtieri, A. Ten Pas, K. Saenko, and R. Platt. Grasp Pose Detection in Point Clouds. International Journal of Robotics Research (IJRR), 36(13-14):1455–1473, 2017.

[9] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems, 2017.

[10] M. Gualtieri, A. Ten Pas, and R. Platt. Pick and Place without Geometric Object Models. In IEEE International Conference on Robotics and Automation (ICRA), 2018.

[11] Rahul Shome, Wei N. Tang, Changkyu Song, Chaitanya Mitash, Chris Kourtev, Jingjin Yu, Abdeslam Boularias, and Kostas Bekris. Towards robust product packing with a minimalistic end-effector. In Robotics and Automation (ICRA), 2019 IEEE International Conference on. IEEE, 2019.

[12] A. Collet, M. Martinez, and S. Srinivasa. The MOPED framework: Object Recognition and Pose Estimation for Manipulation. International Journal of Robotics Research (IJRR), 30(10):1284–1306, 2011.

[13] Aitor Aldoma, Zoltan-Csaba Marton, Federico Tombari, Walter Wohlkinger, Christian Potthast, Bernhard Zeisl, Radu Bogdan Rusu, Suat Gedikli, and Markus Vincze. Tutorial: Point Cloud Library: Three-dimensional object recognition and 6 DoF pose estimation. IEEE Robotics & Automation Magazine, 19(3):80–91, 2012.

[14] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pages 548–562. Springer, 2012.

[15] Tomas Hodan, Xenophon Zabulis, Manolis Lourakis, Stepan Obdrzalek, and Jiří Matas. Detection and fine 3D pose estimation of texture-less objects in RGB-D images. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 4421–4428. IEEE, 2015.

[16] B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model Globally, Match Locally: Efficient and Robust 3D Object Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 998–1005, 2010.

[17] Joel Vidal, Chyi-Yeu Lin, and Robert Martí. 6D pose estimation using an improved method based on point pair features. In 2018 4th International Conference on Control, Automation and Robotics (ICCAR), pages 405–409. IEEE, 2018.

[18] Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, et al. BOP: Benchmark for 6D object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.

[19] Andreas Doumanoglou, Rigas Kouskouridas, Sotiris Malassiotis, and Tae-Kyun Kim. Recovering 6D object pose and predicting next-best-view in the crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3583–3592, 2016.

[20] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In Robotics: Science and Systems, 2018.

[21] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the International Conference on Computer Vision (ICCV 2017), Venice, Italy, pages 22–29, 2017.

[22] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6D object pose estimation using 3D object coordinates. In European Conference on Computer Vision, pages 536–551. Springer, 2014.

[23] Chaitanya Mitash, Abdeslam Boularias, and Kostas Bekris. Robust 6D object pose estimation with stochastic congruent sets. In British Machine Vision Conference, 2018.

[24] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield. Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects. In Conference on Robot Learning (CoRL), 2018.

[25] Chaitanya Mitash, Kostas E. Bekris, and Abdeslam Boularias. A self-supervised learning system for object detection using physics simulation and multi-view pose estimation. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 545–551. IEEE, 2017.

[26] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. In The European Conference on Computer Vision (ECCV), 2018.

[27] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[28] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[29] J. Vidal, C.-Y. Lin, X. Lladó, and R. Martí. A Method for 6D Pose Estimation of Free-Form Rigid Objects Using Point Pair Features on Range Data. Sensors, 18(8):2678, 2018.

[30] N. Mellado, D. Aiger, and N. J. Mitra. Super 4PCS: Fast Global Pointcloud Registration via Smart Indexing. In Computer Graphics Forum, volume 33, pages 205–215. Wiley Online Library, 2014.

[31] J. Huang, T.-H. Kwok, and C. Zhou. V4PCS: Volumetric 4PCS Algorithm for Global Registration. Journal of Mechanical Design, 139(11), 2017.

[32] Jane Elith, John R. Leathwick, and Trevor Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802–813, 2008.

[33] Aitor Aldoma, Federico Tombari, Luigi Di Stefano, and Markus Vincze. A global hypotheses verification method for 3D object recognition. In European Conference on Computer Vision. Springer, 2012.

[34] V. Narayanan and M. Likhachev. Discriminatively-guided Deliberative Perception for Pose Estimation of Multiple 3D Object Instances. In Robotics: Science and Systems, 2016.

[35] Zhiqiang Sui, Zheming Zhou, Zhen Zeng, and Odest Chadwicke Jenkins. SUM: Sequential scene understanding and manipulation. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 3281–3288. IEEE, 2017.

[36] Chaitanya Mitash, Abdeslam Boularias, and Kostas E. Bekris. Improving 6D pose estimation of objects in clutter via physics-aware Monte Carlo tree search. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.

[37] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.