
Object 6DoF Pose Estimation of Power Grid Manipulating Robot*

Du Shan 1[0000-0002-1371-0381], Zhang Xiaoye 1,2,3[0000-0002-3354-849X], Li Zhongliang 1[0000-0002-9987-5901], Yue Jingpeng 2[0000-0002-6270-1246], and Zou Qin 1[0000-0001-7955-0782]

1 School of Computer Science, Wuhan University, Wuhan, China
2 Electric Power Research Institute of Guangdong Power Grid Co., Ltd., China
3 China Southern Power Grid Technology Co., Ltd., Guangzhou, China

[email protected], [email protected], [email protected], jp [email protected], [email protected]

Abstract. This paper introduces a six degree-of-freedom (6DoF) pose estimation method to construct a robust machine-vision system for manipulating robots. Generally, the 2DoF results generated by traditional object detectors cannot meet the requirements of manipulating operations, where both the position and posture of targets are needed. Meanwhile, due to its sensitivity to light and its limited range, the depth sensor of an RGB-D camera cannot always be relied on. To overcome these challenges, we study 6DoF pose estimation from a single RGB image. To reduce complexity and computation, we divide the task into four stages, i.e., data collection and pre-processing, instance segmentation, keypoint prediction, and 2D-to-3D projection. We build the model with deep neural networks and test it in practical manipulating tasks. The experimental results demonstrate the high accuracy and practicality of our method.

Keywords: 6DoF pose estimation · instance segmentation · deep learning · manipulating robot

1 Introduction

In a power grid system, the maintenance of transmission lines is the most common and representative operation. It is labor-intensive and dangerous work, which causes several fatal accidents every year. With the development and application of robot technologies, this problem is expected to be solved gradually. Robot technology can improve the efficiency of equipment operation and maintenance, enhance the safety of personnel, and increase the reliability of power grid operations.

* This research was funded by the China Postdoctoral Science Foundation under grant 2020M672529, and the China Southern Power Grid Science and Technology Project under grants GDKJXM20192276, GDKJXM20184840 and NYJS2020KJ005-12. (Corresponding author: Qin Zou)


Nowadays, a number of studies have been conducted on electric power operation robots [19, 17, 4, 5, 12, 3]. For example, the Canadian Hydropower Research Institute of Quebec has developed an overhead line inspection and repair robot capable of identifying and repairing single-conductor defects; Japan's HiBot company has developed a conductor robot with a non-contact inspection function for internal and external conductor damage; and the University of Washington has developed a cable tunnel detection robot that can recognize fault locations. Besides these efforts abroad, many studies have also been done in China. For example, Wuhan University, the Shenyang Institute of Automation of the Chinese Academy of Sciences, Beihang University, Guangdong Power Grid, and others have successively developed 110 kV and 500 kV overhead transmission line inspection robots, as well as 110 kV and 220 kV robots for single-conductor tasks such as tension drainage clamp bolt fastening, vibration prevention hammer resetting, broken strand repairing, and suspension insulator disassembly and installation. The Shenzhen Power Supply Bureau, Zhejiang Guozi Robot, Shenzhen Langchi, and other organizations have developed a variety of cable tunnel inspection robots. The Shandong Electric Power Research Institute and the Shenyang Institute of Automation of the Chinese Academy of Sciences have respectively developed robots for live work on distribution lines of 10 kV and below.

However, current robots for distribution lines have not achieved the goal of replacing humans with machines [21, 10], since machines of low-level intelligence have poor environmental adaptability. One of the main technical difficulties is environmental perception and task positioning. Therefore, this paper focuses on this problem: a perception system for the 6DoF pose estimation of the electric manipulating robot.

Object pose estimation has always been a research hotspot. For example, some methods are based on scale-invariant feature matching, which is suitable for targets with rich local texture features [13, 11, 6, 22]. However, when feature textures are poor, these template-matching or dense feature learning methods show shortcomings: 1) they are sensitive to illumination and occlusion; 2) they are cumbersome and time-consuming. Some recent methods [14, 16, 18] use CNNs to regress 2D keypoints and then use the Perspective-n-Point (PnP) algorithm to calculate the 6D pose parameters. Although CNNs can predict invisible keypoints by memorizing similar patterns, these methods also suffer from low robustness, especially when the actual scene changes or the target itself exhibits noise or illumination changes. In such conditions, the prediction ability of these methods degrades significantly [8, 15].

This paper studies how to use machine vision to estimate the 6DoF pose of a target in an open environment. It extends the theory of deep learning-based pose estimation, studies instance segmentation and keypoint prediction, and recovers the target from a 2D view to a 3D pose. A framework for machine vision-based 6DoF pose estimation is designed, which builds the model on a deep neural network and a direction vector field, and realizes both image segmentation and pose estimation. It aims to provide practical solutions and technical support for manipulating robots working in the field.

2 6DoF Pose Estimation

This paper focuses on manipulating robots for electrical distribution lines, and tries to estimate the pose in all directions. The core difficulty lies in the outdoor operating environment. The commonly used 3DoF position coordinates are not sufficient for manipulation, so 6DoF is needed. 6DoF comprises the spatial coordinates x, y, z and the angle information α, β, γ. This information must be recognized to determine the angle and posture with which the robotic arm performs correct operations on the target; a minimal parameterization sketch follows.
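
For concreteness, the six parameters can be packed into a single homogeneous transform. The sketch below assumes an x-y-z Euler angle convention, which the paper does not specify:

```python
# A minimal sketch of a 6DoF pose container, assuming an 'xyz' Euler
# convention (the paper does not state one).
import numpy as np
from scipy.spatial.transform import Rotation


def pose_matrix(x, y, z, alpha, beta, gamma):
    """Build a 4x4 homogeneous transform from 6DoF parameters.

    (x, y, z) is the translation; (alpha, beta, gamma) are Euler
    angles in radians about the x, y, z axes (assumed order).
    """
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", [alpha, beta, gamma]).as_matrix()
    T[:3, 3] = [x, y, z]
    return T
```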

To reduce complexity, our method decomposes the 6DoF pose estimation task into four sub-tasks and solves them independently. The overall workflow is as follows. First, in the pre-processing stage, a data augmentation process is performed on the target, which generates more data and improves the robustness and generalization ability of the network. Next, an instance segmentation network is used to obtain all known objects in the image. Then, for each object, a designed network with skip connections is used to estimate the 2D object coordinates. Finally, the PnP method is used to estimate the 6DoF pose of the target in the 3D environment.

2.1 Pre-processing Stage

The training of deep neural networks requires a large amount of data, and the number of samples directly affects the recognition performance of the model. When the dataset is small, the trained model is prone to overfitting, which degrades performance. Generally speaking, obtaining a massive dataset is a prerequisite for training and the key to a good training result. When a large number of samples cannot be obtained easily, data augmentation is essential for training models, especially for dense estimation tasks. In addition, data augmentation can help reduce dataset bias and introduce novel examples for deep model training. One direct way to perform data augmentation is to cut the target object out of the existing limited dataset and paste it onto a random background. This paper adopts such a technical solution.

This paper uses the original RGB image and mask data to extract the object, transfers it to other scenes, and applies scaling, offsets, and other processing at the same time to achieve data augmentation. Considering that the target is rigid and our task is to estimate the 6DoF pose, our data augmentation does not include stretching or flipping: without considering camera distortion, the predicted target will not undergo deformations such as stretching, and flipping would introduce wrong frame-corner data, so it is abandoned as well.

Page 4: Object 6DoF Pose Estimation of Power Grid Manipulating Robot

4 Du et al.

In most cases, the target may be at any angle and pose. Therefore, during training, the angle and pose data must be correlated effectively so that all generated data can be predicted. The main point is the association of keypoints: each keypoint needs to stay in a fixed position relative to the target, so that targets remain associated with the corresponding keypoints even after the target is transformed, as illustrated in the sketch below.
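
A minimal sketch of this cut-and-paste augmentation, respecting the constraints above (scale and offset only, no stretch or flip); all function and variable names are illustrative, not the authors' pipeline:

```python
import cv2
import numpy as np


def paste_augment(obj_rgb, obj_mask, keypoints, background, rng):
    """Cut a masked object out, rescale/offset it, paste it onto a new
    background, and transform the 2D keypoints consistently.

    obj_rgb:    HxWx3 source image containing the object
    obj_mask:   HxW binary mask of the object
    keypoints:  Nx2 array of (x, y) keypoint coordinates
    background: HxWx3 target background of the same size
    rng:        numpy random Generator
    """
    h, w = obj_mask.shape
    s = rng.uniform(0.8, 1.2)                 # random scale (no stretch/flip)
    tx = rng.uniform(-0.1, 0.1) * w           # random offset
    ty = rng.uniform(-0.1, 0.1) * h
    M = np.float32([[s, 0, tx], [0, s, ty]])  # similarity transform

    warped = cv2.warpAffine(obj_rgb, M, (w, h))
    warped_mask = cv2.warpAffine(obj_mask.astype(np.uint8), M, (w, h))

    out = background.copy()
    out[warped_mask > 0] = warped[warped_mask > 0]

    # Apply the same transform to the keypoints so they remain
    # associated with the object after augmentation.
    kps = np.hstack([keypoints, np.ones((len(keypoints), 1))])
    new_kps = (M @ kps.T).T
    return out, warped_mask, new_kps
```

Note that the same similarity transform M is applied to both the image and the keypoints, which is what keeps the keypoints attached to the transformed target.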

2.2 Instance Segmentation

The goal of instance segmentation is to classify and segment all known objects in the image. In this stage, we use the multi-task network cascade (MNC) [2] proposed by Dai et al., which handles instance-level semantic segmentation tasks with a CNN [1]. In traditional multi-task methods, all tasks are performed simultaneously on top of shared features; the tasks do not interfere with each other and are completed independently. In MNC's cascade, by contrast, the output of the previous task is used as the input of the next task, forming, on top of the shared features, the cascading network shown in Fig. 1.

Fig. 1: Schematic diagram of instance segmentation. Shared features feed three cascaded stages: bounding-box regression (B, loss L1), mask regression (M, loss L2), and instance classification (C, loss L3).

The shared feature map is learned from the first 13 layers of VGG-16. Each row in Fig. 1 represents a task. Each task has its own loss, and the loss of a later task is also affected by the earlier tasks. MNC's backpropagation algorithm can train the whole model: although each stage's loss depends on the previous stage and the ROI process involves a spatial transformation, MNC has a network layer with differentiable spatial coordinates, which keeps the gradient computable.

1) Regressing the Bounding Box. In the first stage, the network proposes instance boxes without categories. An RPN predicts the positions of the bounding boxes and the objectness scores in a fully convolutional manner, using a 3x3 convolution for dimension reduction and two 1x1 convolutions for classification and regression. The loss at this stage is as follows:

L1 = L1(B(Θ)) (1)

where Θ represents the parameters to be optimized in the network and B represents the output of the first stage, B = {Bi} with Bi = {xi, yi, wi, hi, pi}: (xi, yi) is the center of the bounding box, wi its width, hi its height, and pi its score.

2) Segmentation at the Pixel Level. Pixel-level mask segmentation is performed on the proposal of each bounding box. For each box generated in the first stage, ROI pooling is used to extract features; ROI pooling generates fixed-length features from a feature input of any size. Two FC (fully connected) layers are added after each bounding box: the first FC reduces the dimension to 256, and the second generates a pixel-level mask regression. The loss at this stage is as follows:

L2 = L2(M(Θ)|B(Θ)) (2)

M is the output of this stage, representing a series of masks: M = {Mi}, where Mi is the output of a logistic regression of dimension m × m. As mentioned earlier, this stage is affected by the first stage.
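
A minimal PyTorch sketch of this mask branch is given below; the mask resolution m = 28 is a placeholder, since the paper leaves m unspecified:

```python
import torch
import torch.nn as nn


class MaskBranch(nn.Module):
    """Stage-2 mask head as described above: two FC layers after ROI
    pooling, the first reducing to 256 dimensions, the second producing
    an m x m mask (a sketch, not the MNC implementation)."""

    def __init__(self, roi_feat_dim, m=28):
        super().__init__()
        self.fc1 = nn.Linear(roi_feat_dim, 256)
        self.fc2 = nn.Linear(256, m * m)
        self.m = m

    def forward(self, roi_feats):        # roi_feats: (num_rois, roi_feat_dim)
        x = torch.relu(self.fc1(roi_feats))
        logits = self.fc2(x)             # per-pixel logistic regression
        return logits.view(-1, self.m, self.m)
```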

3) Instance Classification. Given the boxes from the first stage, feature extraction is carried out again, and the mask estimates from the second stage are used for binarization. The loss in the third stage is as follows:

L3 = L3(C(Θ)|M(Θ), B(Θ)) (3)

where C(·) is the predicted category of the object. Therefore, the total loss of instance segmentation is as follows:

Lm(Θ) = L1(B(Θ)) + L2(M(Θ)|B(Θ)) + L3(C(Θ)|M(Θ), B(Θ)) (4)
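
Structurally, Eq. (4) sums three stage losses computed in sequence, each conditioned on the previous stage's output. The sketch below shows only this structure; all modules and loss functions are placeholders, not the MNC implementation:

```python
def mnc_loss(images, targets, backbone, stage1, stage2, stage3):
    """Structural sketch of Eq. (4): L_m = L1(B) + L2(M|B) + L3(C|M,B).

    Each stage takes the shared features plus the previous stage's
    output, so gradients flow back through the whole cascade.
    """
    feats = backbone(images)                       # shared VGG-16 features
    boxes, l1 = stage1(feats, targets)             # box regression, Eq. (1)
    masks, l2 = stage2(feats, boxes, targets)      # mask regression, Eq. (2)
    _, l3 = stage3(feats, boxes, masks, targets)   # classification, Eq. (3)
    return l1 + l2 + l3                            # total loss, Eq. (4)
```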

2.3 Keypoints Prediction

The core of 6DoF target pose estimation is to predict the corners of the target's 3D bounding box, that is, the keypoints. In this paper, each point on the target mask is used to predict the locations of the keypoints, forming several direction vectors, and the keypoint locations are then obtained through an optimization algorithm.

More specifically, we perform two tasks: semantic segmentation and vector field prediction. For a pixel p, the output is the semantic label associated with a specific object and a vector vk(p) representing the direction from pixel p to the 2D keypoint x(k) of the target. The vector vk(p) is the offset between the pixel p and the keypoint x(k), namely x(k) − p. Using the semantic labels and offsets, we obtain the target object's pixels and add the offsets to generate a set of keypoint hypotheses.

Therefore, the position of the keypoint is where the summed offset vk(p) reaches its minimum:

Lk = min(∑p vk(p)) (5)

However, due to the interference of imprecise masks and other factors, there will be outliers among the predicted points. Therefore, in the post-processing stage, this paper introduces a particle swarm optimization (PSO) algorithm to remove outliers before predicting the positions of the corner points.

Lk = min(∑p PSO(vk(p))) (6)

Lk is the loss of the keypoint prediction network. Therefore, the final training loss function is:

Lall = Lm + λ · Lk (7)

Here, λ balances the weights of the multi-task loss functions; we set λ = 0.5 in this work. The voting procedure is sketched below.
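
The following sketch illustrates the per-pixel voting described above: each mask pixel p casts a keypoint hypothesis p + vk(p). For outlier suppression we substitute a simple coordinate-wise median as a stand-in robust estimator, whereas the paper uses PSO (Eq. 6); all names are ours:

```python
import numpy as np


def keypoint_from_votes(mask, vectors):
    """Aggregate per-pixel offset votes into one keypoint location.

    mask:    HxW boolean object mask
    vectors: HxWx2 predicted offsets v_k(p) = x(k) - p for one keypoint
    Returns the (x, y) keypoint estimate.
    """
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float32)   # p
    hypotheses = pixels + vectors[ys, xs]                    # p + v_k(p)
    # Stand-in robust aggregation; the paper removes outliers with PSO.
    return np.median(hypotheses, axis=0)
```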

2.4 2D-to-3D Projection

The Perspective-n-Point (PnP) problem [9, 20] is as follows: given n matching pairs between 3D reference points in space and their 2D projections on the camera image, with the coordinates of the 3D points known in the world coordinate system and those of the 2D points known in the image coordinate system, calculate the relative position and pose of the camera and the object.

PnP can calculate the coordinates of the corresponding points in the camera coordinate system from the 2D pixel coordinates of feature points in a single frame and their corresponding 3D space coordinates, as shown in Fig. 2. This yields the coordinate transformation of the target relative to the 3D space coordinate system, from which the position of any point of the target in the camera coordinate space can be obtained. Therefore, this paper uses the PnP method to estimate the 6DoF pose of the target in the 3D world coordinate system. The 3D target pose is estimated for each frame of the collected images, and the final target pose is fitted from multiple sets of estimation results in order to improve the accuracy of pose estimation.
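
A minimal sketch of this step using OpenCV's solvePnP; the EPnP solver flag follows reference [9], though the paper does not state which solver it actually uses:

```python
import cv2
import numpy as np


def estimate_pose_pnp(object_pts, image_pts, K, dist=None):
    """Recover the 6DoF pose from 2D-3D correspondences with OpenCV.

    object_pts: Nx3 keypoint coordinates in the object/world frame
                (here, the 8 corners of the 3D bounding box)
    image_pts:  Nx2 predicted 2D keypoints
    K:          3x3 camera intrinsic matrix
    Returns a 3x3 rotation matrix and a 3x1 translation vector.
    """
    ok, rvec, tvec = cv2.solvePnP(
        object_pts.astype(np.float32),
        image_pts.astype(np.float32),
        K, dist, flags=cv2.SOLVEPNP_EPNP)   # EPnP, as in [9] (our choice)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)              # axis-angle -> rotation matrix
    return R, tvec
```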

3 Experiments

The experiments are divided into three parts. The first part describes how the power-system clamp dataset is made. In the second part, we use the open-source Linemod dataset to verify the algorithm. In the third part, we perform 6DoF pose estimation experiments on the clamp dataset.


Fig. 2: 2D-to-3D projection sketch map

3.1 Design of the Clamp Dataset

Data is an important part of any deep learning algorithm, so we need to build a 6DoF dataset for the clamp used in the power system. We combine 2D and 3D information: 3D data is constructed from 2D data, the 3D information is then projected back into 2D, and finally the 2D labels are constructed. Specifically, QR codes are first placed around the object for which the dataset is to be built, and video is recorded to collect image data. Then, using the location information of the QR codes, the 3D pose of the camera is calculated, and the transform matrices between multiple images are computed. Using these matrices and the image data, a 3D model can be generated. The keypoints and mask information of the target object are determined in the 3D model. Finally, the 3D data is projected into 2D data, completing the production of the dataset.

Fig. 3: 2D image collection and 3D construction (samples of the original RGB image series, and the 3D reconstruction result)

In this experiment, we placed ArUco QR codes around the object and collected 1050 images; some examples are shown in Fig. 3. Next, the 3D data is generated from the QR-code information and an image matching algorithm. Then, according to the clamp model scanned by a 3D scanner, as shown in Fig. 4, the frame-corner data of the clamp is mapped and calculated. A sketch of the marker-based camera pose recovery is given below.
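
A minimal sketch of recovering the camera pose from the ArUco markers in one frame, assuming the classic opencv-contrib ArUco API; the marker dictionary and side length are illustrative assumptions:

```python
import cv2


def camera_pose_from_markers(image, K, dist, marker_len=0.05):
    """Estimate camera pose relative to ArUco markers in one frame.

    image:      input BGR frame
    K, dist:    camera intrinsics and distortion coefficients
    marker_len: marker side length in metres (assumed value)
    """
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    corners, ids, _ = cv2.aruco.detectMarkers(image, dictionary)
    if ids is None:
        return None
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_len, K, dist)
    return ids, rvecs, tvecs   # one pose per detected marker
```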


Afterwards, we apply image augmentation combined with the 3D scanned model of the clamp. As shown in Fig. 4, we place the clamp against different backgrounds to increase the data complexity and enhance the robustness and generalization ability of the model.

Fig. 4: The 3D model and synthetic data (the clamp's 3D model, and synthetic data samples)

3.2 Results on Public Dataset Linemod

Linemod is a standard benchmark for 6D object pose estimation [7, 6]. The Linemod dataset consists of image sequences of 13 objects, each of which contains the ground-truth pose of a single object of interest in a cluttered environment. CAD models of all objects are also provided. This dataset presents many challenges for pose estimation: cluttered scenes, texture-less objects, and changes in lighting conditions.

In order to enhance the robustness of the network model, we apply the data augmentation of Section 2.1 to the original data, then train and validate on the augmented data. Fig. 5 shows some results of our experiments.

We can see that the predicted masks are quite accurate and basically consistent with the ground truth, except for some differences along the contours. In the keypoint predictions, the blue box is the ground truth and the red box is the predicted box; it can be seen that there is only a slight difference between them. Next, we use two objective indexes to evaluate the algorithm.

Average pixel accuracy. This is an index for image segmentation. It represents the proportion of pixels in each class that are correctly classified, averaged over all classes.

Pixel deviation in 2D projection. This is an index for augmented reality, 6DoF pose estimation, and other applications. A pose is considered correct if the average 2D distance between the projections of the object vertices under the estimated pose and under the ground-truth pose is less than 5 pixels.
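
A minimal sketch of this projection metric under a standard pinhole camera model; all names are ours:

```python
import numpy as np


def projection_error(model_pts, pose_est, pose_gt, K):
    """Mean 2D distance (pixels) between model vertices projected under
    the estimated and ground-truth poses; the pose counts as correct
    when this value is below 5 pixels.

    model_pts: Nx3 object vertices
    pose_*:    (R, t) pairs with R a 3x3 rotation and t a translation
    K:         3x3 camera intrinsic matrix
    """
    def project(R, t):
        cam = model_pts @ R.T + t.reshape(1, 3)   # to camera frame
        uv = cam @ K.T
        return uv[:, :2] / uv[:, 2:3]             # perspective divide

    err = np.linalg.norm(project(*pose_est) - project(*pose_gt), axis=1)
    return err.mean()
```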


Fig. 5: Mask and 6DoF results on the synthetic Linemod dataset. Note that a) shows six synthetic samples, b) shows the ground-truth masks, c) shows the keypoint predictions (red lines) compared with the ground truth (blue lines), and d) shows the predicted masks.

Table 1: Prediction accuracy (%) of the mask and 6DoF on the composite dataset

Targets       Mask   6DoF
Ape           97.7   95.1
BenchVise     98.8   97.7
Camera        97.4   95.5
Can           97.1   96.3
Cat           99.6   99.3
Driller       99.2   97.2
Duck          98.1   94.1
EggBox        99.0   96.6
Glue          99.4   95.6
HolePuncher   99.3   99.6
Iron          97.1   96.1
Lamp          97.9   94.8
Phone         99.9   99.0
Avg           98.5   96.8

3.3 Results on Power Manipulating Robot

The clamp results are shown in Fig. 6. Two scenes are demonstrated, and the 6DoF pose can be estimated regardless of the angle at which the clamp is placed.

Fig. 6: Clamp 6DoF results

Fig. 7 shows a simulated on-site test environment. The operating robot is a UR-series arm, and the camera is an Intel RealSense D415. The camera is set up in the middle of the robot base, and another robot arm is used to control its angle and direction, so as to improve the adaptability of the robot to the environment. There is no need to perceive and operate at a specific angle and position, and the system can operate on multiple cables from the same camera position.

Fig. 7: Manipulating robot and power transmission line

The accuracy requirements for the robot's visual perception are a displacement deviation of less than 1.5 cm and an angle deviation of less than 20°. These can be met using customized structural tools and the high-frequency force-control feedback provided by the control algorithm. Our measurements show that the displacement deviation is less than 1 cm and the angle deviation is less than 15°, which meets the visual-perception accuracy requirements of the electric distribution line manipulating robot.

4 Conclusion

This paper introduced a method for 6DoF pose estimation for manipulating robots. The method comprises four steps, i.e., data collection and pre-processing, instance segmentation, keypoint prediction, and 2D-to-3D pose estimation. We predicted the positions of 8 keypoints with a deep neural network and removed outliers with a particle swarm optimization algorithm. Finally, the PnP algorithm was used to realize the projection from 2D to 3D through the positions of the 8 keypoints and to obtain the 6DoF pose. Experimental results showed that the proposed method achieves high accuracy and has good application prospects in the perception systems of manipulating robots for electric distribution line maintenance.

References

1. Cao, Y., Ju, L., Zou, Q., Qu, C., Wang, S.: A multichannel edge-weighted centroidal Voronoi tessellation algorithm for 3D super-alloy image segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 17–24 (2011)

2. Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158 (2016)

3. Dian, S., Liu, T., Liang, Y., Liang, M., Zhen, W.: A novel shrimp rover-based mobile robot for monitoring tunnel power cables. In: 2011 IEEE International Conference on Mechatronics and Automation, pp. 887–892. IEEE (2011)

4. Fan, F., Wu, G., Wang, M., Cao, Q., Yang, S.: Multi-robot cyber physical system for sensing environmental variables of transmission line. Sensors 18(9), 3146 (2018)

5. Griepentrog, H.W., Jaeger-Hansen, C.L., Duhring, K., et al.: Electric agricultural robot with multi-layer-control. In: Proceedings of the International Conference of Agricultural Engineering (2012)

6. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Asian Conference on Computer Vision, pp. 548–562. Springer (2012)

7. Kaskman, R., Zakharov, S., Shugurov, I., Ilic, S.: HomebrewedDB: RGB-D dataset for 6D pose estimation of 3D objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019)

8. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529 (2017)

9. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81(2), 155 (2009)

10. Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: DeepIM: Deep iterative matching for 6D pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698 (2018)

11. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157. IEEE (1999)

12. Lu, S., Zhang, Y., Su, J.: Mobile robot for power substation inspection: a survey. IEEE/CAA Journal of Automatica Sinica (2017)

13. Ng, P.C., Henikoff, S.: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Research 31(13), 3812–3814 (2003)

14. Pavlakos, G., Zhou, X., Chan, A., Derpanis, K.G., Daniilidis, K.: 6-DoF object pose from semantic keypoints. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018. IEEE (2017)

15. Peng, S., Liu, Y., Huang, Q., Zhou, X., Bao, H.: PVNet: Pixel-wise voting network for 6DoF pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4561–4570 (2019)

16. Rad, M., Lepetit, V.: BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836 (2017)

17. Song, Y., Wang, H., Jiang, Y., Ling, L.: AApe-D: A novel power transmission line maintenance robot for broken strand repair. In: 2012 2nd International Conference on Applied Robotics for the Power Industry (CARPI), pp. 108–113. IEEE (2012)

18. Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6D object pose prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301 (2018)

19. Wang, B., Chen, X., Wang, Q., Liu, L., Zhang, H., Li, B.: Power line inspection with a flying robot. In: 2010 1st International Conference on Applied Robotics for the Power Industry, pp. 1–6. IEEE (2010)

20. Wu, Y., Hu, Z.: PnP problem revisited. Journal of Mathematical Imaging and Vision 24(1), 131–141 (2006)

21. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)

22. Zhang, X., Ma, Y., Fan, F., Zhang, Y., Huang, J.: Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. JOSA A 34(8), 1400–1410 (2017)