
Night-time Indoor Relocalization Using Depth Image with Convolutional Neural Networks

Ruihao Li1, Qiang Liu1, Jianjun Gui1, Dongbing Gu1 and Huosheng Hu1

Abstract— In this work, we present a Convolutional Neural Network (CNN) with depth images as its inputs to solve the relocalization problem of a moving platform in a night-time indoor environment. The developed algorithm can estimate the camera pose in an end-to-end manner with 0.40 m and 7.49° errors in real time at night. It does not require any geometric computation, as it directly uses a CNN for 6-DOF pose regression. The network architecture and the encoding methods for depth images are discussed. The proposed method is also evaluated on benchmark datasets collected with a motion capture system in our lab.

I. INTRODUCTION

Relocalization has been widely studied in the field of autonomous mobile robots. Using this technology, robots can localize themselves in environments that they have visited before or in which they previously got lost. It requires a pre-visit to the environment in order to collect sufficient images for network training, which makes it slightly different from Simultaneous Localization and Mapping (SLAM), which helps robots localize and map in an unknown environment without the requirement of pre-visits [1], [2].

When the lights are off or at night-time, RGB image solutions often fail because the image quality becomes poor, and most RGB image based SLAM techniques cannot produce accurate location information in these circumstances. However, depth images from infra-red sensors (RGB-D cameras) or laser range finders can provide valuable information during night-time [3], [4], since their quality is not affected by light.

Deep Convolutional Neural Networks (CNNs), which are designed for image processing, have proved their effectiveness in recognition and detection problems. They learn features directly from data and provide an end-to-end solution to perception problems [5], [6]. Recently, some researchers have also begun to use CNNs to solve pose regression and estimation problems [7], [8] without the need for geometric methods. However, these CNNs are mainly designed for color images, and depth images cannot be fed into them directly.

In this work, we leverage a deep convolutional neural network which uses depth images as its inputs to solve the relocalization problem in indoor environments during night-time. The architecture of the convolutional neural network

1 Ruihao Li, Qiang Liu, Jianjun Gui, Dongbing Gu and Huosheng Hu are with the School of Computer Science and Electronic Engineering, University of Essex, Colchester, CO4 3SQ, UK {rlig, qliui, jgui, dgu, hhu}@essex.ac.uk

for pose regression is presented. Different depth image preprocessing methods are discussed and compared in order to achieve satisfactory performance. Our main contribution is the use of a convolutional neural network with depth images as its input to solve the relocalization problem at night-time.

In the next section, we provide a review of related work. In Section III, we introduce the architecture of the deep convolutional neural network for pose estimation and the methods for depth image preprocessing. In Section IV, the evaluation experiments for the proposed method are performed and the system performance is analyzed. Finally, our work is concluded in Section V.

II. RELATED WORK

Relocalization using cameras has been widely studied in robotics and computer vision. In the early years, registration from a local point cloud (image) to a global point cloud (map) using Iterative Closest Point (ICP) [9] was usually adopted to solve relocalization problems. Steder et al. [10] extracted 3D point features from range images and realized place recognition and point cloud registration using ICP. Cupec et al. [11] used both line and planar surface segments detected by a Kinect camera as features to recognize places and then relocalize the camera. Fernandez-Moral et al. [12] also introduced a scene registration and localization strategy based on multi-planar patches characterized by normal, color, and so on. However, this kind of method is time-consuming and can hardly meet real-time requirements in large-scale environments.

After Nister et al. [13] proposed the bag-of-words technique to solve appearance-based recognition problems, this technique combined with image-to-image registration has been preferred in recent years for its better performance [14]. Cummins et al. [15] presented FAB-MAP using SURF features [16] to implement large-scale relocalization, but SURF feature computation is too costly to run in real time. Galvez-Lopez et al. [17] realized fast place localization using bags of words with FAST [18] and BRIEF [19] features. Mur-Artal et al. [20] proposed to use bags of words with ORB features [21], which can be computed within 10 ms and are rotation and scale invariant; in this way, robust and fast relocalization can be achieved. However, the bag-of-words technique must use point features extracted from color images, which limits its application to bright environments.

Shotton et al. [22] introduced a regression forest approach to relocalization using RGB-D cameras. The pre-trained forest can estimate the correspondences between 3D points in local frames and points in the global map without feature extraction.

Fig. 1: Convolutional Neural Network for relocalization. The input of the CNN is a colorized depth image of size 3×224×224, and the output of the CNN is the camera pose, including a rotation represented by a quaternion and a translation. In the figure, Conv denotes a convolutional layer, Icp an Inception module, and fc a fully connected layer. The network is trained with a Euclidean loss layer.

The camera pose was then estimated using robust optimization. A deep convolutional neural network with color images as inputs was proposed by Kendall et al. [7] to solve the pose regression problem. It produced fast relocalization results in large-scale environments with the lights on or in daytime. They later improved the convolutional neural network and modeled uncertainty for camera relocalization [8].

As for depth image preprocessing in deep learning, most of the literature focuses on recognition or detection problems. Lenz et al. [23] and Hinterstoisser et al. [24] used surface normal features as the three channels of each image pixel and acquired satisfying results for object detection. Gupta et al. [25] proposed to encode depth images using horizontal disparity, height above ground and angle with gravity (HHA) for each pixel of objects in images, but this cannot be applied to the entire image. Schwarz et al. [26] and Eitel et al. [27] used colorized depth images as the network input and also achieved good performance on object recognition with RGB-D cameras. In this paper, we discuss the encoding method of depth images for the pose regression problem.

III. RELOCALIZATION WITH DEPTH USING CONVOLUTIONAL NEURAL NETWORKS

A. Network architecture

Here we briefly introduce the architecture of the deep convolutional neural network used in this paper, PoseNet [7], as shown in Fig. 1. Based on GoogLeNet [28], a state-of-the-art deep neural network aimed at recognition

problems, PoseNet is a revised 23-layer network for solving pose regression problems. By adding fully connected layers (Fc, Fc-1, Fc-2) for extracting localization features and replacing the softmax classifier layers with Euclidean loss layers, the whole network converges to a 7-dimensional vector and becomes an end-to-end pose generator. The inputs of the network are images, and the outputs are camera poses composed of a translation (3 dimensions) and a rotation represented by a quaternion (4 dimensions).
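To make the output parameterization concrete, here is a minimal sketch of assembling such a 7-dimensional pose vector (Python with NumPy is our choice, not the paper's; the helper below is hypothetical, since the paper builds its labels from Vicon ground truth with unspecified code):

```python
import numpy as np

def make_pose_label(translation, quaternion):
    """Compose the 7-D target: 3-D translation followed by a 4-D unit quaternion."""
    t = np.asarray(translation, dtype=np.float64)  # (x, y, z) in metres
    q = np.asarray(quaternion, dtype=np.float64)   # (qw, qx, qy, qz)
    q = q / np.linalg.norm(q)                      # enforce the unit-norm constraint
    return np.concatenate([t, q])                  # shape (7,)

# Example: a pose 1.2 m forward with a rotation about the vertical axis.
label = make_pose_label([1.2, -0.3, 0.8], [0.92, 0.0, 0.38, 0.0])
```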

In the network shown in Fig. 1, there are nine Inception modules, which improve on the ideas of the Network in Network paper [29]. The architecture of the Inception module is shown in Fig. 2. By using Inception modules, one can increase the width of the network without increasing its computational complexity. Compared with similarly performing networks without Inception modules, networks with Inception modules run faster and outperform others of similar depth.

For a deep convolutional neural network designed for regression, a Euclidean loss layer, which targets least-squares regression tasks, is used instead of the softmax classifier layers. With an inner product layer as the input of the Euclidean layer, this formulates a linear least-squares problem. The Euclidean loss E is computed as shown below:

E = \frac{1}{2N} \sum_{n=1}^{N} \lVert \hat{x}_n - x_n \rVert_2^2 \qquad (1)

where x̂_n is the estimated vector produced by the CNN in an

Fig. 2: Architecture of the Inception module used in the network.

end-to-end manner and x_n is the labeled (ground truth) vector corresponding to the input.

In the CNN used in this paper, the camera pose p is composed of a translation x and a rotation represented by a unit quaternion q. Therefore, the Euclidean loss of the camera pose E_p can be computed as follows:

E_p = E_x + \lambda E_q \qquad (2)

where E_x is the translational Euclidean loss, E_q is the rotational Euclidean loss, and λ is a balance weight between the two. λ is a significant factor for pose regression because a rotation represented by a unit quaternion and a translation measured in metres have very different scales, so λ is introduced to weight the rotational loss in order to balance E_x and E_q.
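For concreteness, here is a minimal NumPy sketch of the weighted loss in Eqs. (1)-(2) (an illustration only: the paper realizes this with Caffe Euclidean loss layers, and the batch layout below is our assumption; the default lam follows the λ = 300 reported later in Table I):

```python
import numpy as np

def pose_loss(pred, target, lam=300.0):
    """Weighted Euclidean pose loss of Eqs. (1)-(2).

    pred, target: (N, 7) arrays laid out as [tx, ty, tz, qw, qx, qy, qz].
    lam: the balance weight lambda.
    """
    n = pred.shape[0]
    e_x = np.sum((pred[:, :3] - target[:, :3]) ** 2) / (2.0 * n)  # translational loss E_x
    e_q = np.sum((pred[:, 3:] - target[:, 3:]) ** 2) / (2.0 * n)  # rotational loss E_q
    return e_x + lam * e_q
```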

B. Depth image preprocessing

In daytime, or in environments with the lights on, color images, or both color and depth images, are used to locate the camera in robot vision. In dim or night-time circumstances, color images fail and can no longer be used, so other sensors, such as lasers or sonars, are employed instead. In this paper, depth images from an RGB-D camera are applied as the input of the CNN to relocalize the camera.

Convolutional neural networks are typically designed for color images, where each pixel channel represents optical intensity. In contrast, a pixel in a depth image represents the scaled distance from an object to the optical center of the camera. If raw depth images are used directly as CNN input, the performance is usually unsatisfactory, so some preprocessing steps should be taken before feeding depth images into the CNN.

The simplest method to process depth images is to rescale the depth values from 0–10000 to 0–255. The result is a single-layer grayscale image, shown in Fig. 3(a). Based on the single-layer grayscale depth image, we can create three duplicates of the grayscale layer and extend the image to a triple-layer scaled depth image, shown in Fig. 3(b). There is no big difference between the two methods above, except that the transfer learning models are different. Another depth preprocessing method is to compute surface normals from the depth images [23] [24]. The absolute values of the normal [n_x, n_y, n_z] are computed first and then rescaled from 0–1 to 0–255. The normalized depth image is shown in Fig. 3(c). From Fig. 3(c), we can see that all pixels on the ground have the same color (green) even though they have different depth values; the color of each pixel depends on its normal, so objects can be distinguished clearly in the normalized depth image. In this paper, colorized depth images are adopted as the input of the CNN, as shown in Fig. 3(d). Each intensity value in the grayscale image corresponds to a color value in a jet color map: as the intensity increases from 0 to 255, the color changes from blue to green, then yellow to red. The ground pixels in Fig. 3(d) show clearly different colors, and at the same time, pixels on the same object with distinguishable depth hierarchies are labeled with different colors.
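A minimal sketch of the adopted colorization step, assuming OpenCV (the paper does not name its implementation; the 0–10000 mm range follows the rescaling described above):

```python
import cv2
import numpy as np

def colorize_depth(depth_mm, max_mm=10000):
    """Map a 16-bit depth image (millimetres) to a 3-channel jet-colorized image.

    depth_mm: uint16 depth frame; 0 denotes a missing reading.
    Near values map to blue, far values through green and yellow to red.
    """
    gray = np.clip(depth_mm.astype(np.float32) / max_mm, 0.0, 1.0)  # rescale 0-10000 mm
    gray = (gray * 255).astype(np.uint8)                            # to 0-255 grayscale
    return cv2.applyColorMap(gray, cv2.COLORMAP_JET)
```

The single-layer and triple-layer variants described above correspond to using the intermediate grayscale image directly, either as one channel or stacked three times.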

Another problem that needs to be considered is the image size. The ConvNet shown in Fig. 1 requires an input image of size 224×224, while the original depth image from the RGB-D camera is 640×480. In this paper, depth images are resized to 341×256 and then randomly cropped to the required CNN input size. We also tried resizing the original depth images directly to 224×224 so as to keep all the depth information, but found that the relocalization performance was not as good as expected.

There are also some missing values in the original depth images from the RGB-D camera. These values are simply set to zero, since it turns out that other fill-in methods make no big difference.
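Putting the preprocessing steps together, the input pipeline might look like the following sketch (OpenCV and NumPy are assumptions; the random-crop bounds follow the 341×256 and 224×224 sizes given above):

```python
import cv2
import numpy as np

def preprocess(depth_mm, crop=224, max_mm=10000):
    """Prepare one 640x480 depth frame: colorize with a jet map, resize to
    341x256, then take a random crop of size crop x crop (224 by default).
    Missing readings arrive as 0 from the sensor and are left at 0."""
    gray = (np.clip(depth_mm.astype(np.float32) / max_mm, 0, 1) * 255).astype(np.uint8)
    img = cv2.applyColorMap(gray, cv2.COLORMAP_JET)  # 3-channel colorized depth
    img = cv2.resize(img, (341, 256))                # cv2.resize takes (width, height)
    y = np.random.randint(0, 256 - crop + 1)         # random crop offsets
    x = np.random.randint(0, 341 - crop + 1)
    return img[y:y + crop, x:x + crop]               # HWC; transpose to CHW for Caffe
```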

IV. EXPERIMENT

The evaluation of the proposed method for night-time indoor relocalization is conducted on benchmark datasets collected in our Robot Arena lab, which is equipped with a Vicon motion capture system. The motion capture system, consisting of eight high-speed cameras, can provide trajectory ground truth even in dim or night-time environments.

The deep convolutional neural network is implemented using the Caffe library [30] and is trained on a desktop equipped with an Nvidia GeForce Titan X GPU and an Intel Core i7-4790 4.0 GHz CPU. The significant parameters of the training procedure are shown in Table I.
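For reference, the Table I settings map onto a Caffe solver roughly as follows. This is a sketch under stated assumptions: the network definition file name is hypothetical, and converting the 80-epoch step size into iterations assumes the roughly 5000 training images are processed in batches of 32 (about 12500 iterations), which the paper does not state:

```python
import caffe

# Hypothetical solver matching Table I; stepsize assumes
# 80 epochs * (5000 images / batch size 32) ~= 12500 iterations.
SOLVER = """net: "posenet_depth.prototxt"
type: "SGD"
base_lr: 0.00001
lr_policy: "step"
stepsize: 12500
gamma: 0.94
momentum: 0.9
"""

with open("solver.prototxt", "w") as f:
    f.write(SOLVER)

solver = caffe.SGDSolver("solver.prototxt")  # stochastic gradient descent, per Table I
solver.solve()
```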

A. Dataset

In order to evaluate the performance of our CNN using depth images as input in a night-time environment, we first

Fig. 3: Different processing methods for a depth image. (a) Single-layer scaled depth image. (b) Triple-layer scaled depth image. (c) Normalized depth image with the computed normal components as the three image channels. (d) Colorized image using a jet color map.

TABLE I: Significant parameters for training the convolutional neural network for pose regression.

Weight λ               300
Solver type            Stochastic gradient descent (SGD)
Base learning rate     0.00001
Learning rate policy   Step
Step size              80 epochs
Gamma                  0.94
Momentum               0.9

collected several image sequences, all with the lights on or in daytime. Depth images from these sequences form the training dataset. We then separately collected images with the lights off and images at night-time, which are used as the testing dataset. The quality of the depth images is not affected by light. The RGB-D camera used here is an Xtion. The training dataset contains about 5000 images, and the testing dataset for evaluation contains about 700 images.

Fig. 4 shows images captured in different circumstances. As we can see, when the lights in the Arena lab are off or the sun goes down, the quality of the color images is seriously degraded; even a person can hardly tell where the camera is by looking at them. In contrast, light has little influence on depth images.

B. Evaluation

Different methods of processing depth images are compared first. As shown in Table II, normalized images and colorized images are better than single-layer and triple-layer scaled images. Triple-layer scaled images outperform single-layer images even though they have a similar appearance. One reason is that the triple-layer images benefit from transfer learning from models pre-trained on recognition problems; another is that the network can learn more detailed information from three layers. The performance of normalized images is almost the same as that of colorized ones in position, and better in orientation. A possible reason is that the CNN here pays more attention

TABLE II: Relocalization performance in the night-time environment with different kinds of input to our CNN.

Input              Median translation error   Median rotation error
Dim color          1.24 m                     29.10°
Single depth       0.58 m                     10.52°
Triple depth       0.51 m                     8.66°
Normalized depth   0.41 m                     7.15°
Colorized depth    0.40 m                     7.49°

to relative shape structures than to absolute depth differences when estimating orientation. Color images were also used to generate poses, but their performance at night-time is disappointing compared with depth images.

The labeled positions of the training images (red) and the test images (blue), together with the predicted positions for the test dataset (green), are plotted in Fig. 5. Considering both the convenience of depth image preprocessing and the performance of the system, we choose colorized images as the network input. The experimental area is about 4×4×1 m³. The poses predicted by the deep convolutional neural network from depth images lie around the Vicon ground truth. The error is about 0.40 m in translation and 7.49° in rotation, and the processing time per image is about 5 ms, which meets the real-time requirement.
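The median errors reported above can be computed from predicted and ground-truth pose vectors as in the sketch below (our evaluation code, not the authors'; the rotation error uses the standard quaternion geodesic angle 2·arccos|⟨q1, q2⟩|):

```python
import numpy as np

def median_errors(pred, gt):
    """Median translation (m) and rotation (deg) errors for (N, 7) pose arrays."""
    t_err = np.linalg.norm(pred[:, :3] - gt[:, :3], axis=1)
    # Normalize the quaternions, then compute the geodesic angle between them.
    q1 = pred[:, 3:] / np.linalg.norm(pred[:, 3:], axis=1, keepdims=True)
    q2 = gt[:, 3:] / np.linalg.norm(gt[:, 3:], axis=1, keepdims=True)
    dot = np.clip(np.abs(np.sum(q1 * q2, axis=1)), 0.0, 1.0)
    r_err = np.degrees(2.0 * np.arccos(dot))
    return np.median(t_err), np.median(r_err)
```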

V. CONCLUSIONS

In this paper, we have introduced a novel night-time indoor relocalization method using depth images with a deep convolutional neural network. The CNN introduced here can generate camera poses in an end-to-end manner. Different depth image preprocessing methods for pose regression networks were compared. In the future, we would like to investigate more accurate and robust networks for pose regression.

Fig. 4: Depth and color images collected in different situations in the Robot Arena lab, including dim, night-time and daytime environments. (a) Depth image with a window. (b) Color image with a window, lights off. (c) Color image with a window, lights on. (d) Depth image with no windows. (e) Color image with no windows, lights off. (f) Color image with no windows, lights on. (g) Depth image without windows. (h) Color image without windows at night. (i) Color image without windows in the daytime.

ACKNOWLEDGMENT

The authors would like to thank Robin Dowling for his support in the experiments. We thank Sen Wang from the University of Oxford, Ian Lenz from Cornell University and Jiaxiang Wu from the Chinese Academy of Sciences for their valuable comments and suggestions. The first three authors have been financially supported by scholarships from the China Scholarship Council.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and mapping: Part I,” IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99–110, 2006.

[2] T. Bailey and H. Durrant-Whyte, “Simultaneous localization and mapping: Part II,” IEEE Robotics & Automation Magazine, vol. 13, no. 3, pp. 108–117, 2006.

[3] Z. Zhang, “Microsoft Kinect sensor and its effect,” IEEE MultiMedia, vol. 19, no. 2, pp. 4–10, 2012.

[4] J. Sell and P. O’Connor, “The Xbox One system on a chip and Kinect sensor,” IEEE Micro, no. 2, pp. 44–53, 2014.

Fig. 5: Relocalization performance using depth images with the deep convolutional neural network. The 2D positions of the training images are shown in red; blue points represent the ground-truth positions of the testing images; green points indicate the positions predicted by the CNN in an end-to-end manner.

[5] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.

[6] Q. Liu, R. Li, H. Hu, and D. Gu, “Extracting semantic information from visual data: A survey,” Robotics, vol. 5, no. 1, p. 8, 2016.

[7] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2938–2946.

[8] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” arXiv preprint arXiv:1509.05909, 2015.

[9] P. J. Besl and N. D. McKay, “Method for registration of 3-D shapes,” in Robotics-DL Tentative. International Society for Optics and Photonics, 1992, pp. 586–606.

[10] B. Steder, G. Grisetti, and W. Burgard, “Robust place recognition for 3D range data based on point features,” in 2010 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2010, pp. 1400–1405.

[11] R. Cupec, E. K. Nyarko, D. Filko, A. Kitanov, and I. Petrovic, “Place recognition based on matching of planar surfaces and line segments,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 674–704, 2015.

[12] E. Fernandez-Moral, P. Rives, V. Arevalo, and J. Gonzalez-Jimenez, “Scene structure registration for localization and mapping,” Robotics and Autonomous Systems, vol. 75, pp. 649–660, 2016.

[13] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 2006, pp. 2161–2168.

[14] B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. Tardos, “A comparison of loop closing techniques in monocular SLAM,” Robotics and Autonomous Systems, vol. 57, no. 12, pp. 1188–1197, 2009.

[15] M. Cummins and P. Newman, “Appearance-only SLAM at large scale with FAB-MAP 2.0,” The International Journal of Robotics Research, vol. 30, no. 9, pp. 1100–1123, 2011.

[16] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in Computer Vision–ECCV 2006. Springer, 2006, pp. 404–417.

[17] D. Galvez-Lopez and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.

[18] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Computer Vision–ECCV 2006. Springer, 2006, pp. 430–443.

[19] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” Computer Vision–ECCV 2010, pp. 778–792, 2010.

[20] R. Mur-Artal and J. D. Tardos, “Fast relocalisation and loop closing in keyframe-based SLAM,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 846–853.

[21] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.

[22] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.

[23] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.

[24] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 858–865.

[25] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in Computer Vision–ECCV 2014. Springer, 2014, pp. 345–360.

[26] M. Schwarz, H. Schulz, and S. Behnke, “RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1329–1335.

[27] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust RGB-D object recognition,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 681–687.

[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[29] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.

[30] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.