
RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes

Mertalp Ocal and Armin Mustafa

Center for Vision, Speech and Signal Processing, University of Surrey, UK

{m.ocal,a.mustafa}@surrey.ac.uk

Abstract. We present a generalised self-supervised learning approach for monocular estimation of the real depth across scenes with diverse depth ranges from 1–100s of meters. Existing supervised methods for monocular depth estimation require accurate depth measurements for training. This limitation has led to the introduction of self-supervised methods that are trained on stereo image pairs with a fixed camera baseline to estimate disparity, which is transformed to depth given known calibration. Self-supervised approaches have demonstrated impressive results but do not generalise to scenes with different depth ranges or camera baselines. In this paper, we introduce RealMonoDepth, a self-supervised monocular depth estimation approach which learns to estimate the real scene depth for a diverse range of indoor and outdoor scenes. A novel loss function with respect to the true scene depth, based on relative depth scaling and warping, is proposed. This allows self-supervised training of a single network with multiple data sets for scenes with diverse depth ranges, from both stereo pair and in the wild moving camera data sets. A comprehensive performance evaluation across five benchmark data sets demonstrates that RealMonoDepth provides a single trained network which generalises depth estimation across indoor and outdoor scenes, consistently outperforming previous self-supervised approaches.1,2

Keywords: Monocular Depth Estimation, Self-Supervised

1 Introduction

Humans comprehend 3D from a single viewpoint by leveraging knowledge of context together with shape and appearance priors. Similar to human visual perception, robust computer vision systems require the ability to perceive the environment in 3D. This has motivated research in monocular depth estimation. The problem of recovering depth from a single image is ill-posed due to the projection ambiguity. Supervised methods have been proposed to estimate depth from a monocular image, demonstrating promising results by training on a large amount of dense ground-truth depth data; however, this is expensive and impractical to acquire for real-world scenes [12,14,15].

1 Code will be released upon publication.
2 https://youtu.be/6ot3hy3rGaA


Fig. 1. Top row: scenes with different depth ranges: left, large (outdoor, range ≈ 500 m); middle, medium (outdoor, range ≈ 25 m); and right, small (meeting room, range ≈ 10 m). Middle row: depth from the proposed method. Bottom row: depth histogram of the ground truth.

An alternative approach is to generate synthetic data by rendering from computer-generated models [41,5,49,58] or 3D scans [8,10,1], but it is challenging to create data that represents the variety and detail of real-world appearance. Transfer from training on synthetic data to real scenes also remains a challenging open problem. Instead of regressing depth from raw pixels, self-supervised learning methods reformulate depth estimation as an image reconstruction problem by re-synthesising a target view from a single source view without ground-truth depth [16,18]. Commonly these methods use stereo image pairs for training with a fixed camera baseline and require scenes with a fixed depth range. These methods learn to estimate inter-image disparity and then use camera calibration to estimate depth.

Learning monocular depth across diverse scenes is a challenging problem due to large changes in depth range. Typically indoor scenes have a depth range of < 10 m, whereas outdoor scenes commonly span 100s of meters. Fig. 1 shows some typical scenes with different depth ranges. Monocular depth estimation should be able to estimate depth across scenes with a wide variation in the depth range. It is infeasible to train a deep network that can regress depth values from raw pixels when the output space is incompatible across scenes. Existing self-supervised methods can be trained only on data sets with similar depth ranges [19,63,37,62,40], limiting the number of images that can be used for training. As a result, they demonstrate poor generalisation performance and can only perform specific tasks, such as depth estimation in outdoor driving scenes with a fixed stereo baseline. Recently, some supervised approaches [32,33,7] addressed this issue by normalising the ground truth to have the same scale, which allows learning relative depth values on moving camera data sets with a predefined depth range.


Fig. 2. Overview of the RealMonoDepth approach for sample input images. For a given single-view input image (a), the network (Relative DepthNet) predicts (b) a relative depth map, which is scaled (scale transform) and warped to obtain (c) an estimate of the real depth, enabling self-supervised training across scenes with diverse depth ranges.

However, these methods still depend on ground-truth depth values in order to estimate depth from an image.

In this paper, we propose a self-supervised method, RealMonoDepth, that allows a single network to estimate depth for indoor and outdoor scenes, demonstrating improved accuracy over previous self-supervised monocular depth estimation approaches. The proposed network learns real depth from stereo-pair and moving camera data sets for scenes with a diverse depth range and without a fixed camera baseline. An overview of our approach for scenes with varying depth ranges (small, medium and large) is depicted in Fig. 1. To enable self-supervised learning of depth estimation across multiple scene scales, we introduce a novel loss function to learn relative depth together with the real depth. The network learns the real depth by transferring relative depth through scaling and warping, which is used to compute the reconstruction loss for self-supervised training, as shown in Fig. 2. The contributions of this work are:

– A self-supervised monocular depth estimation method that is able to generalise learning across scenes with different depth ranges.

– A novel loss function over real depth for self-supervised learning of depth from a single image.

– Evaluation on five benchmark data sets demonstrates generalisation across indoor and outdoor scenes with improved performance over previous work.

2 Related Work

Although it is not possible to perform metric reconstruction from a single image due to the projective scale ambiguity, recently proposed learning-based methods have demonstrated that a reliable estimate of the scene geometry can be generated by having prior knowledge of the scale of objects in the scene. This section reviews the supervised, unsupervised and self-supervised approaches that take a single RGB image as input and estimate per-pixel depth as output.


2.1 Supervised Depth Estimation

Supervised single-image dense depth prediction methods exploit depth values obtained from active sensors such as Kinect and LIDAR as ground truth for training. Eigen et al. [14] and Fu et al. [15] exploited regression losses to estimate depth for indoor and outdoor scenes respectively. Mayer et al. [38] demonstrated that training a fully convolutional network [36] is an effective approach to learn disparity from stereo images. They also generated a large synthetic data set to train their network, called DispNet, demonstrating improved performance on scenes with diverse depth ranges. However, the network trained on synthetic data gives limited performance on real in the wild scenes. Huang et al. [23] proposed a deep CNN that leverages multi-view stereo images with known camera poses and calibration to generate patch plane-sweep volumes with respect to a reference view. Matching is performed between these patches to produce the inverse disparity map for the reference view. Instead of generating plane-sweep volumes, Yao et al. [60] applied differentiable homography warping on 2D feature maps learned from multiple input views to produce a 3D cost volume. However, these methods suffer from the limited availability of multi-view data of real scenes with ground-truth depth, thereby relying heavily on synthetic data to train their networks, which leads to poor generalisation capability on complex real scenes. Hence these supervised methods require ground-truth depth or synthetic data for monocular depth estimation on real scenes.

2.2 Unsupervised and Self-Supervised Depth Estimation

Recently, unsupervised and self-supervised methods for depth estimation have gained attention, eliminating the requirement of ground-truth depth data for real scenes. Unsupervised methods simultaneously estimate the depth and pose from a single RGB image, and self-supervised methods exploit camera pose information estimated as a pre-process to obtain depth from a single monocular image.

Zhou et al. [63] pioneered the work in unsupervised depth estimation by proposing separate deep CNN networks for pose estimation between unlabelled video sequences and for single-view depth estimation. Instead of training an additional pose estimator network, Wang et al. [54] implemented a differentiable version of Direct Visual Odometry (DVO) [51], which is popularly used in current SLAM algorithms [39,11]. Specifically, DVO solves for the pose by minimising the warping loss of the reference frame from the source frame given the reference frame depth. Furthermore, they introduced a depth normalisation layer in order to address the scale ambiguity problem, which significantly outperformed [63]. Inspired by the relation between the depth, pose and optical flow tasks in 3D scene geometry, Yin et al. [61] jointly trained an unsupervised end-to-end deep network to predict pose and depth for non-rigid objects. An optical flow consistency check is imposed between backward and forward flow estimations for reliable estimation. However, all these unsupervised methods give limited performance on general scenes because of the ambiguity in the projection scale introduced by both unknown depth and pose.


Self-supervised methods exploit known camera pose information to resolve the depth ambiguity given estimates of disparity. Garg et al. [16] introduced a self-supervised learning approach to train a deep network from stereo pairs by exploiting the epipolar relation between the cameras, given a known calibration, to generate an inverse warp of the left view that reconstructs the right view. Although their results are impressive, they use a non-differentiable Taylor series expansion to perform warping of disparity. Godard et al. [18,19] imposed left-right consistency as a constraint for disparity regularisation and established a differentiable optimisation by leveraging spatial transformer networks [24] for bilinear sampling. Given a stereo pair as input for training, they estimate two disparity maps: the left-view disparity with respect to the right view and the right-view disparity with respect to the left view. They then reconstruct both views and also the disparity maps to achieve left-right consistency. Moreover, they imposed edge-aware smoothness as an additional regularisation along with left-right consistency and leveraged an SSIM [56] loss in addition to the L1 reconstruction loss. Poggi et al. [42] proposed an improvement to [18] by leveraging three rectified views instead of a stereo pair in order to establish additional disparity consistency. Inspired by recent successes of deep learning in single-image super-resolution, Pillai et al. [40] employ sub-pixel convolutional layers instead of resizing convolution layers. Contrary to previous methods that are limited to low-resolution operation, their method exploits high fidelity for better self-supervision. Existing self-supervised approaches estimate disparity and assume a fixed camera baseline during training. This limits the approaches to training for scenes with a similar depth range and does not allow generalisation across diverse indoor and outdoor scenes, or diverse data sets for training. In this paper, we introduce an approach to self-supervised learning using a loss function based on estimates of the true scene depth. This allows generalisation across both stereo pair and moving camera data sets for scenes with different depth ranges.

Both self-supervised and unsupervised methods suffer from the following limitations: the requirement of a fixed camera baseline; limited generalisation performance on scenes with varying depth ranges (indoor/outdoor); and, for self-supervised methods, the estimation of disparity, which only works for training on stereo image pairs with a fixed baseline and therefore limits the training set. The proposed self-supervised depth estimation method addresses all of these limitations by generalising learning across scenes with different depth ranges; it works for both stereo and moving camera data sets, improving the generalisation and accuracy of depth estimation over previous approaches.

3 Method

This paper introduces a self-supervised single-image depth estimation approach that is able to generalise learning across scenes with diverse depth ranges. The method is trained on both stereo image pairs and moving camera data sets, giving state-of-the-art performance across five benchmark data sets. The method estimates depth from a single-view image; an overview is shown in Fig. 2.


Fig. 3. Training framework for the proposed self-supervised single-image depth estimation. Two views I1 and I2 are passed through Relative DepthNet (shared weights) to predict relative depth maps D1,Rel and D2,Rel, which are scale-transformed to real depth maps D1 and D2; the images and depth maps are then warped between views (I′1, I′2, D′1, D′2) to compute the photometric (Lph) and geometric consistency (Lgc) losses.

The proposed network is trained on two views of a static scene. The Relative DepthNet network estimates relative depth maps from the two views, inspired by [19], which estimates disparity maps between a stereo pair. As a pre-process, camera calibration is estimated from sparse correspondences between the two views using an existing visual SFM method, COLMAP [48]. The sparse correspondences are used to estimate the median scene depth for the scale transformation, which is applied to the relative depth maps to obtain real depth maps (the true scene depth). The real depth maps and images are warped between views using the calibration to estimate the loss. This enables the network to be trained on moving camera and stereo data sets of real scenes with varying depth ranges and camera view baselines, generalising performance across a variety of indoor and outdoor scenes in the wild.

3.1 Training Framework

The training framework for the proposed approach is illustrated in Fig. 3 for learning single-view depth estimation from two viewpoint images. Given two images of a scene from different viewpoints (I1, I2), the depth network (Relative DepthNet) predicts the corresponding per-pixel relative depth maps (D1,Rel, D2,Rel) using shared weights. The relative depth maps are transformed to real depth (D1, D2) using the scale transform module. The self-supervised loss is then computed using the warped real depth estimates and warped images. Using the estimated calibration and camera poses, the real depth values allow us to reconstruct the input images (I′1, I′2) and depth maps (D′1, D′2). This information is interpolated to compute the photometric loss (Lph) and geometric consistency loss (Lgc), which supervise the depth network. SSIM and smoothness losses are introduced to regularise the depth estimation. The loss function for the proposed method is:


$$ L_{final} = \sum_{s} \left( \lambda_{ph} L^{s}_{ph} + \lambda_{gc} L^{s}_{gc} + \lambda_{ssim} L^{s}_{ssim} + \lambda_{smooth} L^{s}_{smooth} \right) \qquad (1) $$

where s indexes over different image scales and λph, λgc, λssim and λsmooth are the weighting terms. Extension to training with multiple viewpoint images (>2) is straightforward (using losses Li,ph and Li,gc based on the estimated real scene depth Di for each input image i). While existing self-supervised methods are only suited for training on fixed-baseline cameras and scenes with a similar depth range, we overcome this limitation by stabilising the output space of the depth network, allowing training on indoor and outdoor scenes with diverse depth ranges.
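For concreteness, the multi-scale weighted combination of Equation 1 might be assembled as in the following minimal Python sketch. The function, dictionary and variable names are hypothetical placeholders for this illustration (not the authors' released code), and the per-scale downsampling factors are an assumption based on Section 4.1:

```python
def combine_losses(per_scale_losses, weights, downsampling_factors):
    """Weighted multi-scale loss of Equation 1. `per_scale_losses` is a list
    with one dict per image scale s, holding the scalar terms 'ph', 'gc',
    'ssim' and 'smooth'; `weights` holds the corresponding lambdas. The
    smoothness weight is divided by the per-scale downsampling factor,
    following Sec. 4.1 (0.01 / s)."""
    total = 0.0
    for losses, s in zip(per_scale_losses, downsampling_factors):
        total += (weights["ph"] * losses["ph"]
                  + weights["gc"] * losses["gc"]
                  + weights["ssim"] * losses["ssim"]
                  + (weights["smooth"] / s) * losses["smooth"])
    return total


# Hypothetical usage with the weights reported in Sec. 4.1 and assumed
# downsampling factors for the four output scales of the decoder.
weights = {"ph": 0.15, "gc": 0.1, "ssim": 0.85, "smooth": 0.01}
downsampling_factors = [1, 2, 4, 8]
```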

Relative DepthNet Architecture: Based on the U-Net architecture [44], in order to effectively capture both local and global information, we use a multi-scale encoder-decoder network with skip connections, similar to [18]. Inspired by [21,19,27], we select ResNet50 [22] for the encoder, with weights initialised from ImageNet [46] pre-training, and randomly initialise the decoder weights. Unlike popular self-supervised/unsupervised single-image depth estimation approaches [18,19,37,63,40,31], our network predicts relative depth instead of inverse depth or disparity. We observe that applying a sigmoid activation at the network output slows convergence for close and far depth values due to the vanishing gradient problem. Instead, we replace the sigmoid with an identity activation and handle negative values with an exponential mapping, detailed in the Scale Transform section. Apart from the multi-scale output layers of the decoder, we apply batch normalisation and ReLU nonlinearities in all layers. A detailed explanation with the full details of the network architecture required for implementation is presented in the supplementary material.

Scale Transform: To learn to estimate depth from images across diverse scenes with varying depth ranges, we normalise depth across the images and data sets using a non-linear scale transform and train the network to estimate relative depth. Given the relative depth map prediction as input, our scale transform module outputs the real depth map. This is formulated as:

$$ D_k = \mu_k \, e^{D_{k,Rel}} \quad \text{for } k = 1, 2 \qquad (2) $$

where µk is the median depth value for the two images I1 and I2, and D1 and D2 are the real scene depth maps. Inspired by supervised methods [14,15], we use the exponential mapping in Equation 2 in order to reduce the penalisation of the deep network from distant and ambiguous depth values.
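As a minimal sketch of Equation 2 (assuming NumPy arrays; the function name is a placeholder, not the authors' code):

```python
import numpy as np

def scale_transform(d_rel, median_depth):
    """Equation 2: D_k = mu_k * exp(D_k,Rel). `d_rel` is the relative depth
    map predicted with an identity output activation (it may be negative);
    `median_depth` is the median scene depth estimated from sparse SfM
    points. The ablation in Sec. 4.4 omits the median term, i.e. exp(d_rel)."""
    return median_depth * np.exp(d_rel)
```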

During training, camera calibration is required to estimate the median depth value for the scale transform. For data sets with unknown calibration (e.g. the Mannequin Challenge data set), the off-the-shelf SFM method COLMAP [48] is used to solve for camera calibration and sparse correspondences between views. If the camera calibration is known (e.g. the KITTI data set), we use the calibration to compute sparse correspondences between views. The sparse correspondences are triangulated in 3D, exploiting the camera poses, to get a sparse reconstruction of the scene.


The sparse 3D points are projected on each view to obtain sparse depth maps for each viewpoint. Depth values are then sorted and the median depth value is estimated, as shown in Fig. 1. The median depth enables prediction of real depth maps, which are used together with the input images to estimate the loss in Equation 1.
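The median depth used in the scale transform might be computed from the sparse reconstruction along the following lines (a sketch; the world-to-camera rotation R and translation t convention is an assumption of this illustration):

```python
import numpy as np

def sparse_median_depth(points_3d, R, t):
    """Median scene depth for one view from the sparse SfM reconstruction.
    `points_3d` is an (N, 3) array of triangulated world points; `R` (3x3)
    and `t` (3,) are assumed to be the world-to-camera rotation and
    translation. A point's depth is its z-coordinate in the camera frame."""
    cam_points = points_3d @ R.T + t      # transform points into the camera frame
    depths = cam_points[:, 2]
    depths = depths[depths > 0]           # keep only points in front of the camera
    return float(np.median(depths))
```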

Loss Functions:
Photometric consistency: This loss enables estimation of depth and is inspired by self-supervised learning approaches that reformulate depth estimation as an image reconstruction problem [16,18]. The underlying intuition is that every viewpoint is a 2D projection of the same 3D scene, so one view can be reconstructed from another view, which implies knowledge about depth. With known calibration between views, depth is treated as an intermediate variable to perform novel-view synthesis with a deep network. Here, I′1 is the reference view reconstructed from I2 and I′2 is the reference view reconstructed from I1. Let the matrices K1, K2 represent the intrinsic parameters of I1 and I2 respectively, P1→2, P2→1 denote the relative camera pose matrices between views, and D′2(p′(2)), D′1(p′(1)) represent the projected depth values for each pixel p. Then, the homogeneous pixel-wise projection relations (D′proj) between the two views can be formulated as:

$$ D'_{proj,2}(p'^{(2)})\, p'^{(2)} = K_2 \, P_{1 \to 2} \, D_1(p^{(1)}) \, K_1^{-1} p^{(1)}, \qquad D'_{proj,1}(p'^{(1)})\, p'^{(1)} = K_1 \, P_{2 \to 1} \, D_2(p^{(2)}) \, K_2^{-1} p^{(2)} \qquad (3) $$
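A NumPy sketch of the projection relation in Equation 3 follows; the 4×4 homogeneous relative-pose convention and the function name are assumptions of this illustration, not the authors' implementation:

```python
import numpy as np

def project_to_other_view(depth1, K1, K2, P_1to2):
    """Pixel-wise projection of Equation 3: back-project every pixel of view 1
    with its depth D1(p), transform it with the relative pose P_1to2 (here a
    4x4 homogeneous matrix) and re-project it with K2. Returns the projected
    pixel coordinates p'^(2) and the projected depths D'_proj,2."""
    h, w = depth1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3xN homogeneous pixels
    rays = np.linalg.inv(K1) @ pix                   # K1^-1 p
    cam1 = rays * depth1.reshape(1, -1)              # D1(p) K1^-1 p
    cam1_h = np.vstack([cam1, np.ones((1, cam1.shape[1]))])
    cam2 = (P_1to2 @ cam1_h)[:3]                     # apply P_1->2
    proj = K2 @ cam2                                 # K2 P_1->2 D1(p) K1^-1 p
    depth_proj = proj[2]                             # projected depth D'_proj,2
    uv = proj[:2] / depth_proj                       # projected pixel locations p'^(2)
    return uv.T.reshape(h, w, 2), depth_proj.reshape(h, w)
```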

Since the projected pixel locations are continuous, we use a spatial transformer network [24] to perform differentiable bilinear sampling in order to approximate I′1(p(1)) and I′2(p(2)) by interpolating I2 and I1 from the neighbouring corner pixels of the projected locations:

$$ I'_1(p^{(1)}) = I_2(p^{(2)}) = \sum_{(i,j)} w^{(2)}_{(i,j)} I_2\big(p'^{(2)}_{(i,j)}\big), \qquad I'_2(p^{(2)}) = I_1(p^{(1)}) = \sum_{(i,j)} w^{(1)}_{(i,j)} I_1\big(p'^{(1)}_{(i,j)}\big) $$

Here, (i, j) indexes the top-left, top-right, bottom-left and bottom-right pixels around the projected location, and w(i,j) is the corresponding weight, which is inversely proportional to the spatial distance. Due to occlusions between viewpoints, some pixels in the reference view are projected outside of the image plane boundary in the source view. In order to prevent these unresolved regions from penalising the network, similar to Mahjourian et al. [37], a binary validity mask Mk is computed to exclude these pixels from the training loss [19]. Hence, the L1 image reconstruction loss for the two views k = 1, 2 becomes:

$$ L_{ph} = \frac{1}{N} \sum_{p}^{N} \sum_{k=1,2} L_{k,ph}(p), \quad \text{where } L_{k,ph}(p) = |I_k(p) - I'_k(p)| \, M_k(p) \qquad (4) $$

An SSIM [56] loss is also applied to 3×3 image patches in order to regularise the noisy artifacts caused by the L1 loss, defined as follows:

$$ L_{ssim} = \frac{1}{N} \sum_{p}^{N} \Big[ \big(1 - SSIM(I_1, I'_1)(p)\big) + \big(1 - SSIM(I_2, I'_2)(p)\big) \Big] \qquad (5) $$
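The warping and masked reconstruction loss of Equations 4 and 5 could be sketched as follows (NumPy, single-channel images for brevity; function names are placeholders, and the SSIM term would be accumulated analogously over 3×3 patches):

```python
import numpy as np

def bilinear_sample(img, uv):
    """Bilinear sampling of a single-channel image `img` (H x W) at the
    continuous projected locations `uv` (H x W x 2, ordered as (x, y)).
    Returns the warped image and a binary validity mask for samples that
    fall inside the image boundary (used as M_k in Equation 4)."""
    h, w = img.shape
    x, y = uv[..., 0], uv[..., 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    valid = (x0 >= 0) & (y0 >= 0) & (x1 <= w - 1) & (y1 <= h - 1)
    x0c, x1c = np.clip(x0, 0, w - 1), np.clip(x1, 0, w - 1)
    y0c, y1c = np.clip(y0, 0, h - 1), np.clip(y1, 0, h - 1)
    # Corner weights, inversely proportional to the spatial distance.
    wa, wb = (x1 - x) * (y1 - y), (x1 - x) * (y - y0)
    wc, wd = (x - x0) * (y1 - y), (x - x0) * (y - y0)
    warped = (wa * img[y0c, x0c] + wb * img[y1c, x0c]
              + wc * img[y0c, x1c] + wd * img[y1c, x1c])
    return warped, valid.astype(img.dtype)

def photometric_l1(i_k, i_k_reconstructed, mask):
    """Masked L1 reconstruction loss of Equation 4 for one view."""
    return np.sum(np.abs(i_k - i_k_reconstructed) * mask) / i_k.size
```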


Smoothness: This loss term ensures the depth maps are smooth, reducing the noise in the depth maps. Sobel gradients are used to calculate the loss instead of the horizontal and vertical gradients used in Wong et al. [57]; Sobel gradients allow the depth to be smooth horizontally, vertically and diagonally. In most cases, regions that have higher reconstruction error are only visible in one view. Applying smoothness regularisation to these unresolvable regions induces a false penalty in the network training. In order to overcome this problem, an adaptive smoothness regularisation weight that varies in space is used for every pixel, as in previous work [57]. Based on Equation 4, the adaptive weight for a pixel is computed as follows:

$$ \alpha_k(p) = \exp\!\left( -\frac{c \cdot L_{k,ph}(p)}{\sigma_k} \right), \qquad \sigma_k = \frac{1}{\frac{1}{N} \sum_{p}^{N} L_{k,ph}(p)} $$

where the index k = 1, 2 corresponds to the view, c is the scale factor that determines the range of α, and σk is the global residual term. α depends on the global residual at each position and tends to be small when the residuals are high. The average of α approaches 1 as the training converges. Based on the adaptive weight, the smoothness regularisation objective for the two views k = 1, 2 is:

$$ L_{smooth} = \frac{1}{N} \sum_{p}^{N} \sum_{k=1,2} L_{k,smooth}(p) \qquad (6) $$

$$ L_{k,smooth}(p) = \alpha_k(p) \left( |\partial_x D_{k,rel}(p)| \, e^{-|\partial_x I_k(p)|} + |\partial_y D_{k,rel}(p)| \, e^{-|\partial_y I_k(p)|} \right) $$

Here, the depth values are enforced to be locally smooth, represented by the x and y Sobel gradients of the relative depth maps.

Geometric consistency: The per-view predicted depth maps may not be consistent with the same 3D geometry. This causes depth discontinuities and outliers on the surfaces and boundaries of objects. In order to address this issue, we learn the real depth maps (D1, D2) for each viewpoint simultaneously with consistency checks. Inspired by Bian et al. [3], we enforce geometric consistency symmetrically by sampling different viewpoints in the same training batch. Based on Equation 3, we interpolate the projected depth maps D′proj,1, D′proj,2 with bilinear sampling in order to approximate D′1(p), D′2(p), which lie on the pixel grid. The geometric consistency loss function for k = 1, 2 is defined as follows:

$$ L_{gc} = \frac{1}{N} \sum_{p}^{N} \big( L_{1,gc}(p) + L_{2,gc}(p) \big), \quad \text{where } L_{k,gc}(p) = \frac{|D_k(p) - D'_k(p)|}{D_k(p) + D'_k(p)} \qquad (7) $$

Here, similar to [3], we use a normalised symmetric loss function to achieve depth consistency between the predicted depth maps for each viewpoint.
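The two regularisation terms, the adaptive edge-aware smoothness of Equation 6 and the normalised geometric consistency of Equation 7, might be written as the following NumPy/SciPy sketch (the exact gradient normalisation and function names are assumptions of this illustration and may differ from the authors' implementation):

```python
import numpy as np
from scipy.ndimage import sobel

def adaptive_smoothness_loss(d_rel, img, alpha):
    """Adaptive edge-aware smoothness of Equation 6 for one view: Sobel
    gradients of the relative depth, attenuated by image gradients and
    weighted by the per-pixel adaptive weight `alpha`."""
    dx_d, dy_d = sobel(d_rel, axis=1), sobel(d_rel, axis=0)
    dx_i, dy_i = sobel(img, axis=1), sobel(img, axis=0)
    term = np.abs(dx_d) * np.exp(-np.abs(dx_i)) + np.abs(dy_d) * np.exp(-np.abs(dy_i))
    return np.mean(alpha * term)

def geometric_consistency_loss(d_k, d_k_projected):
    """Normalised symmetric depth consistency of Equation 7 for one view,
    where `d_k_projected` is the other view's depth map warped and
    interpolated onto this view's pixel grid."""
    return np.mean(np.abs(d_k - d_k_projected) / (d_k + d_k_projected))
```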

4 Experiments

Qualitative and quantitative results are presented on five benchmark data sets against state-of-the-art supervised, unsupervised and self-supervised methods.


Table 1. Data sets for monocular depth estimation

Data set Indoor Outdoor Dynamic Video Depth Diversity Annotation # Images
NYUDv2 [50] X X X Metric Low RGB-D 407K
Make3D [47] X Metric Low Laser 534
KITTI [17] X X X Metric Low Stereo 93K
DIW [7] X X X Ordinal Pair High User clicks 496K
Cityscapes [9] X X X Metric Low Stereo 25K
Megadepth [33] X X No scale Medium SFM 130K
MC [32] X X X X No scale High SFM 115K
TUM [52] X X X Metric Low RGB-D 80K

We demonstrate that the proposed self-supervised loss function using real depth dramatically improves generalisation performance when trained jointly on both moving camera (Mannequin Challenge (MC) [32], mostly indoor) and stereo (KITTI [17], outdoor) data sets. These data sets contain both indoor (1–10 m) and outdoor (1–1000 m) scenes with a wide variation in depth range. We test the same trained model on four benchmark data sets which the network has not seen during training: the KITTI Eigen test split [13] (street scenes), Make3D [47] (outdoor buildings), the NYUDv2 test split [50] (indoor) and the dynamic subset of TUM-RGBD [32] (humans in indoor environments). Key attributes of the data sets used in the experiments are listed in Table 1. Additional qualitative results on a wide variety of in the wild scene images from the DIW [7] data set are presented in the supplementary material, together with a comparative performance evaluation on a diverse range of challenging in the wild videos in the supplementary video.

4.1 Implementation Details

Our model is implemented in Tensorflow [2] and trained using the Adam [26] optimiser for 25 epochs with an input/output resolution of 512×256 and a batch size of 12. Each batch sample consists of different viewpoint images of the same scene. We set the number of viewpoint images to 2, which leads to 12 × 2 = 24 images for each batch. The initial learning rate is set to 10−4 and is decayed with a polynomial scheduler, LR(iteration) = initial_LR · (1 − iteration/max_iteration)^0.9. The weights for the loss terms are empirically determined as λph = 0.15, λssim = 0.85, λsmooth = 0.01/s, where s is the downsampling factor for each scale, and λgc = 0.1; the scale factor for the adaptive regularisation term is chosen as 5, similar to [57]. For data augmentation, we perform horizontal flipping, random scaling and cropping on 50% of the samples, and apply random brightness, contrast and saturation changes on the remaining 50%, with the same range of values as in [19]. Full details of the network implementation are given in the supplementary material.
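The learning-rate schedule and loss weights from this section, written out as a small sketch (constant names are hypothetical):

```python
def poly_lr(iteration, max_iteration, initial_lr=1e-4, power=0.9):
    """Learning-rate schedule of Sec. 4.1:
    LR(iteration) = initial_lr * (1 - iteration / max_iteration) ** power."""
    return initial_lr * (1.0 - iteration / max_iteration) ** power

# Loss weights reported in Sec. 4.1; the smoothness weight is divided by the
# downsampling factor s of each output scale when it is applied.
LOSS_WEIGHTS = {"ph": 0.15, "ssim": 0.85, "gc": 0.1, "smooth": 0.01}
ADAPTIVE_SCALE_FACTOR = 5  # the constant c in the adaptive smoothness weight
```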

4.2 Experimental Setup

We train our model based on the data split of Eigen et al. [14] for KITTI and of Li et al. [32] for the Mannequin Challenge (MC) data set. We perform both individual and mixed training on these data sets, with and without our proposed scale transform approach, in order to evaluate the differences in generalisation capability and compare with previous methods.


The underlying motivation for combining these data sets is threefold: 1) they are both large data sets suitable for self-supervised training; 2) their sizes are comparable, which makes them favourable to mix for joint training; 3) they represent different varieties of appearance: outdoor street scenes in KITTI and humans (mostly indoors) in MC.

We select 23,488 stereo pairs of KITTI for training and the remaining 697 images are used as the test set for evaluating single-view depth estimation. Similar to the train split of [32], we select 2463 scenes of MC for training. Since some of the videos on YouTube were deleted by their owners, we were not able to access all of the video URLs provided by [32], so our training set is slightly smaller. In order to ensure a well-balanced class distribution, we randomly sample 40 viewpoint images from each scene. Images are resampled for scenes that have a lower number of viewpoints.

We quantitatively evaluate the single-view depth estimation of our model following the error metrics of Eigen et al. [14]: mean absolute relative error (Abs Rel), mean squared relative error (Sq Rel), root mean squared error (RMS), and root mean squared log10 error (RMS(log)). Following Zhou et al. [63], we scale our relative single-view depth map predictions to match the median of the ground truth.
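A short NumPy sketch of these metrics with median scaling (the function name is a placeholder, not the evaluation script used in the paper):

```python
import numpy as np

def evaluate_depth(pred, gt):
    """Error metrics of Eigen et al. [14] after median scaling as in
    Zhou et al. [63]; `pred` and `gt` are positive depth arrays of the
    same shape. The log metric uses base-10 logarithms, as stated above."""
    pred = pred * np.median(gt) / np.median(pred)   # match the ground-truth median
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    rms_log = np.sqrt(np.mean((np.log10(pred) - np.log10(gt)) ** 2))
    return {"Abs Rel": abs_rel, "Sq Rel": sq_rel, "RMS": rms, "RMS(log)": rms_log}
```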

4.3 Comparison with State-of-the-art

This section presents quantitative results of models trained with the proposed scale transform method on various data set combinations (KITTI, MC, MC+KITTI). Note: due to the estimation of disparity rather than depth and the assumption of a fixed camera baseline in previous self-supervised estimation methods [18,20,42,16], it is not possible to train them on data sets with different scene scales.

KITTI: We report the results for the KITTI test set based on the Eigen split [14] in Table 2. Our model trained on MC+KITTI shows the best generalisation performance, outperforming all state-of-the-art supervised methods which are not trained on KITTI and which are trained on large-scale diverse data sets, such as Chen et al. [6] and Li et al. [33]. We also demonstrate that the training loss function based on real depth allows generalisation over data sets for indoor and outdoor scenes with different scales, and allows the combination of data sets during training without degrading test performance. Our model trained on MC+KITTI outperforms other state-of-the-art self-supervised/unsupervised methods on KITTI even when they are trained on KITTI.

Make3D: Next, we evaluate on the Make3D data set following the procedure in [19]. In Table 3, our model trained on MC+KITTI demonstrates the best generalisation performance on an unseen data set compared to other methods which are [35,34] and are not [14,7] trained on Make3D. Qualitative comparisons of our method trained on MC+KITTI with the training loss based on real depth demonstrate improved performance in Fig. 4.

NYUDv2: Here, we show that our proposed method also generalises well to indoor scenes. We evaluate performance against the state-of-the-art self-supervised monocular method Monodepth2 [19] on the NYUDv2 test split.


Table 2. Quantitative results on the KITTI data set for different methods trained on various scenes (lower is better).

Method Supervision Training set Abs Rel Sq Rel RMS RMS(log)
Eigen [14] Depth KITTI 0.203 1.548 6.31 0.282
Liu [34] Depth KITTI 0.202 1.614 6.52 0.275
Fu [15] Depth KITTI 0.072 0.307 2.73 0.120
Eigen [14] Depth NYU 0.521 5.016 10.37 0.510
Liu [34] Depth NYU 0.540 5.059 10.10 0.526
Laina [28] Depth NYU 0.515 5.049 10.07 0.527
Liu [34] Depth Make3D 0.362 3.465 8.70 0.447
Laina [28] Depth Make3D 0.339 3.136 8.68 0.422
Chen [7] Depth DIW 0.393 3.260 7.12 0.474
Li [33] Depth Megadepth 0.368 2.587 6.68 0.414
Garg [16] Pose KITTI 0.152 1.226 5.85 0.246
Monodepth [18] Pose KITTI 0.148 1.334 5.93 0.247
DDVO [54] – KITTI 0.151 1.257 5.58 0.228
GeoNet [61] – KITTI 0.149 1.060 5.57 0.226
Struct2depth [4] – KITTI 0.141 1.026 5.29 0.215
Zhou [63] – KITTI 0.208 1.768 6.86 0.283
Zhou [63] – CS 0.267 2.686 7.58 0.334
3Net [42] Pose KITTI 0.129 0.996 5.28 0.223
Monodepth2 [19] (640×192) Pose KITTI 0.109 0.873 4.960 0.209
Ours Pose KITTI 0.109 0.928 4.99 0.199
Ours Pose MC 0.276 2.563 9.17 0.386
Ours Pose MC+KITTI 0.108 0.855 5.15 0.204

Fig. 4. Qualitative results on the Make3D data set. Columns: input, Zhou et al., DDVO, Monodepth2, ours, ground truth.

In Table 3, our model outperforms Monodepth2 by a significant margin and achieves competitive accuracy against supervised methods that are trained on a different split of the same data set. We also provide qualitative comparisons of our model trained on MC+KITTI in Fig. 5, demonstrating improved performance.

TUM: Finally, we present quantitative and qualitative results on the dynamic subset of the TUM-RGBD data set in Table 4 and Fig. 6 respectively. Our model ranks second-best compared to supervised methods and best among self-supervised methods. In order to make a fair comparison with [32], we only include their result trained on a single image without any additional prior knowledge except the ground-truth depth.

4.4 Ablation Study

Here, we evaluate the effect on generalisation performance across scenes with different depth ranges of our self-supervised training using a loss function based on real depth estimates.


Table 3. Quantitative results on the Make3D and NYUDv2 data sets for different methods trained on various scenes (lower is better).

Make3D:
Method Supervision Training set Abs Rel RMS
Xu [59] Depth Make3D 0.184 4.38
Li [30] Depth Make3D 0.278 7.19
Laina [28] Depth Make3D 0.176 4.45
Liu [34] Depth Make3D 0.314 8.60
Liu [35] Depth Make3D 0.335 9.49
Laina [28] Depth NYU 0.669 7.31
Liu [34] Depth NYU 0.669 7.20
Eigen [14] Depth NYU 0.505 6.89
Chen [7] Depth DIW 0.550 7.25
Li [33] Depth Megadepth 0.402 6.23
Monodepth [18] Pose KITTI 0.525 9.88
Monodepth2 [19] Pose KITTI 0.322 7.42
Zhou [63] – KITTI 0.651 8.39
DDVO [54] – KITTI 0.387 8.09
Ours Pose KITTI 0.295 7.10
Ours Pose MC 0.346 7.70
Ours Pose MC+KITTI 0.289 6.92

NYUDv2:
Method Supervision Training set Abs Rel RMS
Xu [59] Depth NYU 0.121 0.586
Li [29] Depth NYU 0.139 0.505
Laina [28] Depth NYU 0.129 0.583
Liu [34] Depth NYU 0.230 0.824
Liu [35] Depth NYU 0.335 1.06
Eigen [14] Depth NYU 0.215 0.907
Eigen [13] Depth NYU 0.158 0.641
Roy [45] Depth NYU 0.187 0.744
Wang [55] Depth NYU 0.220 0.745
Jafari [25] Depth NYU 0.157 0.673
Monodepth2 [19] Pose KITTI 0.342 1.183
Ours Pose KITTI 0.300 1.005
Ours Pose MC 0.201 0.718
Ours Pose MC+KITTI 0.193 0.686

Fig. 5. Qualitative results on the NYU data set. Columns: input, Monodepth2, ours w/o ST, ours with ST, ground truth.

Table 4. Quantitative results on the TUM Dynamic Objects RGBD data set for different methods trained on various scenes (lower is better).

Method Supervision Training set Abs Rel RMS
Xu [43] Depth NYU 0.274 1.085
Laina [28] Depth NYU 0.223 0.947
Chen [7] Depth NYU+DIW 0.262 1.004
DeMoN [53] Depth TUM RGBD+MVS 0.220 0.866
Fu [15] Depth NYU 0.194 0.925
Li [32] (single image) Depth MC 0.204 0.840
Monodepth2 [19] Pose KITTI 0.427 1.616
Ours Pose KITTI 0.388 1.533
Ours Pose MC 0.207 0.973
Ours Pose MC+KITTI 0.201 1.025

For models trained without our proposed method, we omit the scene median depth term, so the scale transform module is modified to D = exp(D_Rel), where D_Rel is the relative depth estimated by our Relative DepthNet network and D is the real depth map used for warping the other viewpoint image in order to compute the training loss. We train our network with four different configurations regarding the training data sets and the use of the proposed scale transform (ST) during training: 1) MC w/o ST, 2) MC+KITTI w/o ST, 3) MC w/ ST and 4) MC+KITTI w/ ST.


Fig. 6. Qualitative results on the TUM data set. Columns: input, Monodepth2, ours w/o ST, ours with ST, ground truth.

We then test each of these four models on the four benchmark data sets, as in Section 4.3. We report numerical ablation results in Table 5 and show qualitative comparisons between models trained on MC+KITTI with and without the scale transform in Fig. 5, Fig. 6 and Fig. 7. Our models trained on MC+KITTI with and without the scale transform achieve similar numerical results on the KITTI test split. This is reasonable since the KITTI data set is collected with stereo cameras that have a fixed baseline. On the other hand, the MC data set consists of diverse scenes with no fixed baseline between the viewpoint images in each scene. For the other test sets, models trained with the proposed loss function significantly outperform the models trained without the scale transform. Moreover, our proposed method also allows training on a combination of different data sets to generalise across scene depth ranges for indoor and outdoor scenes.

Fig. 7. Qualitative results on the Mannequin Challenge data set. Columns: input, ours w/o ST, ours with ST.

Table 5. Results on four test sets with and without the proposed scale transform (ST) method. Lower is better for all error measures.

Test set Error measure MC w/o ST MC ST MC+KITTI w/o ST MC+KITTI ST
KITTI Abs Rel 0.426 0.276 0.116 0.108
KITTI Sq Rel 4.933 2.563 0.864 0.855
Make3D Abs Rel 0.424 0.346 0.347 0.289
Make3D RMS 8.771 7.70 7.64 6.92
NYU Abs Rel 0.282 0.201 0.333 0.193
NYU RMS 0.921 0.718 1.347 0.686
TUM Dynamic Abs Rel 0.252 0.207 0.273 0.201
TUM Dynamic RMS 1.050 0.973 1.133 1.025


5 Conclusion and Future Work

We present a generalised self-supervised monocular depth estimation method (RealMonoDepth) that overcomes the limitation of existing self-supervised methods ([20,63,18,61,16]) that are limited to scenes with a fixed scale and depth range. These methods cannot be trained on moving camera data sets due to the assumption of a fixed baseline and are unable to generalise to unseen data sets with different depth ranges. RealMonoDepth addresses all of these limitations by allowing simultaneous training on a combination of indoor and outdoor scenes with varying depth ranges. This leads to significantly improved generalisation performance across indoor and outdoor scenes and scenes which are unseen during training, and removes the dependence on a fixed camera baseline. The proposed method allows mixing stereo and moving camera data sets (MC+KITTI), improving on state-of-the-art performance in single-view depth estimation across five benchmark data sets, including data sets with varying depth range.

The success of deep networks depends on the use of large data sets. The proposed self-supervised training from sequences captured with a single camera allows us to train the network on diverse, uncontrolled in the wild data sets, such as the Mannequin Challenge (MC) data set used in this work. This opens the door to further generalisation through training across even larger and more diverse scenes. A limitation of the proposed method is that it requires static scenes during training, as with all other single-image depth estimation methods. At test time we estimate depth from a single image and are therefore able to handle dynamic scenes with a wide variety of scale (see supplementary video). An interesting direction for future work is to extend the training of our method to dynamic scenes in order to increase the diversity of the data.

References

1. Aanæs, H., Jensen, R.R., Vogiatzis, G., Tola, E., Dahl, A.B.: Large-scale data for multiple-view stereopsis. International Journal of Computer Vision pp. 1–16 (2016)

2. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283 (2016)

3. Bian, J., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.M., Reid, I.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. In: Advances in Neural Information Processing Systems. pp. 35–45 (2019)

4. Casser, V., Pirk, S., Mahjourian, R., Angelova, A.: Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8001–8008 (2019)

5. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)

6. Chen, P.Y., Liu, A.H., Liu, Y.C., Wang, Y.C.F.: Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2624–2632 (2019)

7. Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. In: Advances in Neural Information Processing Systems. pp. 730–738 (2016)

8. Choi, S., Zhou, Q.Y., Miller, S., Koltun, V.: A large dataset of object scans. arXiv:1602.02481 (2016)

9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)

10. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5828–5839 (2017)

11. Davison, A.J.: Real-time simultaneous localisation and mapping with a single camera. In: Proceedings of the IEEE International Conference on Computer Vision. p. 1403. IEEE (2003)

12. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2758–2766 (2015)

13. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2650–2658 (2015)

14. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems. pp. 2366–2374 (2014)

15. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2002–2011 (2018)

16. Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision. pp. 740–756. Springer (2016)

17. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3354–3361. IEEE (2012)

18. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 270–279 (2017)

19. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3828–3838 (2019)

20. Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. arXiv preprint arXiv:1904.04998 (2019)

21. Guo, X., Li, H., Yi, S., Ren, J., Wang, X.: Learning monocular depth by distilling cross-domain stereo networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 484–500 (2018)

22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)


23. Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: Learning multi-view stereopsis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2821–2830 (2018)

24. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)

25. Jafari, O.H., Groth, O., Kirillov, A., Yang, M.Y., Rother, C.: Analyzing modular cnn architectures for joint depth prediction and semantic segmentation. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). pp. 4620–4627. IEEE (2017)

26. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

27. Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6647–6655 (2017)

28. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV). pp. 239–248. IEEE (2016)

29. Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition 83, 328–339 (2018)

30. Li, B., Shen, C., Dai, Y., Van Den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1119–1127 (2015)

31. Li, R., Wang, S., Long, Z., Gu, D.: Undeepvo: Monocular visual odometry through unsupervised deep learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 7286–7291. IEEE (2018)

32. Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4521–4530 (2019)

33. Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from internet photos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2041–2050 (2018)

34. Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5162–5170 (2015)

35. Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 716–723 (2014)

36. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3431–3440 (2015)

37. Mahjourian, R., Wicke, M., Angelova, A.: Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5667–5675 (2018)

38. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4040–4048 (2016)


39. Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., et al.: Fastslam: A factored solution to the simultaneous localization and mapping problem. AAAI/IAAI pp. 593–598 (2002)

40. Pillai, S., Ambrus, R., Gaidon, A.: Superdepth: Self-supervised, super-resolved monocular depth estimation. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 9250–9256. IEEE (2019)

41. Planche, B., Wu, Z., Ma, K., Sun, S., Kluckner, S., Lehmann, O., Chen, T., Hutter, A., Zakharov, S., Kosch, H., et al.: Depthsynth: Real-time realistic synthetic data generation from cad models for 2.5d recognition. In: 2017 International Conference on 3D Vision (3DV). pp. 1–10. IEEE (2017)

42. Poggi, M., Tosi, F., Mattoccia, S.: Learning monocular depth estimation with unsupervised trinocular assumptions. In: 2018 International Conference on 3D Vision (3DV). pp. 324–333. IEEE (2018)

43. Ricci, E., Ouyang, W., Wang, X., Sebe, N., et al.: Monocular depth estimation using multi-scale continuous crfs as sequential deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(6), 1426–1440 (2018)

44. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

45. Roy, A., Todorovic, S.: Monocular depth estimation using neural regression forest. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5506–5514 (2016)

46. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)

47. Saxena, A., Sun, M., Ng, A.Y.: Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 824–840 (2008)

48. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

49. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The princeton shape benchmark. In: Proceedings Shape Modeling Applications, 2004. pp. 167–178. IEEE (2004)

50. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision. pp. 746–760. Springer (2012)

51. Steinbrucker, F., Sturm, J., Cremers, D.: Real-time visual odometry from dense rgb-d images. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). pp. 719–722. IEEE (2011)

52. Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 573–580. IEEE (2012)

53. Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and motion network for learning monocular stereo. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5038–5047 (2017)

54. Wang, C., Miguel Buenaposada, J., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2022–2030 (2018)


55. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.L.: Towards unified depth and semantic prediction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2800–2809 (2015)

56. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

57. Wong, A., Soatto, S.: Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5644–5653 (2019)

58. Xiang, Y., Mottaghi, R., Savarese, S.: Beyond pascal: A benchmark for 3d object detection in the wild. In: IEEE Winter Conference on Applications of Computer Vision. pp. 75–82. IEEE (2014)

59. Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5354–5362 (2017)

60. Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstructured multi-view stereo. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 767–783 (2018)

61. Yin, Z., Shi, J.: Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1983–1992 (2018)

62. Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., Reid, I.: Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 340–349 (2018)

63. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1851–1858 (2017)


Supplementary Material

1 Network Implementation Details

We use the standard pretrained ResNet50 encoder, resnet_v1_50.ckpt, officially provided by Tensorflow [2]. The decoder weights are initialised randomly and the details of our architecture are shown in Table 1. Our decoder uses skip connections from the encoder [36] (econv4, econv3, econv2, econv1; econv5 is the final encoder output) and estimates multi-resolution depth maps (depth1, depth2, depth3, depth4) in order to exploit both local and global information and resolve higher-resolution details. Code will be released upon publication.

Table 1. Decoder architecture of Relative DepthNet. Upsample() represents a 2× nearest-neighbour resizing operation and + denotes channel-wise concatenation.

Layer Output Size Kernel Size Stride Input BatchNorm Activation
upconv5 16×32×256 3 1 Upsample(econv5) Yes ReLU
iconv5 16×32×256 3 1 upconv5 + econv4 Yes ReLU
upconv4 32×64×128 3 1 Upsample(iconv5) Yes ReLU
iconv4 32×64×128 3 1 Upsample(upconv4) + econv3 Yes ReLU
depth4 32×64×1 3 1 iconv4 No Identity
upconv3 64×128×64 3 1 Upsample(iconv4) Yes ReLU
iconv3 64×128×64 3 1 upconv3 + econv2 + Upsample(depth4) Yes ReLU
depth3 64×128×1 3 1 iconv3 No Identity
upconv2 128×256×32 3 1 Upsample(iconv3) Yes ReLU
iconv2 128×256×32 3 1 upconv2 + econv1 + Upsample(depth3) Yes ReLU
depth2 128×256×1 3 1 iconv2 No Identity
upconv1 256×512×16 3 1 Upsample(iconv2) Yes ReLU
iconv1 256×512×16 3 1 upconv1 + Upsample(depth2) Yes ReLU
depth1 256×512×1 3 1 iconv1 No Identity
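One decoder stage from Table 1 could be expressed with tf.keras roughly as below. This is a sketch only; the released implementation uses the TF1/slim ResNet checkpoint named above and may differ in detail, and the function names are placeholders:

```python
import tensorflow as tf

def decoder_stage(x, skip, filters):
    """One decoder stage of Relative DepthNet as laid out in Table 1:
    2x nearest-neighbour upsampling followed by a 3x3 'upconv' block,
    channel-wise concatenation with the encoder skip connection, and a
    3x3 'iconv' block; both blocks use batch normalisation and ReLU."""
    up = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(x)
    up = tf.keras.layers.Conv2D(filters, 3, padding="same")(up)           # upconv
    up = tf.keras.layers.ReLU()(tf.keras.layers.BatchNormalization()(up))
    cat = tf.keras.layers.Concatenate()([up, skip])                       # + econv skip
    out = tf.keras.layers.Conv2D(filters, 3, padding="same")(cat)         # iconv
    return tf.keras.layers.ReLU()(tf.keras.layers.BatchNormalization()(out))

def depth_head(x):
    """Per-scale depth output: a 3x3 convolution with identity activation."""
    return tf.keras.layers.Conv2D(1, 3, padding="same", activation=None)(x)
```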

2 Additional Results

We provide additional qualitative results in order to showcase the generalisation ability of the proposed model trained on MC+KITTI using our novel loss function.

Supplementary video. We generate depth predictions with our model and compare against the current state-of-the-art Monodepth2 [19] on sample YouTube videos which consist of diverse scenes with dynamic objects and varying depth range. Results are presented in the supplementary video. Note: the video scenes are monocular and were not seen during training. These videos are recorded with standard handheld monocular cameras and do not have ground-truth depth estimates. Each frame was processed independently, i.e. the temporal relation is not used.

Diverse scene images. We also show qualitative results on the test set of the DIW [7] data set in Figs. 1, 2 and 3. These images constitute diversely rich content including indoor, natural and street scenes, consisting of various objects taken from arbitrary camera angles with uncontrolled lighting conditions and scene appearance.


Results demonstrate plausible depth estimation for general scenes, with performance comparable to that of humans or previous depth estimation using supervised learning [7].

Fig. 1. Qualitative results on the DIW test set. (cont.)


Fig. 2. Qualitative results on the DIW test set.


Fig. 3. Qualitative results on the DIW test set.