
Guiding Monocular Depth Estimation Using Depth-Attention Volume

Lam Huynh1 [0000-0002-8311-1288], Phong Nguyen-Ha1 [0000-0002-9678-0886], Jiri Matas2 [0000-0003-0863-4844], Esa Rahtu3 [0000-0001-8767-0864], and Janne Heikkilä1 [0000-0003-0073-0866]

1 Center for Machine Vision and Signal Analysis, University of Oulu, Finland  {lam.huynh,phong.nguyen,janne.heikkila}@oulu.fi

2 Center for Machine Perception, Czech Technical University, Czech Republic  [email protected]

3 Computer Vision Group, Tampere University, Finland  [email protected]

Abstract. Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations. In recent works, those priors have been learned in an end-to-end manner from large datasets by using deep neural networks. In this paper, we propose guiding depth estimation to favor planar structures that are ubiquitous especially in indoor environments. This is achieved by incorporating a non-local coplanarity constraint to the network with a novel attention mechanism called depth-attention volume (DAV). Experiments on two popular indoor datasets, namely NYU-Depth-v2 and ScanNet, show that our method achieves state-of-the-art depth estimation results while using only a fraction of the number of parameters needed by the competing methods. Code is available at: https://github.com/HuynhLam/DAV

Keywords: Monocular depth · Attention mechanism · Depth estimation.

    1 Introduction

Depth estimation is a fundamental problem in computer vision due to its wide variety of applications including 3D modeling, augmented reality and autonomous vehicles. Conventionally it has been tackled by using stereo and structure from motion techniques based on multiple view geometry [11,32]. In recent years, the advances in deep learning have made monocular depth estimation a compelling alternative [2,5,8,10,13,19,20,24,26,27,28,40,44].

In learning-based monocular depth estimation, the basic idea is simply to train a model to predict a depth map for a given input image, and to hope that the model can learn those monocular cues that enable inferring the depth directly from the pixel values. This kind of brute-force approach requires a huge amount of training data and leads to large network architectures.



It has been a common practice to use a deep encoder such as VGG-16 [5], ResNet-50 [19,26,27], ResNet-101 [8], ResNext-101 [40], or SeNet-154 [2,13] followed by some upsampling and fusion strategy, including the up-projection module [19], multi-scale feature fusion [13] or adaptive dense feature fusion [2], that all result in bulky networks with a large number of parameters. Because high computational complexity and memory requirements limit the use of these networks in practical applications, fast monocular depth estimation models such as FastDepth [36] have also been proposed, but their speed increase comes at the price of reduced accuracy. Moreover, despite good results achieved with standard benchmark datasets such as NYU-Depth-v2, it still remains questionable whether these networks are able to generalize well to unseen scenes and poses that are not present in the training data.

Instead of trying to learn all the monocular cues blindly from the data, in this paper we investigate an approach where the learning is guided by exploiting a simple coplanarity constraint for scene points that are located on the same planar surfaces. Coplanarity is an important constraint especially in indoor environments that are composed of several non-parallel planar surfaces such as walls, floor, ceiling, tables, etc. We introduce the concept of a depth-attention volume (DAV) to aggregate spatial information non-locally from those coplanar structures. We use both fronto-parallel and non-fronto-parallel constraints to learn the DAV in an end-to-end manner.

It should be noted that plane approximations have already been used previously in monocular depth estimation, for example in PlaneNet [24], where 3D planes were explicitly segmented and estimated from the images. In contrast to these works, we embed the coplanarity constraint implicitly into the model by using the DAV, which is a building block inspired by non-local neural networks [35]. Unlike the convolutional operation, it operates non-locally and produces a weighted average of the features across the whole image, paying attention to planar structures and favoring depth values that originate from those planes. By using the DAV we not only incorporate an efficient and important geometric constraint into the model, but also enable shrinking the size of the network considerably without sacrificing accuracy. To summarize, our key contributions include:

– A novel attention mechanism called depth-attention volume that captures non-local depth dependencies between coplanar points.

– An end-to-end neural network architecture that implicitly learns to recognize planar structures from the scene and use them as priors in monocular depth estimation.

– State-of-the-art depth estimation results on the NYU-Depth-v2 and ScanNet datasets with a model that uses considerably fewer parameters than previous methods achieving similar performance.

Fig. 1. Visualization of depth-attention maps. The input image with four query points is shown on the left. The corresponding ground-truth and predicted depth maps are in the middle. Because of the coplanarity prior, the depth of the textureless white wall can be accurately recovered. The ground-truth and predicted depth-attention maps for the query points are on the right. Warm colour indicates strong depth prediction ability for the query point.

    2 Related work

Learning-based monocular depth estimation: Saxena et al. [29] presented one of the first studies, using a Markov Random Field (MRF) to predict depth from a single image. Later on, Eigen et al. proposed methods to estimate depth using a multi-scale deep network [6] and a multi-task learning model [5]. Since then, various studies using deep neural networks (DNNs) have been introduced. Laina et al. [19] employed a fully convolutional residual network (FCRN) as the encoder and four up-projection modules as the decoder to upsample the depth map resolution. Fu et al. [8] successfully formulated monocular depth estimation as an ordinal regression problem. Qi et al. [26] proposed a network called GeoNet that investigates the duality between the depth map and surface normals. The DNN from Ren et al. [28] classified input images as indoor or outdoor before estimating the depth values. Lee et al. [20] suggested the idea of using a DNN to estimate the relative depth between pairs of pixels. The method proposed by Jiao et al. [15] incorporated object segmentation into the training to increase depth estimation accuracy. Hu et al. [13] introduced an architecture that includes an encoder, a decoder, multi-scale feature fusion (MFF), and a new loss term for preserving edge structures. Inspired by [13], Chen et al. [2] used adaptive dense feature fusion (ADFF) and a residual pyramid decoder in their network. The study by Facil et al. [7] proposed a DNN that aims to learn calibration-aware patterns to improve the generalization capabilities of monocular depth prediction. Recently, Ramamonjisoa et al. [27] presented SharpNet, which exploits occluding contours as an additional driving factor to optimize the depth estimation model besides the depth and the surface normals.

Plane-based approaches: Liu et al. [24] presented the first study to consider using the planar constraint to predict depth maps from single images. Later, the same authors published an incremental study to refine the quality of plane segmentation [23]. Yin et al. [40] formed a geometric constraint called virtual normal to predict the depth map as well as a point cloud and surface normals. Note that the methods by Liu et al. focused explicitly on estimating a set of plane parameters and planar segmentation masks, while Yin et al. calculated a large virtual plane to train a DNN that is robust to noise in the ground-truth depth.

Fig. 2. Depth-attention volume (DAV) is a collection of depth-attention maps (Eq. 3, Figure 1) obtained using each image location as a query point at a time. Therefore, the DAV for an image of size 8H × 8W is a 4D tensor of size H × W × H × W.

Attention mechanism: Attention was initially used in machine translation, and it was brought to computer vision by Xu et al. [39]. Since then, the attention mechanism has evolved and branched into channel-wise attention [12,33], spatial-wise attention [1,35] and mixed attention [34] in order to tackle object detection and image classification problems. Some recent monocular depth estimation studies also followed this line of work. Xu et al. [38] proposed multi-scale spatial-wise attention to guide a conditional random field (CRF) model. Li et al. [22] proposed a discriminative depth estimation model using channel-wise attention. Kong et al. [18] embedded a discrete binary mask, namely the pixel-wise attentional gating unit, into a residual block to modulate the learned features.

In this paper, we propose using the depth-attention volume (DAV) to encode non-local geometric dependencies. It can be seen as an attention mechanism that guides depth estimation to favor depth values originating from planar surfaces that are ubiquitous in man-made scenes. In contrast to previous plane-based approaches, we do not train the network to segment the planes explicitly, but instead we let the network learn the coplanarity constraint implicitly.

    3 Proposed Method

This section describes the proposed depth estimation method. The first subsection defines the depth-attention volume and the following two subsections outline the network architecture and the loss functions. Further details are provided in the supplementary material.

    3.1 Depth-attention volume

Given two image points P0 = (x0, y0) and P1 = (x1, y1) with corresponding depth values d0 and d1, we define the depth-attention A(P0, P1) as the ability of P1 to predict the depth of P0. This ability is quantified as a confidence in the range [0, 1] so that 0 means no ability and 1 represents maximum certainty of being a good predictor.

To estimate A we make the assumption that the scene contains multiple non-parallel planes, which is common particularly in indoor environments. The depth values of all points belonging to the same plane are linearly dependent. Hence, they are good depth predictors of each other. To exploit this property, we detect N prominent planes from the training images and parameterize each plane with S = (nx, ny, nd, c), where (nx, ny, nd) is the plane normal and c is the orthogonal distance from the origin. We construct the first-order depth-attention volumes for all N planes:

Ai(P0, P1) = 1 − σ(|Si · X0| + |Si · X1|),   i = 1, . . . , N     (1)

where σ is the sigmoid function, X0 = (x0, y0, d0, 1) and X1 = (x1, y1, d1, 1). These volumes are represented as 4D tensors of size H × W × H × W, where H and W are the vertical and horizontal sizes, respectively. In practice, one needs to subsample the volumes to keep the memory requirements reasonable. In all our experiments, we used a subsampling factor of 8.

In addition, we assume that all points located on the same fronto-parallel plane are good depth predictors of each other, because they share the same depth value. We use the ground-truth depths, and create a zero-order depth-attention volume (DAV) for every training image:

A0(P0, P1) = 1 − σ(|d0 − d1|).     (2)

Finally, we combine these volumes by taking the maximum attention value of all volumes:

AD(P0, P1) = max(Ai(P0, P1)),   i = 0, . . . , N     (3)

It is easy to observe that the DAV is a symmetric function, i.e. AD(P0, P1) = AD(P1, P0).

If we consider P0 to be a query point in the image, as illustrated in Figure 1 (left), we can visualize the DAV as a two-dimensional attention map, shown in Figure 1 (right). Figure 2 provides an example of a depth-attention volume generated from the ground-truth depth map.
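To make the construction above concrete, the following NumPy sketch assembles a ground-truth DAV from Equations (1)-(3) for an already subsampled depth map, with points expressed directly in (x, y, d, 1) coordinates as in Eq. (1). The function and variable names are ours, and the released code may organize this differently (the plane fitting itself is described in Subsection 4.2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ground_truth_dav(depth, planes):
    """Build the ground-truth depth-attention volume of Eqs. (1)-(3).

    depth  : (H, W) subsampled ground-truth depth map.
    planes : iterable of plane parameters S = (nx, ny, nd, c).
    Returns a tensor of shape (H, W, H, W) with values in [0, 1].
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous points X = (x, y, d, 1), flattened to shape (H*W, 4).
    X = np.stack([xs.ravel(), ys.ravel(), depth.ravel(), np.ones(H * W)], axis=1)

    # Zero-order volume (Eq. 2): fronto-parallel, depends only on |d0 - d1|.
    d = depth.ravel()
    dav = 1.0 - sigmoid(np.abs(d[:, None] - d[None, :]))

    # First-order volumes (Eq. 1), one per detected plane, fused by Eq. (3).
    for S in planes:
        r = np.abs(X @ np.asarray(S, dtype=float))   # |S · X| for every point
        dav = np.maximum(dav, 1.0 - sigmoid(r[:, None] + r[None, :]))

    return dav.reshape(H, W, H, W)
```

The element-wise maximum in the loop realizes Eq. (3), so a pair of points is treated as mutually predictive as soon as any single plane, or equal depth, supports it.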

    3.2 Network Architecture

Figure 3 gives an overview of our model, which includes three main modules: an encoder, a non-local depth-attention module, and a decoder.

We opt to use a simplified dilated residual network (DRN) with 22 layers (DRN-D-22) [41,42] as our encoder, which extracts high-resolution features and downsamples the input image only by a factor of 8. The DRN-D-22 is a variation of DRN that completely removes max-pooling layers and smoothly distributes the dilation to minimize gridding artifacts. This is crucial to our network, because to make training feasible, the non-local depth-attention module needs to operate on a subsampled feature space. However, to capture meaningful spatial relationships this feature space also needs to be large enough.

The decoder part of our network contains a straightforward upscaling scheme that increases the spatial dimension from 29 × 38 to 57 × 76 and then to 114 × 152. Upsampling consists of two bilinear interpolation layers followed by convolutional layers with a kernel size of 3 × 3. Two convolutional layers with a kernel size of 5 × 5 are then used to estimate the final depth map.
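As a rough illustration, the decoder described above could be written in PyTorch as follows; the channel widths, the ReLU activations, and the use of fixed output sizes are our assumptions rather than details taken from the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Two (bilinear upsample + 3x3 conv) stages, followed by two 5x5 convs."""

    def __init__(self, in_ch=512, mid_ch=128):   # channel widths are assumptions
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        self.head = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, kernel_size=5, padding=2),
        )

    def forward(self, x):
        # Explicit target sizes from the text: 29x38 -> 57x76 -> 114x152.
        x = F.relu(self.conv1(F.interpolate(x, size=(57, 76), mode="bilinear", align_corners=False)))
        x = F.relu(self.conv2(F.interpolate(x, size=(114, 152), mode="bilinear", align_corners=False)))
        return self.head(x)
```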

The non-local depth-attention module is located between the encoder and the decoder. It maps the input features X to output features Y of the same size. The primary purpose of the module is to add the non-local information embedded in the depth-attention volume (DAV) to Y, but it is also used to predict and learn the DAV based on the ground-truth data. The structure of the module is presented in Figure 4.

We implement the DAV-predictor by first transforming X into green and blue embeddings using 1 × 1 convolutions. We exploit the symmetry of the DAV and maximize the correlation between these two spaces by applying cross-denormalization on both the green and blue embeddings. Cross-denormalization is a conditional normalization technique [4] that is used to learn an affine transformation from the data. Specifically, the green embedding is first normalized to zero mean and unit standard deviation using batch normalization (BN). Then, the blue embedding is convolved to create two tensors that are multiplied with and added to the normalized features from the green branch, and vice versa. The denormalized representations are then activated with ReLUs and transformed by another 1 × 1 convolution before being multiplied with each other. Finally, the DAV is activated using the sigmoid function to ensure that the output values are in the range [0, 1]. We empirically verified that applying cross-modulation in two embedding spaces is superior to using a single embedding with double the number of features.

Furthermore, X is fed into the orange branch and multiplied with the estimated DAV to amplify the effect of the input features. Finally, we add a residual connection (red) to prevent the vanishing gradient problem when training our network.
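The data flow of the module can be condensed into the following PyTorch sketch. For brevity, the cross-denormalization between the two embeddings is replaced here by plain batch normalization, and the channel widths, branch names and the returned tuple are our assumptions; the text above and Figure 4 describe the actual design.

```python
import torch
import torch.nn as nn

class NonLocalDepthAttention(nn.Module):
    """Condensed sketch of the non-local depth-attention module (Figure 4)."""

    def __init__(self, in_ch=512, emb_ch=128):            # widths are assumptions
        super().__init__()
        def embed():                                       # "green" / "blue" branches
            return nn.Sequential(nn.Conv2d(in_ch, emb_ch, 1), nn.BatchNorm2d(emb_ch),
                                 nn.ReLU(inplace=True), nn.Conv2d(emb_ch, emb_ch, 1))
        self.green, self.blue = embed(), embed()
        self.value = nn.Conv2d(in_ch, emb_ch, 1)           # "orange" branch
        self.out = nn.Conv2d(emb_ch, in_ch, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        g = self.green(x).flatten(2)                       # (B, C1, HW)
        b = self.blue(x).flatten(2)                        # (B, C1, HW)
        dav = torch.sigmoid(torch.bmm(g.transpose(1, 2), b))   # (B, HW, HW) in [0, 1]
        v = self.value(x).flatten(2)                       # (B, C1, HW)
        y = torch.bmm(v, dav)                              # non-local feature aggregation
        y = self.out(y.view(B, -1, H, W))
        return x + y, dav                                  # residual connection; DAV is supervised
```

Returning the predicted DAV alongside the output features lets it be supervised by the attention loss described in Subsection 3.3.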

Fig. 3. The pipeline of our proposed network. An image is passed through the encoder, then the non-local depth-attention module, and finally the decoder to produce the estimated depth map. The model is trained using the Lattention and Ldepth losses, which are described in Subsection 3.3.


Fig. 4. Structure of the non-local depth-attention module. “⊙” denotes element-wise multiplication, “⊕” denotes element-wise summation, and “⊗” is the outer product.

    3.3 Loss Function

As illustrated in Figure 3, our loss function consists of two main components: an attention loss and a depth loss.

Attention loss: The primary goal of this term is to minimize the error between the estimated DAV (the output of the DAV-predictor in Figure 4) and the ground-truth DAV. The Lmae term is defined as the mean absolute error between the predicted and the ground-truth depth-attention values:

Lmae = 1/(HW)² ∑_i ∑_j |Âi,j − Ai,j|     (4)

where Âi,j ≡ ÂD(Pi, Pj) and Ai,j ≡ AD(Pi, Pj) are the predicted and ground-truth depth-attention volumes.

In addition, we minimize the angle between the predicted and the ground-truth depth-attention maps for all query positions i and j:

Lang = 1/(HW) ∑_i |1 − (∑_j Âi,j Ai,j) / √(∑_j Â²i,j ∑_j A²i,j)| + ∑_j |1 − (∑_i Âi,j Ai,j) / √(∑_i Â²i,j ∑_i A²i,j)|     (5)

The full attention loss is defined by

Lattention = Lmae + λ Lang     (6)

where λ ∈ R+ is a loss weighting coefficient.
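A possible PyTorch realization of this attention loss is sketched below, assuming the predicted and ground-truth DAVs are kept as dense tensors of shape (B, HW, HW); averaging both angular terms over query positions is our reading of Eq. (5), and the function name is ours.

```python
import torch

def attention_loss(dav_pred, dav_gt, lam=1.0, eps=1e-8):
    """Attention loss of Eq. (6): L_mae (Eq. 4) plus lambda * L_ang (Eq. 5)."""
    # Eq. (4): mean absolute error over all (i, j) pairs of positions.
    l_mae = (dav_pred - dav_gt).abs().mean(dim=(1, 2))

    def ang_term(dim):
        # |1 - cosine similarity| between predicted and ground-truth attention
        # maps, computed along rows (dim=2) or columns (dim=1) of the DAV.
        num = (dav_pred * dav_gt).sum(dim)
        den = torch.sqrt((dav_pred ** 2).sum(dim) * (dav_gt ** 2).sum(dim)) + eps
        return (1.0 - num / den).abs().mean(dim=1)

    l_ang = ang_term(dim=2) + ang_term(dim=1)          # Eq. (5)
    return (l_mae + lam * l_ang).mean()
```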

Depth loss: Moreover, we define the depth loss as a combination of three terms, Llog, Lgrad and Lnorm, that were originally introduced in [13]. The Llog loss is a variation of the L1 norm that is calculated in logarithmic space and defined as

Llog = 1/M ∑_{i=1}^{M} F(|d̂i − di|)     (7)


where M is the number of valid depth values, di is the ground-truth depth, d̂i is the predicted depth, and F(x) = log(x + α) with α set to 0.5 in our experiments.

Another loss term is Lgrad, which is used to penalize sudden changes of edge structures in both the x and y directions. It is defined by

Lgrad = 1/M ∑_{i=1}^{M} [ F(∆x(|d̂i − di|)) + F(∆y(|d̂i − di|)) ]     (8)

where ∆x and ∆y are the gradients of the error with respect to x and y. Finally, we use Lnorm to emphasize small details by minimizing the angle between the ground-truth (ni) and the predicted (n̂i) surface normals:

Lnorm = 1/M ∑_{i=1}^{M} |1 − n̂i · ni|     (9)

where the surface normals are estimated as n ≡ (−∇x(d), −∇y(d), 1) using a Sobel filter, as in [13]. The depth loss is then defined by

    Ldepth = Llog + µLgrad + θLnorm (10)

where µ, θ ∈ R+ are loss weighting coefficients. Our full loss is

    L = Lattention + γLdepth (11)

where γ ∈ R+ is a loss weighting coefficient. Subsection 4.2 describes in detail how the network is trained using these loss functions.
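For illustration, the depth loss could be implemented along the following lines, under the assumption that invalid ground-truth pixels have already been masked out and that the absolute value is taken inside F(·) for the gradient terms so that the logarithm stays defined; the Sobel-based normals follow the description above.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred, gt, mu=1.0, theta=1.0, alpha=0.5, eps=1e-8):
    """Depth loss of Eq. (10): L_log + mu * L_grad + theta * L_norm (Eqs. 7-9).

    pred, gt: (B, 1, H, W) depth maps containing valid ground-truth values only.
    """
    # Sobel kernels approximating the x / y derivatives, as in [13].
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    grad = lambda img, k: F.conv2d(img, k.to(img), padding=1)
    f = lambda x: torch.log(x + alpha)

    err = (pred - gt).abs()
    l_log = f(err).mean()                                              # Eq. (7)
    l_grad = (f(grad(err, kx).abs()) + f(grad(err, ky).abs())).mean()  # Eq. (8)

    # Eq. (9): normals n = (-grad_x(d), -grad_y(d), 1), compared via |1 - cos|.
    def normals(d):
        n = torch.cat([-grad(d, kx), -grad(d, ky), torch.ones_like(d)], dim=1)
        return n / (n.norm(dim=1, keepdim=True) + eps)
    l_norm = (1.0 - (normals(pred) * normals(gt)).sum(dim=1)).abs().mean()

    return l_log + mu * l_grad + theta * l_norm

# Full objective of Eq. (11): L = attention_loss(...) + gamma * depth_loss(...).
```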

    4 Experiments

In this section, we evaluate the performance of the proposed method by comparing it against several baselines. We start by introducing the datasets, evaluation metrics, and implementation details. The last three subsections contain the comparison to the state-of-the-art, ablation studies, and a cross-dataset evaluation. Further results are available in the supplementary material.

    4.1 Datasets and evaluation metrics

Datasets: We assess the proposed method using the NYU-Depth-v2 [30] and ScanNet [3] datasets. NYU-Depth-v2 contains ∼120K RGB-D images obtained from 464 indoor scenes. From the entire dataset, we use 50K images for training and the official test set of 654 images for evaluation. The ScanNet dataset comprises 2.5 million RGB-D images acquired from 1517 scenes. For this dataset, we use the training subset of ∼20K images provided by the Robust Vision Challenge 2018 [9] (ROB). Unfortunately, the ROB test set is not available, so we report the results on the ScanNet official test set of 5310 images instead. SUN-RGBD is yet another indoor dataset, consisting of ∼10K images collected with four different sensors. We do not use it for training, but only for cross-evaluating the pre-trained models on its test set of 5050 images.


Evaluation metrics: The performance is assessed using the standard metrics provided for each dataset. That is, for NYU-Depth-v2 [30] we calculate the mean absolute relative error (REL), root mean square error (RMS), and thresholded accuracy (δi). For the ScanNet and SUN-RGBD datasets, we provide the mean absolute relative error (REL), mean square relative error (sqREL), scale-invariant mean squared error (SI), mean absolute error (iMAE), and root mean square error (iRMSE) of the inverse depth values. For the iBims-1 benchmark [17], we compute five similar metrics as for NYU-Depth-v2 plus the root mean square error in log-space (log10), planarity errors (εplan, εorie), depth boundary errors (εacc, εcomp), and directed depth errors (ε0, ε−, ε+). Detailed definitions of the metrics are provided in the supplementary material.

    4.2 Implementation Details

The proposed model is implemented with the PyTorch [25] framework, and trained using a single Tesla V100 GPU, a batch size of 32 images, and the Adam optimizer [16] with (β1, β2, ε) = (0.9, 0.999, 10⁻⁸). The training process is split into three parts. During the first phase, we replace the DAV-predictor (Figure 4) with the DAVs computed from the ground-truth depth maps. We train the model for 200 epochs using only the depth loss (Eq. 10) and a learning rate of 10⁻⁴. In the second phase, we add the DAV-predictor to the model, freeze the weights of the other parts of the model, and train for 200 epochs with a learning rate of 7.0 × 10⁻⁵. In the last phase, we train the entire model for 300 epochs, using the learning rate of 7.0 × 10⁻⁵ for the first 100 epochs and then reducing it at a rate of 5% per 25 epochs. The last two stages employ the full loss function defined in Equation (11). We set all the weighting coefficients λ, µ, θ, and γ to 1.

We augment the training data using random scaling ([0.875, 1.25]), rotation ([−5.0, +5.0] degrees), horizontal flips, rectangular window dropping, and colorization. The planes required for training are obtained by fitting a parametric model to the back-projected 3D point cloud using RANSAC with an inlier threshold of 1 cm. We select at most the N best planes in terms of inlier count, using a maximum of 100 iterations. Furthermore, we keep only planes that cover more than 7% of the image area.
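The plane extraction could look roughly like the following greedy multi-plane RANSAC sketch. The sampling strategy, the exact parameterization (the fit here is done on the back-projected 3D points, while Eq. (1) uses image-depth coordinates), and the handling of the remaining points are our assumptions, not details of the authors' pipeline.

```python
import numpy as np

def fit_planes_ransac(points, n_planes=5, iters=100, thresh=0.01, min_frac=0.07, seed=0):
    """Greedily fit up to n_planes planes to a back-projected point cloud.

    points : (N, 3) array of 3D points from the ground-truth depth map.
    Returns plane parameters (nx, ny, nz, c) whose inliers (distance < thresh,
    i.e. 1 cm) cover at least min_frac (7%) of the points.
    """
    rng = np.random.default_rng(seed)
    remaining, total, planes = points.copy(), len(points), []
    for _ in range(n_planes):
        best_inl, best_plane = None, None
        for _ in range(iters):
            sample = remaining[rng.choice(len(remaining), 3, replace=False)]
            n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
            if np.linalg.norm(n) < 1e-9:
                continue                              # degenerate (collinear) sample
            n = n / np.linalg.norm(n)
            c = -n @ sample[0]
            inl = np.abs(remaining @ n + c) < thresh  # point-to-plane distance test
            if best_inl is None or inl.sum() > best_inl.sum():
                best_inl, best_plane = inl, (*n, c)
        if best_inl is None or best_inl.sum() < min_frac * total:
            break                                     # plane covers too little of the image
        planes.append(best_plane)
        remaining = remaining[~best_inl]              # fit the next plane on the rest
        if len(remaining) < 3:
            break
    return planes
```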

    4.3 Comparison with the state-of-the-art

    In this section, we compare the proposed approach with the current state-of-the-art monocular depth estimation methods.

NYU-Depth-v2: Table 1 contains the performance metrics on the official NYU-Depth-v2 test set for our method and for [2,5,8,10,13,19,20,24,26,27,28,40,44]. In addition, the table shows the number of model parameters for each method. The performance figures for the baselines are obtained using the pre-trained models provided by the authors [2,8,13,24,27,40] or from the original papers if the model was not available [5,10,19,20,26,28,44]. Methods indicated with * and ‡ are trained using the entire training set of 120K images or with external data, respectively.

Table 1. Evaluation results on the NYU-Depth-v2 dataset. Metrics with ↓ mean lower is better and ↑ mean higher is better. Timing is the average over 1000 images using a NVIDIA GTX-1080 GPU, in frames-per-second (FPS).

Methods            #params   Memory     FPS   REL↓    RMS↓    δ1↑     δ2↑     δ3↑
Eigen’15 [5]*      141.1M    -          -     0.215   0.907   0.611   0.887   0.971
Laina’16 [19]*     63.4M     -          -     0.127   0.573   0.811   0.953   0.988
Liu’18 [24]‡       47.5M     124.6MB    93    0.142   0.514   0.812   0.957   0.989
Fu’18 [8]*         110.0M    489.1MB    42    0.115   0.509   0.828   0.965   0.992
Qi’18 [26]         67.2M     -          -     0.128   0.569   0.834   0.960   0.990
Hao’18 [10]        60.0M     -          -     0.127   0.555   0.841   0.966   0.991
Lee’19 [20]        118.6M    -          -     0.131   0.538   0.837   0.971   0.994
Ren’19 [28]*       49.8M     -          -     0.113   0.501   0.833   0.968   0.993
Zhang’19 [44]      95.4M     -          -     0.121   0.497   0.846   0.968   0.994
Ramam.’19 [27]‡    80.4M     336.6MB    47    0.139   0.502   0.836   0.966   0.993
Hu’19 [13]         157.0M    679.7MB    15    0.115   0.530   0.866   0.975   0.993
Chen’19 [2]        210.3M    1250.9MB   12    0.111   0.514   0.878   0.977   0.994
Yin’19 [40]        114.2M    437.6MB    37    0.108   0.416   0.875   0.976   0.994
Ours               25.1M     96.1MB     218   0.108   0.412   0.882   0.980   0.996

For instance, Ramamonjisoa et al. [27] trained their method using the synthetic dataset PBRS [43] before fine-tuning on NYU-Depth-v2. The best performance is achieved by the proposed model, which also contains the smallest number of parameters. The best performing baselines, Yin et al. [40], Hu et al. [13], and Chen et al. [2], have 4.5, 6.2, and 8.3 times more parameters than ours, respectively. Figure 5 provides an additional illustration of the model parameters with respect to the performance.

Figure 6 shows qualitative examples of the obtained depth maps. In this case, the maps for the baseline methods are produced using the pre-trained models provided by the authors. The method by Eigen and Fergus [5] performs well on uniform regions, but has difficulties with detailed structures. Laina et al. [19] produces overly smoothed depth maps, losing many small details.

Fig. 5. Analyzing the accuracy δ1 (%) and mean absolute relative error (%) with respect to the number of parameters (millions) for recent monocular depth estimation methods on NYU-Depth-v2. The left plot presents the thresholded accuracy, where higher values are better, while the right plot shows the absolute relative error, where lower values are better.


Fig. 6. Qualitative results on the official NYU-Depth-v2 [30] test set for different methods (Eigen’15 [5], Laina’16 [19], Fu’18 [8], Ramam.’19 [27], Hu’19 [13], Chen’19 [2], Yin’19 [40], and ours). The color indicates the distance, where red is far and blue is close. Our estimated depth maps are closer to the ground truth than those of the state-of-the-art methods.

In contrast, Fu et al. [8] returns many details, but at the expense of discontinuities inside objects and smooth areas. The depth images by Ramamonjisoa et al. [27] contain noise and are prone to missing fine details. Yin et al. [40], Hu et al. [13], and Chen et al. [2] provide the best results among the baselines. However, they have difficulties, e.g., on the third example from the left (near the desk and table) and the fourth one (wall area). We provide further qualitative examples in the supplementary material.

ScanNet: Table 2 contains the performance figures on the official ScanNet test set for our method, Ren et al. [28] (taken from the original paper), Hu et al. [13] and Chen et al. [2]. We use the public code from [2,13] to train their models. Unfortunately, the other baselines do not provide results for the ScanNet official test set.

Table 2. Evaluation results on ScanNet [3].

Architecture       #params   REL     sqREL   SI      iMAE    iRMSE   Test set
CSWS_E_ROB [21]    65.8M     0.150   0.060   0.020   0.100   0.130   ROB
DORN_ROB [8]       110.0M    0.140   0.060   0.020   0.100   0.130   ROB
DABC_ROB [22]      56.6M     0.140   0.060   0.020   0.100   0.130   ROB
Hu’19 [13]         157.0M    0.139   0.081   0.016   0.100   0.105   Official
Chen’19 [2]        210.3M    0.134   0.077   0.015   0.093   0.100   Official
Ren’19 [28]        49.8M     0.138   0.057   -       -       -       Official
Ours               25.1M     0.118   0.057   0.015   0.089   0.097   Official

Moreover, the test set used in the Robust Vision Challenge (ROB) is not available at the moment, so we are unable to report our performance on it. Nevertheless, we have included the best methods from the ROB challenge in Table 2 to provide an indicative comparison. Note that all methods are trained with the same ROB training split. The proposed model outperforms [28] by a clear margin in terms of REL. The results are also substantially better compared to the ROB challenge methods, although the comparison is not strictly fair due to the different test splits. Figure 7 provides a qualitative comparison between our method and [2,13,22], using the sample images provided in [22]. The geometric structures and details are clearly better extracted by our method.

Fig. 7. Predicted depth maps from our model and the baselines (Hu’19 [13], DABC_ROB [22], Chen’19 [2]) on the official ScanNet [3] test set.

Table 3. The iBims-1 benchmark.

Method           REL↓   log10↓  RMS↓   δ1↑    δ2↑    δ3↑    εplan↓  εorie↓  εacc↓  εcomp↓  ε0↑     ε−↓    ε+↓
Liu’18 [24]      0.29   0.17    1.45   0.41   0.70   0.86   7.26    17.24   4.84   8.86    71.24   28.36  0.40
Ramam.’19 [27]   0.26   0.11    1.07   0.59   0.84   0.94   9.95    25.67   3.52   7.61    84.03   9.48   6.49
Ours             0.24   0.10    1.06   0.59   0.84   0.94   7.21    18.45   3.46   7.43    84.36   6.84   6.27

Planarity error analysis: We also evaluated our method on the iBims-1 benchmark [17] and compared it with two recent works [24,27]. The results, shown in Table 3, indicate that we outperform the baselines in most of the metrics, including the plane-related ones. Extensive planarity analysis is provided in the supplementary material.

    4.4 Ablation studies

Firstly, we assess how the number of prominent planes, used to estimate the ground-truth DAVs in the training phase, affects the performance (see Sec. 3.1). To this end, we train our model using the fronto-parallel planes (see Eq. 2) plus three, five, and seven non-fronto-parallel planes (N in Eq. 1). The corresponding results for the NYU-Depth-v2 test set are provided in Table 4. One can observe that the results improve when increasing the number of planes up to five and degrade after that. A possible explanation for this could be that the images used in the experiments do not typically contain more than five significant planes that can predict the depth values reliably. We also re-trained our model without the non-local depth-attention (DAV) module (and without any planes), and the performance degraded substantially, as shown in Table 4.

Secondly, we study the impact of the attention loss term (Eq. 6). For this purpose, we first train our model with and without the attention loss, and then continue training by dropping the attention loss after convergence. We report the results in Table 5. The model without the attention loss has clearly inferior performance, indicating the importance of this loss term. Furthermore, continuing training after dropping the attention loss also degrades the performance.

Table 4. Performance of our model using different types of depth-attention volume.

DAV-types         REL↓    RMS↓    δ1↑     δ2↑     δ3↑
w/o DAV-module    0.140   0.577   0.827   0.960   0.989
||-Plane-DAV      0.116   0.442   0.867   0.976   0.995
3-Plane-DAV       0.110   0.421   0.879   0.978   0.995
5-Plane-DAV       0.108   0.412   0.882   0.980   0.996
7-Plane-DAV       0.111   0.447   0.851   0.970   0.993

Table 5. Ablation studies of models without and with the attention loss on NYU-Depth-v2. This shows the importance of the DAV in guiding the monocular depth model.

Training                   REL↓    RMS↓    δ1↑     δ2↑     δ3↑
w/o Lattention             0.126   0.540   0.841   0.967   0.992
w/ full loss               0.108   0.412   0.882   0.980   0.996
continue w/o Lattention    0.109   0.415   0.882   0.979   0.995

Table 6. Cross-dataset evaluation with training on NYU-Depth-v2 and testing on SUN-RGBD.

Models            #params   REL     sqREL   SI      iMAE    iRMSE
w/o DAV-module    17.5M     0.254   0.416   0.035   0.111   0.091
Hu’19 [13]        157.0M    0.245   0.389   0.031   0.108   0.087
Chen’19 [2]       210.3M    0.243   0.393   0.031   0.102   0.069
Ours              25.1M     0.238   0.387   0.030   0.104   0.075

Fig. 8. Direct results on the SUN RGB-D dataset [31] without fine-tuning. Some regions in the white boxes show missing or incorrect depth values from the ground-truth data.

    4.5 Cross-dataset evaluation

To assess the generalisation properties of the model, we perform a cross-dataset evaluation, where we train the network using NYU-Depth-v2 and test on SUN-RGBD [14,31,37] without any fine-tuning. We also evaluate the baseline methods from [2,13] and report the results in Table 6. As can be seen, our model performs favourably compared to the other methods. Figure 8 contains a few examples of the results on the SUN-RGBD dataset. One can observe that our model is able to estimate the geometric structures and details of the scene well, despite the differences in data distribution between the training and testing sets. Moreover, we evaluated our model without the DAV-module in the same cross-dataset setup. The results, shown in Table 6, clearly demonstrate that the DAV-module improves the generalization.

    5 Conclusions

This paper proposed a novel monocular depth estimation method that incorporates a non-local coplanarity constraint with a novel attention mechanism called depth-attention volume (DAV). The proposed attention mechanism encourages depth estimation to favor planar structures, which are common especially in indoor environments. The DAV enables more efficient learning of the necessary priors, which results in a considerable reduction in the number of model parameters. The performance of the proposed solution is state-of-the-art on two popular benchmark datasets while using 2-8 times fewer parameters than competing methods. Finally, the generalisation ability of the method was further demonstrated in cross-dataset experiments.


    References

1. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3286–3295 (2019)

2. Chen, X., Chen, X., Zha, Z.J.: Structure-aware residual pyramid network for monocular depth estimation. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. pp. 694–700. AAAI Press (2019)

3. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5828–5839 (2017)

4. Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. In: International Conference on Learning Representations (ICLR) (2017)

5. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2650–2658 (2015)

6. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems. pp. 2366–2374 (2014)

7. Facil, J.M., Ummenhofer, B., Zhou, H., Montesano, L., Brox, T., Civera, J.: Cam-convs: Camera-aware multi-scale convolutions for single-view depth. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11826–11835 (2019)

8. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2002–2011 (2018)

9. Geiger, A., Nießner, M., Dai, A.: Robust Vision Challenge. CVPR Workshop (2018)

10. Hao, Z., Li, Y., You, S., Lu, F.: Detail preserving depth estimation from a single image using attention guided networks. In: 2018 International Conference on 3D Vision (3DV). pp. 304–313. IEEE (2018)

11. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press (2003)

12. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)

13. Hu, J., Ozay, M., Zhang, Y., Okatani, T.: Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2019)

14. Janoch, A., Karayev, S., Jia, Y., Barron, J.T., Fritz, M., Saenko, K., Darrell, T.: A category-level 3d object dataset: Putting the kinect to work. In: Consumer Depth Cameras for Computer Vision, pp. 141–165. Springer (2013)

15. Jiao, J., Cao, Y., Song, Y., Lau, R.: Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 53–69 (2018)

16. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

17. Koch, T., Liebel, L., Fraundorfer, F., Körner, M.: Evaluation of cnn-based single-image depth estimation methods. In: Leal-Taixé, L., Roth, S. (eds.) European Conference on Computer Vision Workshop (ECCV-WS). pp. 331–348. Springer International Publishing (2018)

18. Kong, S., Fowlkes, C.: Pixel-wise attentional gating for scene parsing. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1024–1033. IEEE (2019)

19. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV). pp. 239–248. IEEE (2016)

20. Lee, J.H., Kim, C.S.: Monocular depth estimation using relative depth maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9729–9738 (2019)

21. Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition 83, 328–339 (2018)

22. Li, R., Xian, K., Shen, C., Cao, Z., Lu, H., Hang, L.: Deep attention-based classification network for robust depth prediction. In: Asian Conference on Computer Vision. pp. 663–678. Springer (2018)

23. Liu, C., Kim, K., Gu, J., Furukawa, Y., Kautz, J.: Planercnn: 3d plane detection and reconstruction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4450–4459 (2019)

24. Liu, C., Yang, J., Ceylan, D., Yumer, E., Furukawa, Y.: Planenet: Piece-wise planar reconstruction from a single rgb image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2579–2588 (2018)

25. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

26. Qi, X., Liao, R., Liu, Z., Urtasun, R., Jia, J.: Geonet: Geometric neural network for joint depth and surface normal estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 283–291 (2018)

27. Ramamonjisoa, M., Lepetit, V.: Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation. In: The IEEE International Conference on Computer Vision (ICCV) Workshops (2019)

28. Ren, H., El-khamy, M., Lee, J.: Deep robust single image depth estimation neural network using scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 37–45 (2019)

29. Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems. pp. 1161–1168 (2006)

30. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision. pp. 746–760. Springer (2012)

31. Song, S., Lichtenberg, S.P., Xiao, J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 567–576 (2015)

32. Szeliski, R.: Structure from motion. In: Computer Vision, pp. 303–334. Springer (2011)

33. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: Mnasnet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2820–2828 (2019)

34. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3156–3164 (2017)

35. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803 (2018)

36. Wofk, D., Ma, F., Yang, T.J., Karaman, S., Sze, V.: Fastdepth: Fast monocular depth estimation on embedded systems. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 6101–6108. IEEE (2019)

37. Xiao, J., Owens, A., Torralba, A.: Sun3d: A database of big spaces reconstructed using sfm and object labels. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1625–1632 (2013)

38. Xu, D., Wang, W., Tang, H., Liu, H., Sebe, N., Ricci, E.: Structured attention guided convolutional neural fields for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3917–3925 (2018)

39. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. pp. 2048–2057 (2015)

40. Yin, W., Liu, Y., Shen, C., Yan, Y.: Enforcing geometric constraints of virtual normal for depth prediction. In: The IEEE International Conference on Computer Vision (ICCV) (2019)

41. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations (ICLR) (2016)

42. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 472–480 (2017)

43. Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J.Y., Jin, H., Funkhouser, T.: Physically-based rendering for indoor scene understanding using convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

44. Zhang, Z., Cui, Z., Xu, C., Yan, Y., Sebe, N., Yang, J.: Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4106–4115 (2019)
