waveletstereo: learning wavelet coefficients of disparity map in … · 2020. 6. 29. · ward...

WaveletStereo: Learning Wavelet Coefficients of Disparity Map in

Stereo Matching

Menglong Yang, Fangrui Wu and Wei Li

School of Aeronautics and Astronautics, Sichuan University

Chengdu, Sichuan, PR China

[email protected], [email protected], [email protected]

Abstract

Some stereo matching algorithms based on deep learn-

ing have been proposed and achieved state-of-the-art per-

formances since some public large-scale datasets were put

online. However, the disparity in smooth regions and de-

tailed regions is still difficult to accurately estimate simul-

taneously. This paper proposes a novel stereo matching

method called WaveletStereo, which learns the wavelet co-

efficients of the disparity rather than the disparity itself.

The WaveletStereo consists of several sub-modules, where

the low-frequency sub-module generates the low-frequency

wavelet coefficients, which aims at learning global context

information and well handling the low-frequency regions

such as textureless surfaces, and the others focus on the de-

tails. In addition, a densely connected atrous spatial pyra-

mid block is introduced for better learning the multi-scale

image features. Experimental results show the effectiveness

of the proposed method, which achieves state-of-the-art per-

formance on the large-scale test dataset Scene Flow.

1. Introduction

As a convenient and cheap means of obtaining object

depth, stereo vision acts a more and more pivotal part in

computer vision, with the increasing applications of vir-

tual (augmented) reality, 3D object detection and recog-

nition, motion sense game and unmanned aerial vehicle,

etc. Scholars have paid many attentions to stereo match-

ing, which is a key step for stereo vision. Recent years, the

research on stereo matching has got satisfying achievement

since some public datasets were published online, such as

Middlebury [39] and KITTI stereo benchmark [10, 12],

which is convenient for researchers to compare their algo-

rithms against state-of-the-art algorithms. However, stereo

matching for the complex environment including amounts

of ill-posed regions, such as textureless or detailed regions,

is still a challenging topic[32].

As Scharstein and Szeliski [39] summarized, there were

usually four steps in a typical traditional stereo matching al-

gorithm, i.e., matching cost computation, cost aggregation,

optimization, and disparity refinement, respectively. Tradi-

tional stereo matching algorithms can be mainly split into

two categories, i.e., the local methods and the global (semi-

global) methods.

Most local stereo matching methods were more inter-

ested at studying the first two steps [16, 53], and often

accurately estimated the disparity in regions with high-

frequency details, but frequently failed in low-frequency ar-

eas such as textureless and saturated regions.

To improve the performance in low-frequency region,

many global (semi-global) methods, such as graph cuts

[24, 36, 33], belief propagation (BP) [23, 52, 51, 60, 49] and

Semi-global matching (SGM) algorithm [15], focused more

on the research of latter two steps. A Conditional Random

Field (CRF) model was often constructed in global (semi-

global) algorithms, which contained the assumptions of

photo-consistency and smoothness. The photo-consistency

expects that the matching pixels have similar appearance

features, and the smoothness constrains the divergence

of neighboring pixels’ label except some disparity-jump

places, which measures the cost of assigning labels to

neighboring pixels, such as the pairwise smoothness [23,

52, 51, 60] or the higher-order smoothness [49, 47, 25].

Compared with local methods, these methods improved the

performance in the low-frequency regions, but solving the

CRF model is often time consuming and they may still

falsely predict the disparity for pixels in high-frequency re-

gions.

Recent works based on deep learning use convolutional

neural networks to learn similarity computation and con-

textual information, and vastly improve the accuracy and

robustness of disparity estimation [55, 20]. However, there

are a lot of mismatches in ill-posed regions, especially the

high-frequency regions, such as thin surfaces, occlusion ar-

eas and repeated patterns, although these regions account

for only a small proportion and the false predictions in them

12885

have minor impact on the total evaluation of accuracy. In-

tuitively, the disparity estimation in low-frequency regions

is more dependent on the global contextual information, but

the estimation in high-frequency regions depends more on

the image details. Therefore, it is difficult to train a network

to simultaneously predict these regions accurately. In order

to resolve this problem, this paper proposes a novel stereo

matching algorithm based on learning wavelet coefficients

of the disparity map, with main contributions summarized

as follows.

Firstly, we proposed an end-to-end architecture for

stereo matching, called WaveletStereo, which incorporates

a mechanism of multi-resolution wavelet reconstruction.

WaveletStereo contains several predictors of wavelet coef-

ficients, which estimate the wavelet coefficients with dif-

ferent resolutions of the disparity map and calculate the dis-

parity map through multi-resolution wavelet reconstruction,

where the low-frequency wavelet predictors focus on learn-

ing the global contextual information of the disparity map,

while the high-frequency wavelet predictors concentrate on

generating the details of the disparity map.

Secondly, we proposed a densely connected atrous spa-

tial pyramid block, which effectively captures multi-scale

contextual information with relatively few parameters. A

feature representation with a very deep and complex struc-

ture can benefit the final performance for the stereo algo-

rithms, but it can also increase the computational time and

the complexity of training. Based on the proposed block,

this work obtained good accuracy without the computa-

tional burden.

Finally, we adopted the proposed WaveletStereo algo-

rithm to achieve the state-of-the-art performance on the

large-scale stereo benchmark Scene Flow.

2. Related work

Recent years, the public stereo datasets [9, 38, 31] have

spawned many learning-based stereo algorithms. Early

learning-based stereo algorithms put the learning mecha-

nisms into the traditional stereo matching framework. For

example, some works[42, 14] trained a model to automat-

ically estimate the confidence of the computed matching

cost. As a pioneer of using convolutional neural network

(CNN) in stereo, J. Zbontar and Y. LeCun [55] used a CNN

to learn similarity computation for a pair of image patches

to improve the robustness to image noise and illumination

variation, and refined the disparity using a traditional Semi-

global matching algorithm [15]. It outperformed on KITTI

benchmark, although it frequently suffered from mismatch

in ill-posed regions. The extensive works [30, 48] dramat-

ically speed-up the calculation of matching cost by using

cosine similarity or Euclidean distance of the embedding

features to compute the matching cost, instead of the for-

ward propagation of several fully connected layers. Be-

sides learning matching cost, some methods [2, 50] added

a model to train smoothness cost, in order to reduce the

roughness and over-smoothness when using SGM algo-

rithm to refine the disparity. In addition, some other meth-

ods addressed refining matching cost [26, 27] or learning

the parameters of optimization model [56, 37, 38].

Many end-to-end learning methods have successively

achieved the state-of-the-art performances since the emer-

gence of Scene Flow [31], which is a large-scale synthetic

stereo dataset. Kendall et al. [20] used deep feature repre-

sentations to form a cost volume and adopted cost filtering

by a series of 3D convolutions to learn contextual informa-

tion. It achieved sub-pixel accuracy with a differentiable

soft argmin operation. Since then, many similar approaches

have been proposed. Yu et al. [54] used a mechanism of

multiple cost aggregation proposals to refine the cost vol-

ume. The pyramid stereo matching network (PSMNet) [5]

was proposed, where a spatial pyramid pooling module ag-

gregates context in different scales and locations to form

a cost volume, and a stacked multiple hourglass network

learns to regularize cost volume. Guo et al. [13] presented

a method called group-wise correlation to improve the rep-

resentations for measuring feature similarities. The idea of

left-right consistency check was adopted in LRCR [18] and

[4] to refine the disparity estimation. Song et al. [41] im-

proved the details in disparity maps by utilizing a multi-task

learning in conjunction with edge detection task. A two-

stages cascade CNN architecture was adopted in [35], in

which the first stage added the up-convolution modules into

DispNet [31] to improve the details in disparities maps and

the second stage learned the multi-scale residuals to further

refine the disparity. Similarly, Khamis et al. [21] employed

a learned edge-aware upsampling function to refine the dis-

parity predicted from a very low resolution cost volume.

The low resolution cost volume gives it speed advantage and

the upsampling operations refine details, but it is unlikely

to recover the details if the detailed signals are completely

missing from the low resolution cost volume. In addition,

some other works focused on self-supervised methods to

learn from the open-world unlabeled data [57, 59].

This work solves the stereo by learning the wavelet co-

efficients of disparity map, rather than directly learning the

disparity. Spectral approaches have been widely applied in

image processing tasks [29, 43, 3, 44] for many years. Re-

cently, there are some works exploring the possible advan-

tages of learning the filters by deep learning for image anal-

ysis in the wavelet domain [7] or incorporating a spectral

approach into CNNs [46]. Especially, Huang et al. proved

the feasibility of solving the over-smoothed problem and

improving the textural details in the image super-resolution

application, by learning the wavelet coefficients of the high-

resolution image. To the best of our knowledge we are

the first to study the wavelet learning algorithm for stereo

12886

matching.

3. WaveletStereo

We describe the proposed algorithm that predicts dispar-

ity for a rectified pair of stereo images in this section. The

architecture can be mainly divided into three modules in-

cluding deep representation, multi-resolution cost volumes

and multi-resolution reconstruction, as shown in Fig. 1. The

following sections respectively describe these main mod-

ules, the detailed network configuration is elaborated in the

supplementary material.

3.1. Deep representation

The deep representation aims at learning to encode lo-

cal and global contextual information, which extracts unary

features from a pair of stereo images with a shared weight

Siamese network, to form a cost volume. We bring the res-

olution to a fourth by using two downsampling modules,

each of which is followed by a densely connected atrous

spatial pyramid block in order to encode the contextual in-

formation more efficiently. The downsampling is simply

performed by using convolutions with 3× 3× 32 filters and

a stride of 2. The last layer of the deep representation is a

convolution operation with 3 × 3 × 32 filters, and the out-

puts are 14H× 1

4W×32 features for the left and right image,

respectively. Each convolution is followed by a batch nor-

malization and ReLU activation except the output layer.

Inspired by DenseNet [6, 17], we propose the densely

connected atrous spatial pyramid (DCASPP) block, the

structure of which is shown in Fig. 2. To learn different

scales of contextual information, we use multiple atrous

convolutions with different dilated rates and concatenate

their results to group an inception layer, similar to the op-

eration of atrous spatial pyramid pooling (ASPP) in [6]. In

addition, we adopt the densely connected structure to make

use of the advantages presented in [17], e.g. encouraging

feature reuse and substantially reducing the number of pa-

rameters, and further expanding the scale of learnt contex-

tual information without exaggeratively extending the di-

lated rate.

At the first downsampling step, we use two inception lay-

ers, each of which includes four atrous convolutions with

3×3×4 filters and dilated rates of 1, 2, 4 and 8, respectively.

For the second downsampling step, we use four inception

layers, each of which includes two atrous convolutions with

3× 3× 8 filters and dilated rates of 1 and 2, respectively.

3.2. Multiresolution cost volumes

The module of multi-resolution cost volumes is rela-

tively simple, which forms several cost volumes with differ-

ent resolutions from the unary features, and lays the foun-

dation for multi-resolution wavelet reconstruction.

In this work, we first construct a cost volume with a

fourth resolution by concatenating the left and right unary

features with shifted disparities, similar to [20], which is

followed by two consecutive downsampling operations. We

finally obtain three volumes with the resolutions of fourth,

eighth and sixteenth, respectively. The downsampling is

simply performed using 3D convolution with 3×3×3×32filters and stride of 2, which has certain effects of cost fil-

tering as well.

3.3. Multiresolution wavelet reconstruction

The third module learns the regressions from multi-

resolution cost volumes to the wavelet coefficients with dif-

ferent resolutions, performs the wavelet reconstruction level

by level through inverse wavelet transforms and finally ac-

quires a disparity map. The multi-resolution reconstruction

iteratively repeats a similar process, i.e., mapping a cost vol-

ume to the wavelet coefficients, which is finished by a CNN

composed of cost filtering and wavelet regression, as shown

in Fig. 3. Note it does not compute the low-frequency

wavelet approximation (the yellow regions) except for the

lowest resolution. The low-frequency wavelet approxima-

tion incorporates wider context, while the high-frequency

wavelet coefficients describe details for disparity map.

In this work, we use CNNs to learn the Haar wavelet co-

efficients of the disparity map, which are sufficient to depict

scene information of different frequencies. There are four

wavelet coefficients with the same resolution at each level,

i.e., the low-frequency approximation, the horizontal high

frequency wavelet, the vertical high frequency coefficient

and the diagonal high frequency information, respectively.

The four wavelet coefficients are used to reconstruct the

low-frequency approximation of the higher resolution by in-

verse wavelet transform, which is iteratively repeated until

the disparity map with the full resolution is reconstructed.

More specifically, a CNN is used to map the cost with the

lowest resolution (1/16) into the low-frequency wavelet ap-

proximation and corresponding high-frequency coefficients

with the resolution of 1/8 (it includes an upsampling opera-

tion in the CNN). Then the low-frequency wavelet approxi-

mation and high-frequency coefficients are used together to

get wavelet approximation with the higher resolution (1/4)

by inverse wavelet transform. Similar steps are iteratively

performed level by level until getting the disparity map with

the full resolution.

Here the cost filtering includes a series of 3D convolu-

tions and transposed convolutions, the network configura-

tion of which can be seen in the supplementary material in

detail. An important issue is described below, which is how

wavelet coefficients of the disparity map are regressed from

the filtered cost volumes.

Recently, a regression operation proposed in [20], i.e.,

soft argmin, is widely used in disparity estimation after cost

12887

Deep Representation Multi-resolution Wavelets ReconstructionInput Stereo Images

......

Multi-resolution Cost Volumes Disparity

......

Figure 1. WaveletStereo Network architecture pipeline.

Previous

Layer

Conv

（DR=2）

Conv

（DR=1）

Next

Layer

Conv

（DR=4）

Conv

（DR=2）

Conv

（DR=1）

Conv

（DR=4）

……

……

……

Concat

Concat

Concat

Figure 2. The structure of densely connected atrous spatial pyra-

mid module, where ”DR” denotes the dilated rate.

filtering. For each pixel i, the disparity di is regressed as a

weighted softmax function:

di =

Dmax∑

d=0

d×e−yi(d)

∑Dmax

d′=0 e−yi(d′)(1)

where yi is the filtered cost at pixel i and Dmax is the pre-

defined maximum disparity. The softmax operation to the

filtered cost at pixel i, i.e.,

pi(d) =e−yi(d)

∑Dmax

d′=0 e−yi(d′)(2)

can be regarded as the probability of di = d. The formula-

tion (1) can be abbreviated as

D =

Dmax∑

d=0

d× P (d) (3)

where D is the estimated disparity map, and P (d) is the

softmax operation to the filtered cost. Making a wavelet

transform to the disparity map, it has

f(D) = f

(

Dmax∑

d=0

d× P (d)

)

=

Dmax∑

d=0

d× f (P (d)) (4)

where f(·) is the wavelet transform. If the wavelet coeffi-

cient of the disparity map is denoted as ψ, it can be calcu-

lated as

ψ =

Dmax∑

d=0

d× ψ(d) (5)

The disparity map D can be reconstructed by inverse

wavelet transform with the wavelet coefficients, the esti-

mation of the disparity D can be thus transformed into the

prediction of the wavelet coefficients. In fact, ψ(d) in (5)

can be regarded as the wavelet coefficients of P (d). This

wavelet transform can be iteratively performed, i.e., multi-

resolution wavelet decomposition, which is shown as Figure

4.

If the maximum value of the disparity map is Dmax,

the value range of its low-frequency approximation at first

level decomposition is [0, 2Dmax], and the value range of

its high-frequency coefficients at first level decomposition is

[−Dmax, Dmax], after Haar decomposition. And so on, the

value ranges of low-frequency and high-frequency wavelet

coefficients at l-th level decomposition are [0, 2lDmax]and [−2l−1Dmax, 2

l−1Dmax], respectively. Therefore, the

wavelet coefficients can be calculated as weighted softmax

functions, i.e., the low-frequency coefficent at pixel i of

level l is

ψli =

Dmax∑

d=0

2ld×e−yi(d)

∑Dmax

d′=0 e−yi(d′), (6)

and the high-frequency coefficients at pixel i of level l are

ψli =

Dmax∑

d=0

2l−1d×

(

e−ǫi(d)

∑Dmax

d′=0 e−ǫi(d′)−

e−ηi(d)

∑Dmax

d′=0 e−ηi(d′)

)

,

(7)

where yi, ǫi and ηi are the filtered costs at pixel i. Note we

use two variables ǫi and ηi, the softmax values of which are

both [0, 2l−1Dmax], to ensure the right value range of the

high-frequency coefficients.

3.4. Loss

We train our model with supervised learning using

groundtruth disparity data, where the loss function contains

two terms.

12888

...

Inverse Wavelet Transform

Cost Volume

Wavelet RegressionCost Filtering

softmax

softmax

softmax

softmax

subtraction

subtraction

subtraction

weighted

summation

weighted

summation

weighted

summation

weighted

summation

Figure 3. The pipeline from a cost volume to the wavelet coefficients.

Multi-resolution

Wavelet Decomposition

Multi-resolution

Wavelet Reconstruction

Figure 4. The prediction of the wavelet coefficients of the disparity map is equivalent to the estimation of the disparity itself, in terms of

wavelet transform. Left column is the reference images, the middle column is the final estimated disparities, and the last column is the

predicted wavelet coefficients.

The first term aims at training the predictors of the

wavelet coefficients. Similar to [41], we use the smooth L1

loss between the predicted wavelet coefficients ψli (includ-

ing low-frequency and high-frequency coefficients) and the

ground truth wavelet coefficients ψli for each labeled pixel

i, for its low sensitivity to outliers, which is defined as

L1 =1

N

L∑

l=1

N∑

i=1

smoothL1

(

ψli − ψl

i

)

, (8)

in which

smoothL1(ǫ) =

{

0.5ǫ2, if |ǫ| < 1|ǫ| − 0.5, otherwise

, (9)

where N is the number of the pixels. The groundtruth

wavelet coefficients are obtained from decomposing the

groudtruth disparity map by 2-D fast wavelet transform

[34, 28].

We adopt second term to supervise the final disparity

map. The loss is defined as

L2 =1

N

N∑

i=1

smoothL1

(

di − di

)

(10)

where di and di are the predicted disparity and the

groundtruth disparity value of the pixel i, respectively.

Finally, we train the model using an end-to-end super-

vised learning mechanism with following loss function.

L = L1 + L2 (11)

4. Experiment

Experimental setup and results are presented in this sec-

tion. We not only evaluate the performances of the pro-

posed method on public stereo benchmarks and compare

them with some state-of-the-art stereo methods, but also an-

alyze the effectiveness of each proposed module by ablation

studies.

4.1. Implementation details

Datasets. We evaluate our method on the Scene Flow

[12], KITTI 2012 [10] and KITTI 2015 [32] datasets in this

work.

(i) Scene Flow [31] is a large synthetic dataset contain-

ing 35, 454 stereo pairs for training and 4, 370 for test-

ing. SceneFlow has a large scale training data, which

12889

is beneficial to the training of deep-learning-based

method. In addition, the groundtruth of Scene Flow

has no measurement error unlike that of other realis-

tic datasets, because it is completely synthetic. Com-

bined with large scale test data, it thus facilitates test-

ing the accuracy of the algorithms more thoroughly

and precisely.

(ii) KITTI is a real-world dataset with dynamic street

views from the perspective of a driving car, in which

the groundtruth depth maps for training and evaluation

are sparsely obtained from LIDAR data. It includes

KITTI 2012 and KITTI 2015, where KITTI 2012 pro-

vides 194 stereo pairs for training and 195 for evalu-

ation through its online leaderboard, and KITTI 2015

provides 200 pairs for training and 200 pairs for eval-

uation.

Training. We adopted TensorFlow [1] to train the con-

volutional neural networks with the stochastic optimization

algorithm of Adam [22], where β1 = 0.9, β2 = 0.999 and

ǫ = 10−8. For Scene Flow dataset, we used an initial learn-

ing rate of 0.001 which was kept constant for the first 10

epochs and then set to be 0.0001 until the end (approxi-

mately 25 epochs). For KITTI dataset, we fine-tuned the

model pre-trained on Scene Flow for a further 200 epochs

with a constant learning rate of 0.0001. The parameters of

the networks were initialized from random, and trained on a

Nvidia GeForce Titan RTX GPU with a batch size of 2. The

pixel intensities of each image are normalized into [−1, 1].The maximum disparity was set as Dmax = 192.

4.2. Ablation studies

In this section, we conduct several experiments on the

Scene Flow dataset to compare some different model vari-

ants and justify the effectiveness of our design choices,

where we test our networks with two widely used metrics

for evaluation:

(i) Endpoint error (EPE): the average Euclidean distance

between the pixels of estimated disparity and the

groundtruth.

(ii) Three-pixel error (> 3 pixel): the percentage of pixels

with endpoint error more than 3.

We first study the effectiveness of the deep represen-

tation module by comparing the proposed densely con-

nected atrous spatial pyramid (DCASPP) block with Vortex

Pooling[4] and atrous spatial pyramid pooling (ASPP) [6].

The experimental results are shown in first half of Table 1,

where we can see the effectiveness of DCASPP. The dif-

ference of calculation time between ASPP and DCASPP is

little, but the performance of DCASPP is significantly better

than ASPP.

Then we study the effectiveness of the wavelet learning

by several ablation experiments. First, we use CNNs to

learn only the low-frequency coefficients, which is basically

equivalent to multi-scale upsampling refinement mecha-

nism similar to StereoNet [21]. Next, we adopt CNNs to

learn the high-frequency coefficients of 1st level, 2nd level

and 3rd level step by step. The experimental results are

shown in second half of Table 1, where we can see the ef-

fectiveness of wavelet learning. Here we denote the low-

frequency as ’LF’ and the high-frequency as ’HF’.

Table 1. Results on the Scene Flow dataset. We compare differ-

ent architecture variants to justify the effectiveness of our design

choices.Model > 3 px (%) EPE Time

Vortex Pooling[4] 5.68 1.20 0.41s

ASPP[6] 5.48 1.07 0.26s

DCASPP 4.13 0.84 0.27s

Low-frequency Only 13.54 1.855 0.1s

LF + HF of Level 3 4.47 0.89 0.13s

LF + HF of Level 2 + 3 4.42 0.856 0.18s

Full Model 4.13 0.84 0.27s

Figure 5 shows the predicted disparity map of some ex-

amples on Scene Flow test data. The low-frequency re-

gions, such as the textureless surfaces, are accurately es-

timated due to the accurate prediction of the low-frequency

coefficients. Some sharp regions, like the thin surfaces, can

be also correctly estimated, which is credited to the high-

frequency predictors.

4.3. Comparisons with the stateoftheart methods

In this section, we compare the proposed algorithm with

the state-of-the-art methods. Firstly we compare the our

method with the state-of-the-art algorithms on SceneFlow

dataset, including PSMNet [5], DispFulNet [35], CRL [35],

GC-Net [20], DRR [11], Edge Stereo [41], StereoNet [21] ,

DeepPruner [8] and Stereo-DRNet [4], as presented in Ta-

ble 2. Our approach outperforms previous methods in terms

of two evaluation metrics. The methods whose results are

closest to ours are Stereo-DRNet [4] and DeepPruner [8].

Stereo-DRNet [4] focused on the refinement of the dispar-

ity. It predicts the disparities of left and right views si-

multaneously, and utilizes left-right image consistence and

disparity consistence to further refine the disparity. How-

ever, our method outperforms this architecture without such

a mechanism of left-right consistency check. DeepPruner

[8] adopted a differentiable module to discard most dispar-

ities without requiring full cost volume evaluation, in order

to speed up the estimation of the disparity. It achieved a

comparable result with Stereo-DRNet [4] on Scene Flow,

which was slightly weaker than the proposed algorithm.

Then we evaluate our model on KITTI. Table 3 and table

4 compare the error rates of some published state-of-the-art

12890

Table 2. Comparisons of stereo matching algorithms on the Scene Flow test set.

metric PSMNet [5] DispFulNet [31] CRL [35] GC-Net [20] DRR [11]

> 3 pixel (%) - 8.61 6.20 7.20 7.21

EPE 1.09 1.75 1.32 - -

metric EdgeStereo [41] StereoNet [21] Stereo-DRNet [4] DeepPruner [8] Ours

> 3 pixel (%) 4.99 - - - 4.13

EPE 1.12 1.10 0.86 0.86 0.84

Table 3. KITTI 2012 test set results [10]. This benchmark contains 194 train and 195 test image pairs.

Method Out-Noc Out-All Avg-Noc Avg-All Runtime

PSMNet [5] 1.49 % 1.89 % 0.5 px 0.6 px 0.41 s

Stereo-DRNet [4] 1.42 % 1.83 % 0.5 px 0.5 px 0.23 s

EdgeStereo [41] 1.73 % 2.18 % 0.5 px 0.6 px 0.48 s

GC-NET [20] 1.77 % 2.30 % 0.6 px 0.7 px 0.9 s

PDSNet [45] 1.92 % 2.53 % 0.9 px 1.0 px 0.5 s

SGM-Net [2] 2.29 % 3.50 % 0.7 px 0.9 px 67 s

SsSMnet [58] 2.30 % 3.00 % 0.7 px 0.8 px 0.8 s

PBCP [40] 2.36 % 3.45 % 0.7 px 0.9 px 68 s

Displets v2 [12] 2.37 % 3.09 % 0.7 px 0.8 px 265 s

Ours 1.66 % 2.18 % 0.5 px 0.6 px 0.27 s

Table 4. KITTI 2015 test set results [32]. This benchmark con-

tains 200 training and 200 test color image pairs. The qualifier

‘bg’ refers to background pixels which contain static elements, ‘fg’

refers to dynamic object pixels, while ‘all’ is all pixels (fg+bg).

The results show the percentage of pixels which have error greater

than three pixels or 5% disparity error from all 200 test images.

Method D1-bg D1-fg D1-all Time

PSMNet [5] 1.86 % 4.62 % 2.32 % 0.41 s

Stereo-DRNet [4] 1.72 % 4.95 % 2.26 % 0.23 s

EdgeStereo [41] 2.27 % 4.18 % 2.59 % 0.27 s

CRL [35] 2.48 % 3.59 % 2.67 % 0.47 s

GC-NET [20] 2.21 % 6.16 % 2.87 % 0.9 s

LRCR [19] 2.55 % 5.42 % 3.03 % 49.2 s

DRR [11] 2.58 % 6.04 % 3.16 % 0.4 s

SsSMnet [58] 2.70 % 6.92 % 3.40 % 0.8 s

Displets v2 [40] 3.00 % 5.56 % 3.43 % 265 s

PBCP [40] 2.58 % 8.74 % 3.61 % 68 s

SGM-Net [2] 2.66 % 8.64 % 3.66 % 67 s

Ours 2.12 % 5.34 % 2.65 % 0.27 s

algorithms comparable to the proposed method on KITTI

2012 and 2015 datasets, respectively. Visual results are not

shown here for conciseness. We recommend the readers to

go to the KITTI website [10] for more details.

The proposed method achieves an error rate of 2.18%in KITTI 2012 and 2.65% in KITTI 2015 (three-pixel er-

ror), which is comparable to EdgeStereo [41] but inferior

than Stereo-DRNet [4] and PSMNet [5]. However Wavelet-

Stereo significantly outperforms these methods on Scene

Flow test set. We think this difference is mainly attributed

to the fact that groundtruth of KITTI training set is sparsely

labeled, and it is difficult for which to compute the high-

frequency wavelet coefficients in too many regions. The

training of the high-frequency wavelet coefficient predic-

tion is required to large scale training data with amount of

details, and the training of the high-frequency predictors is

thus not comprehensive. In other words, the loss function

(8) almost does not work on KITTI. In addition, the sparsely

labeled data of KITTI, especially the data often unlabeled in

the high-frequency regions, may also affect the evaluation

rank of the proposed method. Scene Flow dataset contains

more than 30 thousands training images, hence the well-

learned WaveletStereo achieves the state-of-the-art perfor-

mance on this dataset.

5. Conclusion

This paper proposes a novel end-to-end deep learning

architecture with a multi-resolution wavelet reconstruction

mechanism for stereo vision, which consists of multi-scale

predictors of wavelet coefficients. The low-frequency pre-

dictors exploit global context information and well han-

dles the low-frequency regions such as textureless surfaces,

and the high-frequency predictors generate details. In addi-

tion, a module of densely connected atrous spatial pyramid

is proposed and used in the deep representation for better

learning different scales of contextual information. Exper-

imental results demonstrates the efficacy of the proposed

method.

This work is an attempt to combine the traditional image

processing technique with deep learning, for stereo match-

ing. The understanding of the wavelet will contribute to

applying the stereo algorithm more flexible. In some ap-

plications, the high-frequency predictors can be removed

to obtain better timeliness, if the details are not required.

12891

Figure 5. Qualitative results of Scene Flow test set. From left to right: left stereo input image, predicted disparity and the errors. The last

row is the color bar for error maps.

We can make a reasonable trade-off between accuracy and

speed according to application requirements.

Acknowledgement

This work is supported by the National Natural Science

Foundation of China (Grant No. U1933134, U19A2071 and

61860206007), Sichuan Science and Technology Program

(Grant No. 18YYJC1287) and the funding from Sichuan

University (Grant No. 2018SCUH0042).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,

C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al.

Tensorflow: Large-scale machine learning on heterogeneous

distributed systems. arXiv: Distributed, Parallel, and Clus-

ter Computing, 2015.

12892

[2] S. Akihito and P. Marc. Sgm-nets: Semi-global matching

with neural networks. In The IEEE Conference on Com-

puter Vision and Pattern Recognition (CVPR), pages 231–

240, June 2017.

[3] E. J. Candes. The curvelet transform for image denoising.

1:7, 2001.

[4] R. Chabra, J. Straub, C. Sweeney, R. A. Newcombe, and

H. Fuchs. Stereodrnet: Dilated residual stereo net. arXiv:

Computer Vision and Pattern Recognition, 2019.

[5] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching net-

work. In Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, pages 5410–5418, 2018.

[6] L. Chen, G. Papandreou, F. Schroff, and H. Adam. Re-

thinking atrous convolution for semantic image segmenta-

tion. arXiv: Computer Vision and Pattern Recognition, 2017.

[7] F. Cotter and N. G. Kingsbury. Deep learning in the wavelet

domain. arXiv: Computer Vision and Pattern Recognition,

2018.

[8] S. Duggal, S. Wang, R. H. Wei-Chiu Ma, and R. Urtasun.

Deeppruner: Learning efficient stereo matching via differen-

tiable patchmatch. In International Conference on Computer

Vision, pages 4384–4393, 2019.

[9] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets

robotics: The kitti dataset. International Journal of Robotics

Research (IJRR), 2013.

[10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au-

tonomous driving? the kitti vision benchmark suite. In

Conference on Computer Vision and Pattern Recognition

(CVPR), 2012.

[11] S. Gidaris and N. Komodakis. Detect, replace, refine: Deep

structured prediction for pixel wise labeling. computer vision

and pattern recognition, pages 7187–7196, 2017.

[12] F. Guney and A. Geiger. Displets: Resolving stereo ambigu-

ities using object knowledge. 2015.

[13] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li. Group-

wise correlation stereo network. arXiv: Computer Vision

and Pattern Recognition, 2019.

[14] R. Haeusler, R. Nair, and D. Kondermann. Ensemble learn-

ing for confidence measures in stereo vision. pages 305–312,

2014.

[15] H. Hirschmuller. Stereo processing by semiglobal match-

ing and mutual information. IEEE Transactions on Pattern

Analysis and Machine Intelligence, 30(2):328–341, 2008.

[16] A. Hosni, M. Bleyer, M. Gelautz, and C. Rhemann. Local

stereo matching using geodesic support weights. In Interna-

tional Conference on Image Processing, pages 2093–2096,

2009.

[17] G. Huang, Z. Liu, L. V. Der Maaten, and K. Q. Weinberger.

Densely connected convolutional networks. In IEEE Con-

ference onComputer Vision and Pattern Recognition, pages

2261–2269, 2017.

[18] Z. Jie, P. Wang, Y. Ling, B. Zhao, Y. Wei, J. Feng, and

W. Liu. Left-right comparative recurrent model for stereo

matching. In IEEE Conference onComputer Vision and Pat-

tern Recognition, pages 3838—-3846, 2018.

[19] Z. Jie, P. Wang, Y. Ling, B. Zhao, Y. Wei, J. Feng, and

W. Liu. Left-right comparative recurrent model for stereo

matching. In Proceedings of the IEEE Conference on Com-

puter Vision and Pattern Recognition, pages 3838–3846,

2018.

[20] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry,

R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning

of geometry and context for deep stereo regression. CoRR,

vol. abs/1703.04309, 2017.

[21] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin,

and S. Izadi. Stereonet: Guided hierarchical refinement

for real-time edge-aware depth prediction. arXiv preprint

arXiv:1807.08865, 2018.

[22] D. P. Kingma and J. Ba. Adam: A method for stochastic op-

timization. international conference on learning representa-

tions, 2015.

[23] A. Klaus, M. Sormann, and K. Karner. Segment-based stereo

matching using belief propagation and a self-adapting dis-

similarity measure. In IEEE International Conference on

Pattern Recognition, volume 3, pages 15–18, 2006.

[24] V. Kolmogorov and R. Zabih. Computing visual correspon-

dence with occlusions using graph cuts. In IEEE Inter-

national Conference on Computer Vision, volume 2, pages

508–515, 2001.

[25] N. Komodakis and N. Paragios. Beyond pairwise energies:

Efficient optimization for higher-order mrfs. In IEEE Con-

ference on Computer Vision and Pattern Recognition, pages

2985–2992, 2009.

[26] D. Kong and H. Tao. A method for learning matching errors

for stereo computation. In Proceedings of the British Ma-

chine Vision Conference, pages 11.1–11.10. BMVA Press,

2004. doi:10.5244/C.18.11.

[27] D. Kong and H. Tao. H.: Stereo matching via learning mul-

tiple experts behaviors. In: BMVC, pages 97–106, 2006.

[28] G. R. Lee, R. Gommers, F. Wasilewski, K. Wohlfahrt, and

A. O’Leary. Pywavelets: A python package for wavelet anal-

ysis. Journal of Open Source Software, 4(36):1237, 2019.

[29] H. Li, B. S. Manjunath, and S. K. Mitra. Multisensor image

fusion using the wavelet transform. Graphical Models and

Image Processing, 57(3):235–245, 1995.

[30] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learn-

ing for stereo matching. In The IEEE Conference on Com-

puter Vision and Pattern Recognition (CVPR), June 2016.

[31] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers,

A. Dosovitskiy, and T. Brox. A large dataset to train convo-

lutional networks for disparity, optical flow, and scene flow

estimation. In The IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), June 2016.

[32] M. Menze and A. Geiger. Object scene flow for autonomous

vehicles. In Conference on Computer Vision and Pattern

Recognition (CVPR), 2015.

[33] D. Miyazaki, Y. Matsushita, and K. Ikeuchi. Interactive

shadow removal from a single image using hierarchical

graph cut. Asian Conference on Computer Vision, pages

234–245, 2010.

[34] S. Naik and N. Patel. Single image super resolution in spatial

and wavelet domain. The International Journal of Multime-

dia & Its Applications, 5(4):23–32, 2013.

12893

[35] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan. Cascade

residual learning: A two-stage convolutional neural network

for stereo matching. In ICCV Workshops, volume 7, 2017.

[36] N. Papadakis and V. Caselles. Multi-label depth estimation

for graph cuts stereo problems. Journal of Mathematical

Imaging and Vision, 38(1):70–82, 2010.

[37] M. Peris, S. Martull, A. Maki, Y. Ohkawa, and K. Fukui. To-

wards a simulation driven stereo vision system. In Proceed-

ings of the 21st International Conference on Pattern Recog-

nition, pages 1038–1042, 2012.

[38] D. Scharstein and C. Pal. Learning conditional random fields

for stereo. In Conference on Computer Vision and Pattern

Recognition (CVPR), pages 1–8, 2007.

[39] D. Scharstein and R. Szeliski. A taxonomy and evaluation of

dense two-frame stereo correspondence algorithms. Interna-

tional Journal of Computer Vision, 47((1/2/3)):7–42, 2002.

[40] A. Seki and M. Pollefeys. Patch based confidence prediction

for dense disparity map. British Machine Vision Conference

(BMVC), 2016.

[41] X. Song, X. Zhao, H. Hu, and L. Fang. Edgestereo: A con-

text integrated residual pyramid network for stereo matching.

arXiv preprint arXiv:1803.05196, 2018.

[42] A. Spyropoulos, N. Komodakis, and P. Mordohai. Learning

to detect ground control points for improving the accuracy of

stereo matching. pages 1621–1628, 20014.

[43] J. Starck, E. J. Candes, and D. L. Donoho. The curvelet

transform for image denoising. IEEE Transactions on Image

Processing, 11(6):670–684, 2002.

[44] H. Tong, M. Li, H. Zhang, and C. Zhang. Blur detection for

digital images using wavelet transform. 1:17–20, 2004.

[45] S. Tulyakov, A. Ivanov, and F. Fleuret. Practical deep stereo

(pds): Toward applications-friendly deep stereo matching.

arXiv preprint arXiv:1806.01677, 2018.

[46] T. Williams and R. Li. Wavelet pooling for convolutional

neural networks. 2018.

[47] O. Woodford, P. Torr, I. Reid, and A. Fitzgibbon. Global

stereo reconstruction under second-order smoothness priors.

IEEE Transactions on Pattern Analysis and Machine Intelli-

gence, 31(12):2115–2128, 2009.

[48] M. Yang, Y. Liu, and Z. You. The euclidean embedding

learning based on convolutional neural network for stereo

matching. Neurocomputing, 267:195–200, 2017.

[49] M. Yang, Y. Liu, Z. You, X. Li, and Y. Zhang. A homography

transform based higher-order mrf model for stereo matching.

Pattern Recognition Letters, 40:66–71, 2014.

[50] M. Yang and X. Lv. Learning both matching cost and

smoothness constraint for stereo matching. Neurocomput-

ing, 2018.

[51] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister.

Stereo matching with color-weighted correlation, hierar-

chical belief propagation, and occlusion handling. IEEE

Transactions on Pattern Analysis and Machine Intelligence,

31(3):492–504, 2009.

[52] Q. Yang, R. Yang, J. Davis, and D. Nister. Spatial-depth

super resolution for range images. In IEEE Conference on

Computer Vision and Pattern Recognition, pages 1–8, 2007.

[53] K.-J. Yoon and I.-S. Kweon. Adaptive support-weight ap-

proach for correspondence search. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 23(4):650–656,

2006.

[54] L. Yu, Y. Wang, Y. Wu, and Y. Jia. Deep stereo matching with

explicit cost aggregation sub-architecture. arXiv: Computer

Vision and Pattern Recognition, 2018.

[55] J. Zbontar and Y. LeCun. Stereo matching by training a con-

volutional neural network to compare image patches. Sub-

mitted to JMLR, 2015.

[56] L. Zhang and S. M. Seitz. Estimating optimal parameters for

mrf stereo from a single image pair. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 29(2):331–342,

2007.

[57] Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kow-

dle, V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser,

and S. Fanello. Activestereonet: End-to-end self-

supervised learning for active stereo systems. arXiv preprint

arXiv:1807.06009, 2018.

[58] Y. Zhong, Y. Dai, and H. Li. Self-supervised learning for

stereo matching with self-improving ability. arXiv preprint

arXiv:1709.00930, 2017.

[59] Y. Zhong, H. Li, and Y. Dai. Open-world stereo video match-

ing with deep rnn. In European Conference on Computer

Vision, pages 104–119, 2018.

[60] C. Zitnick and S. Kang. Stereo for image-based rendering us-

ing image over-segmentation. International Journal of Com-

puter Vision, 75(1):49–65, 2007.

12894

waveletstereo: learning wavelet coefficients of disparity map in … · 2020. 6. 29. · ward...

Documents