
I spy with my little eye: Learning Optimal Filters for Cross-Modal Stereo under Projected Patterns

Wei-Chen Chiu    Ulf Blanke    Mario Fritz
Max-Planck-Institute for Informatics, Saarbrücken, Germany

{walon, blanke, mfritz}@mpi-inf.mpg.de

Abstract

With the introduction of the Kinect as a gaming interface, its broad commercial availability and high-quality depth sensor have attracted attention not only from consumers but also from researchers in the robotics community. The active sensing technique of the Kinect produces robust depth maps for reliable human pose estimation. But for a broader range of applications in robotic perception, its active sensing approach fails under many operating conditions, such as objects with specular or transparent surfaces.

Recently, an initial study has shown that part of the arising problems can be alleviated by complementing the active sensing scheme with passive, cross-modal stereo between the Kinect's RGB and IR cameras. However, the method is troubled by interference from the IR projector that is required for the active depth sensing method. We investigate these issues, conduct a more detailed study of the physical characteristics of the sensors, and propose a more general method that learns optimal filters for cross-modal stereo under projected patterns. Our approach improves results over the baseline in a point-cloud-based object segmentation task without modifications to the Kinect hardware and despite the interference from the projector.

1. Introduction

Despite the advance of elaborate global (e.g. [2]) and semi-global (e.g. [5]) stereo matching techniques, real-time stereo on standard hardware is still dominated by local methods based on patch comparisons. It is all the more surprising that we have seen very little work on improving the correspondences by a learning approach that would be better suited to a certain setting or conditions [10]. Yet the various pre-filters used by practitioners to improve the matching process are clear evidence that there is room for improvement over basic patch-based differences.

In our case the need for learning is even more apparent as we attempt cross-modal matching between the IR and RGB images obtained from the Kinect sensor. Such a system was recently proposed [1]; it augments the active sensing strategy of the Kinect by a passive stereo algorithm between the two available imagers. A very simple pixel-based re-weighting scheme was proposed that produces an IR-like image for improved depth estimates.

Figure 1: (a) Response of the RGB camera (left) and the IR camera (right). (b) and (c) Image pair obtained by the Kinect with projected IR pattern. (d) Disparity map on unfiltered pairs. (e) Disparity map on patch-filtered image pairs.

Figure 2: Diagram of the weighted fusion scheme: the RGB channels are combined with weights w_r, w_g, w_b (and the IR image weighted by w_ir) in a cross-modal adaptation step that produces an IR-like image; stereo matching yields a disparity map that is fused with the Kinect depth into a point cloud.

This paper identifies three issues and consequently improves over the previous work threefold:

First, we take a closer look at the sensor characteristics of the Kinect and realize that the overlap in the spectral response between the sensors is very small. This argues for a learning-based approach that exploits smoothness and correlations in the BRDFs of the materials, as no satisfactory linear reconstruction of the IR channel is possible. Nevertheless, we attempt such a reconstruction and add it as a baseline.

Second, as argued above, we know from practical considerations that patch-based stereo matching is often improved by pre-filtering operations. This is a richer class than the pixel-based weighting previously investigated. We propose a method for learning optimal filters for improving cross-modal stereo that is rich enough to capture channel-based weighting as well as filtering such as sharpening, smoothing, and edge detection.

Third, we realize that for the best results previously obtained [1], the IR projector had to be covered while capturing the IR-RGB pair. However, this is impractical as it eliminates the active depth sensing scheme. We show that our learning-based algorithm can make the stereo algorithm robust to the nuisances introduced by the projector; in fact, we are able to recover the performance previously only achieved with the covered projector.

2. Related Work

Stereo vision has a long history and has been studied extensively over the past decades. However, lighting conditions or specific material properties such as transparency and specularity [7] still complicate stereo matching. In practice such variations are typically reduced by filtering techniques (e.g., Laplacian of Gaussians [9]), non-parametric matching costs (e.g., census [13]), or by hand-tuning parameters for optimal matching. [6] provide a thorough comparison of several stereo matching techniques with respect to complex radiometric variations. They compare a large set of filters and rank them according to performance and computational efficiency.

More recently, machine learning has been employed to automatically find optimal models for stereo matching [10]. Also, [6] propose to learn a pixelwise cost based on mutual information from ground-truth data. However, both approaches are global matching schemes and prohibit real-time applications. Also, the sensitivity to local changes [6] limits their applicability for matching across modalities that exhibit global as well as local variation.

Objects with transparent or specular surfaces have also been addressed as a detection task: [3] learn object models from data, and [8] detect transparent objects employing an additional time-of-flight camera. For increasing the robustness of visual category recognition across different sources, such as the web, high-quality DSLRs, or webcams, the metric learning formulation of [12] proved successful.

The Kinect depth estimate yields impressive results on (un)structured Lambertian surfaces, yet it completely fails on specular, transparent, or reflective surfaces. This issue is addressed in previous work [1] by a cross-modal stereo approach based on the built-in RGB and IR sensors that complements the Kinect depth estimate.

In order to study the similarity between IR and RGB, the authors of [1] investigate various fusion schemes and present a pixel-based optimization based on ground truth (see Fig. 2). In contrast to their work, we replace the pixel-based weighting and focus on learning patch-based filters.

3. Capturing and Analyzing Sensor Characteristics of the Kinect

two slits & grating light source

lambertian surface

Kinect

closed box

Figure 3: Schematic for experimental setup for reading sen-sor characteristics.

In [1] a mapping between IR and RGB was learned from patterns that were illuminated by environmental light. However, no justification was given as to whether there is any hope of actually recovering the sensor response characteristic by a linear combination of the RGB channels. Therefore, we provide here a first analysis of the sensor characteristics of the imagers in the Kinect.

To this end we capture diffracted light which is projected on a "white" surface. This allows us to determine the characteristics of the Kinect cameras by measuring their response to different wavelengths (see Fig. 1).

The setup is depicted in Fig. 3. Environmental light is shielded so that we only capture the relevant wavelengths. A special target that is almost perfectly Lambertian ensures that the results are not corrupted by any specular effects. A light source is directed toward two small slits that serve as an aperture for selecting close-to-parallel light rays. This minimizes overlap between nearby wavelengths on our target. Behind the slits, an optical grating causes diffraction, which separates out the different wavelengths. The light source is a 500 W halogen lamp which, as a black-body-like radiator, emits light across the visible spectrum well into the infrared, following roughly Planck's law [11]. We do not use a calibrated light source in this study and consider it of lesser importance, as we are mostly interested in relative sensitivities under naturally occurring light.
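As a side note, the claim that a halogen lamp radiates well into the infrared can be checked directly against Planck's law, B(λ, T) = (2hc²/λ⁵) / (exp(hc/(λ k_B T)) − 1). The following is a minimal sketch, assuming a filament temperature of roughly 3200 K (a typical value for halogen lamps; we did not measure the actual temperature):

```python
import numpy as np

# Physical constants (SI units)
h = 6.626e-34   # Planck constant [J s]
c = 2.998e8     # speed of light [m / s]
kB = 1.381e-23  # Boltzmann constant [J / K]

def planck(wavelength, T):
    """Spectral radiance of a black body at temperature T [K]:
    B(lambda, T) = (2 h c^2 / lambda^5) / (exp(h c / (lambda kB T)) - 1)."""
    return 2 * h * c**2 / wavelength**5 / np.expm1(h * c / (wavelength * kB * T))

T = 3200.0                    # assumed halogen filament temperature [K]
visible = planck(550e-9, T)   # green light
near_ir = planck(830e-9, T)   # near infrared, roughly the Kinect IR band
print(near_ir / visible)      # > 1: the lamp is brighter in the near IR
```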

After acquiring reference images in ambient light, we calibrate the images and calculate the response profile across wavelengths. We do this for each RGB channel separately, as well as for the IR image (see Fig. 4, bottom). This gives us the sensitivity of each channel independently.

Having an estimate of the sensor response characteristics, we can now estimate a reconstruction of the IR sensor by a linear weighting of the RGB responses. Therefore we find the following least-squares solution:

$$\min_{w} \; \left\| R_{ir} - [R_r \; R_g \; R_b]\, w \right\|^2 \tag{1}$$

where $R_{ir}$ is the spectral response of the IR sensor and $R_r$, $R_g$, $R_b$ are the responses of the red, green, and blue channels, respectively.
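For illustration, once the response curves are sampled at common wavelengths, the fit in Equation 1 amounts to a single call to a linear least-squares solver. A minimal sketch (the file names and array shapes are hypothetical):

```python
import numpy as np

# Hypothetical inputs: response curves sampled at m common wavelengths.
# R_rgb has one column per channel (red, green, blue); R_ir is the IR response.
R_rgb = np.loadtxt("response_rgb.txt")   # shape (m, 3)
R_ir = np.loadtxt("response_ir.txt")     # shape (m,)

# Solve min_w || R_ir - [R_r R_g R_b] w ||^2  (Equation 1)
w, residual, rank, sv = np.linalg.lstsq(R_rgb, R_ir, rcond=None)
print("w_red, w_green, w_blue =", w)

# Linear reconstruction of the IR response ("rgb mimics ir")
R_ir_hat = R_rgb @ w
```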

Figure 4: Spectra from the experiment in Fig. 3, as seen by the RGB camera (left) and the IR camera (right); the blue, green, red, and IR bands appear at separate positions.

Figure 5: Spectral responses from the experiment in Fig. 3, plotted as response over vertical image position. Light-colored plots correspond to the raw response along the blue lines through the image data of Fig. 4; strong-colored plots correspond to the fitted Gaussian curves (red, green, blue, the four IR Bayer channels ir1-ir4, and the linear RGB reconstruction of IR, "rgb mimics ir", also shown amplified).

Results. Fig. 5 depicts the sensor readings we have obtained. The raw sensor data is plotted in pale colors, while the saturated colors show a Gaussian fit. For each channel we subtract the minimum response in order to compensate for sensor noise and residual ambient light, and then fit a Gaussian mixture model with 3 modes, as we observe 3 maxima of the diffraction pattern. The dominant mode is plotted per channel. There are 4 IR channels, as we read the raw IR image from the Kinect, which comes in a Bayer pattern. We expected slightly different response characteristics for each channel, but they turn out to be almost identical. Furthermore, we observe that the overlap between the IR and RGB channels is relatively small. The linear reconstruction of the IR channel from Equation 1 results in the cyan line shown in Fig. 5. As the profile is very flat, we also show an amplified version. The low magnitude indicates that the reconstruction is not working well. The weights for the individual channels are as follows: $w_{red} = 0.0111$, $w_{green} = -0.0066$, $w_{blue} = 0.0022$. Obviously, the red channel has the highest weight, as it is closest to the infrared part. Interestingly, we get a negative weight for green, which "pushes" the red channel further toward the infrared part. The positive weight for blue again partially compensates for the introduced dip in the green-to-blue wavelengths. This is also an interesting parallel to [1], where similar weights were obtained by training on pixel correspondences without explicit knowledge of the spectral sensitivities.
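One way to realize the described fitting step is to fit a sum of three Gaussians to each background-subtracted response profile and keep the mode with the largest amplitude. A sketch with scipy (the profile data and initial guesses are placeholders, not the measured values):

```python
import numpy as np
from scipy.optimize import curve_fit

def three_gaussians(x, *p):
    """Sum of three Gaussians; p holds (amplitude, mean, std) per mode."""
    y = np.zeros_like(x, dtype=float)
    for a, mu, sigma in zip(p[0::3], p[1::3], p[2::3]):
        y += a * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return y

profile = np.loadtxt("channel_profile.txt")  # hypothetical 1-D response profile
profile -= profile.min()                     # subtract sensor noise / ambient light
x = np.arange(len(profile), dtype=float)

# Initial guesses: three diffraction orders spread over the profile (placeholders)
p0 = [profile.max(), 0.25 * len(x), 20.0,
      profile.max(), 0.50 * len(x), 20.0,
      profile.max(), 0.75 * len(x), 20.0]
popt, _ = curve_fit(three_gaussians, x, profile, p0=p0)

modes = np.asarray(popt).reshape(3, 3)       # rows: (amplitude, mean, std)
dominant = modes[np.argmax(modes[:, 0])]     # the mode that is plotted per channel
print("dominant mode (amplitude, position, width):", dominant)
```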

In summary, we have to conclude that the overlap of the IR and RGB sensitivities of the sensor is indeed smaller than expected, which is bad news for any cross-modal matching attempt. However, in practice we often have light sources that cover a reasonable part of the spectrum, like the halogen lamp used above, and in addition typical materials reflect in a relatively broad and smooth spectrum. This justifies learning-based approaches like the one in [1] that can exploit correlations and smoothness of BRDFs.

4. Methods

Our aim is to increase the robustness as well as the computational efficiency of cross-modal stereo under projected patterns by learned filters. A simplistic scheme that is exclusively based on pixel-wise re-weighting of IR and RGB seems to be too limited. A learning-based version of this linear scheme was attempted in [1], and we also derived a weighting based on spectral measurements in the previous section.

As we want to stay in the realm of efficient patch-based stereo algorithms, we propose to extend the class of learned transformations to linear filters that leverage a pixel neighborhood in all channels to optimally preserve matches across modalities. These linear filters encompass smoothing, sharpening, and edge detection methods that have been shown useful as prefilters in stereo algorithms and can potentially alleviate problems with the projected pattern.

The core idea is to collect corresponding pairs of patches between IR and RGB images into a set S for the training step. We then use them to determine weights for each element of the IR and RGB patches so that corresponding patches have a smaller distance after the transformation. We formulate this as an optimization problem.

We denote the s-th corresponding pair of patches by $\{IR^s, C^s\} \in S$, where $C = \{r, g, b\}$ contains the three color channels of the RGB image. Assuming patches of size $n \times n$, we would like to obtain weights $\{w^{IR}_{i,j}, w^{C}_{i,j}\}$ for every pixel $\{IR^s_{i,j}, C^s_{i,j}\}$ at position $(i, j)$ within the IR and RGB patches $\{IR^s, C^s\}$. The resulting optimization problem reads:

$$\min_{w^{IR},\, w^{C}} \sum_{s \in S} \left\| \sum_{i=1}^{n} \sum_{j=1}^{n} w^{IR}_{i,j}\, IR^{s}_{i,j} \;-\; \sum_{C=r,g,b} \sum_{i=1}^{n} \sum_{j=1}^{n} w^{C}_{i,j}\, C^{s}_{i,j} \;+\; b \right\|_{1}$$
$$\text{subject to} \quad \sum_{C=r,g,b} \sum_{i=1}^{n} \sum_{j=1}^{n} w^{C}_{i,j} = 1. \tag{2}$$

where $b$ is an offset. By applying these weights to each color channel of the RGB image and to the IR image, we can transform the RGB image into an "IR-like" image and then use the same stereo matching algorithm as usual to compute the disparity maps. Note that this weighting procedure is equivalent to filtering the images. We display an instance of our proposed filtering procedure in Figure 8.
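Since Equation 2 becomes a linear program once the patches are stacked into arrays, it can be handed to any convex solver; the paper uses the MATLAB toolbox cvx [4] (see Section 5.1). The following is an equivalent sketch in Python with cvxpy, with random arrays standing in for the collected training patches:

```python
import numpy as np
import cvxpy as cp

n = 3        # patch size (n x n), as used in Section 5.1
S = 2000     # number of corresponding patch pairs (placeholder)

# Stand-ins for the collected training data: one flattened n*n patch per row.
IR = np.random.rand(S, n * n)
R, G, B = (np.random.rand(S, n * n) for _ in range(3))

w_ir = cp.Variable(n * n)
w_r, w_g, w_b = cp.Variable(n * n), cp.Variable(n * n), cp.Variable(n * n)
b = cp.Variable()

# Per-pair residual of Equation (2); the L1 norm sums absolute residuals over S.
residual = IR @ w_ir - (R @ w_r + G @ w_g + B @ w_b) + b
objective = cp.Minimize(cp.norm1(residual))
constraints = [cp.sum(w_r) + cp.sum(w_g) + cp.sum(w_b) == 1]

cp.Problem(objective, constraints).solve()
print(np.round(w_ir.value.reshape(n, n), 4))  # learned IR filter, cf. Equation (3)
```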

5. Experiments

In order to evaluate the effectiveness of our approach, we compare to the results from [1] in the same experimental setting. A clustering approach is used to segment objects in a table-top scenario. The dataset was deliberately designed to expose problems of the standard Kinect depth sensing scheme on materials that are, e.g., specular or transparent. The ground truth is given as 2D bounding boxes, and the PASCAL matching criterion is used, which requires the intersection over union between ground truth and detection bounding box to be larger than 0.5. The dataset consists of 106 objects in 19 scenes.
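The PASCAL criterion reduces to a few lines; a minimal sketch with boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct if iou(detection, groundtruth) > 0.5.
```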

5.1. Learning Filters

Given image pairs of IR and RGB images, our goal is to learn an optimal adaptation between IR and RGB images using the Kinect hardware without any modifications. We manually collect thousands of corresponding pairs of 3×3 patches between the low-resolution IR and RGB images under the influence of the IR projector. The patches are distributed over normal and difficult regions, including transparent, specular, and reflective surfaces. To solve the optimization problem in Equation 2, we use cvx [4], a MATLAB-based toolbox for convex optimization.

The resulting filters $w_r$, $w_g$, $w_b$, and $w_{IR}$ are as follows, with the offset $b = -56.6978$:

$$w_r = \begin{bmatrix} 0.1451 & 0.1900 & 0.1228 \\ -0.0354 & 0.0089 & 0.1244 \\ 0.1788 & 0.0985 & 0.1809 \end{bmatrix} \quad w_g = \begin{bmatrix} 0.1844 & -0.0806 & 0.1249 \\ 0.1866 & -0.1393 & 0.1129 \\ 0.1981 & -0.0841 & 0.0841 \end{bmatrix}$$

$$w_b = \begin{bmatrix} -0.0702 & -0.0260 & -0.0098 \\ -0.1430 & -0.0915 & -0.0600 \\ -0.0984 & -0.0654 & -0.0365 \end{bmatrix} \quad w_{IR} = \begin{bmatrix} 0.0049 & 0.0961 & -0.0006 \\ 0.1532 & -1.0000 & 0.1084 \\ -0.0018 & 0.0741 & -0.0062 \end{bmatrix} \tag{3}$$

Visualizations are provided in Figure 8.
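Applying the learned weights at every image position is a cross-correlation with the 3×3 weight matrices. Below is a sketch of this filtering step using the values from Equation (3); exact border handling and the role of the offset $b$ in our pipeline are not spelled out above, so the corresponding lines reflect one plausible reading:

```python
import numpy as np
from scipy.signal import correlate2d

# Learned 3x3 filters and offset from Equation (3)
w_r = np.array([[ 0.1451,  0.1900,  0.1228],
                [-0.0354,  0.0089,  0.1244],
                [ 0.1788,  0.0985,  0.1809]])
w_g = np.array([[ 0.1844, -0.0806,  0.1249],
                [ 0.1866, -0.1393,  0.1129],
                [ 0.1981, -0.0841,  0.0841]])
w_b = np.array([[-0.0702, -0.0260, -0.0098],
                [-0.1430, -0.0915, -0.0600],
                [-0.0984, -0.0654, -0.0365]])
w_ir = np.array([[ 0.0049,  0.0961, -0.0006],
                 [ 0.1532, -1.0000,  0.1084],
                 [-0.0018,  0.0741, -0.0062]])
b = -56.6978

def transform(rgb, ir):
    """Turn an RGB image (H x W x 3, float) into an 'IR-like' image and filter
    the IR image. The per-patch weighting of Eq. (2), applied at every image
    position, is a cross-correlation, hence correlate2d."""
    r, g, bl = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    ir_like = (correlate2d(r, w_r, mode="same")
               + correlate2d(g, w_g, mode="same")
               + correlate2d(bl, w_b, mode="same")
               - b)  # from Eq. (2): filtered IR ~ filtered RGB - b (one reading)
    ir_filtered = correlate2d(ir, w_ir, mode="same")
    return ir_like, ir_filtered  # inputs to the patch-based stereo matcher
```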

5.2. Evaluation

Our evaluation uses the same data as [1] and is consistent with their setting in order to ensure comparability. The stereo image pairs from the Kinect were obtained under two conditions: first with the IR projector covered, and second in the normal situation with the IR projector on.

Evaluating the different fusion schemes under these two settings, the best results previously obtained are an average precision of 69% with the IR projector on and 72.5% with the IR projector off, using a late fusion strategy.

Compared to the average precision of 46% for the built-in Kinect depth estimate, that method improves over the Kinect depth by more than 20%, but faces the practical issues mentioned in Section 4. In contrast, our method simply uses the early fusion scheme, applying the learned filters to IR-RGB images captured with the IR projector on, and still achieves an average precision of 71.5%.

The Precision-Recall curves of the late fusion scheme under the IR-projector-off setting, our proposed method, and Kinect-only are plotted in Figure 6.


Figure 7: Disparity maps computed from (a) the original image pair, (b) the IR-RGB image pair without the IR projector, and (c) images filtered by our trained filters.

Figure 6: Precision-Recall curves (recall over 1 − precision) for Kinect only (AP = 46.0), the spectral weighting (AP = 64.6), the best previous fusion scheme with IR projector on (AP = 69.0), our proposed method (AP = 71.5), and the best previous fusion scheme with IR projector off (AP = 72.5).

Our approach outperforms all their settings with the IR projector on and is on par with their best result with modified hardware. Also note that our method shows strong improvements in precision and produces its first false positives only at almost 40% recall, which is 10% more than the competing methods. In Figure 7 we show exemplary disparity maps computed from images in the IR-projector-off case and from images filtered by our proposed method.

5.3. Discussion

In the middle column of Figure 8, we present a visualization of the filters obtained from the optimization process. The filters of the red and blue channels resemble smoothing operators. For the green channel, the smoothing seems to be applied along the y-axis, while the x-axis direction resembles a Laplacian operator. The filter of the IR channel essentially amounts to a two-dimensional Laplacian operator.

6. Conclusions

We have presented a method to optimize filters for improved stereo correspondence between IR and RGB images that is robust to projected IR patterns. We have experimentally analyzed the spectral characteristics of the Kinect cameras in order to justify such an approach. Adapting RGB in the frequency domain to mimic an IR image did not yield improved performance; the small overlap between RGB and IR seems to prohibit this approach. In contrast, learning several filters based on image patches allowed improved stereo vision across modalities. We therefore conclude that our pre-filtered, cross-modal, SAD-based stereo vision algorithm profits most from a combination in the spatial domain rather than in the frequency domain. Our patch-based approach shows increased performance and improved robustness against IR-specific interference from the projector.

However, the Kinect hardware still does not allow us to capture RGB and IR simultaneously. This requires switching between the two channels in software and limits the frame rate to 2-3 fps. While sufficient for many applications, an increased frame rate would certainly be preferable.

Upon publication, we will make the source code of ourmethod available.

7. Acknowledgements

We would like to thank Ivo Ihrke for helpful discussionsand suggestions.

References

[1] W.-C. Chiu, U. Blanke, and M. Fritz. Improving the Kinect by cross-modal stereo. In BMVC, 2011.

Figure 8: (Left) Incoming RGB and IR images with IR projector on from the Kinect. (Top) Color channels of the RGB image (red, green, blue) and the learned filters $w_r$, $w_g$, $w_b$, $w_{IR}$ from the optimization; note that we zero-pad and up-sample the filters for better visualization. (Middle) The resulting filtered images for each color channel. (Right) The transformed RGB image (sum of the filtered images of the three color channels) and the filtered IR image.

[2] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. IJCV, 70(1):41-54, 2006.

[3] M. Fritz, M. Black, G. Bradski, S. Karayev, and T. Darrell. An additive latent feature model for transparent object recognition. In NIPS, 2009.

[4] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, Apr. 2011.

[5] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In CVPR, 2005.

[6] H. Hirschmüller and D. Scharstein. Evaluation of stereo matching costs on images with radiometric differences. TPAMI, 31(9):1582-1599, 2009.

[7] I. Ihrke, K. N. Kutulakos, H. P. A. Lensch, M. Magnor, and W. Heidrich. State of the art in transparent and specular object reconstruction. In STAR Proceedings of Eurographics, 2008.

[8] U. Klank, D. Carton, and M. Beetz. Transparent object detection and reconstruction on a mobile platform. In ICRA, 2011.

[9] K. Konolige. Small vision systems: Hardware and implementation. In Robotics Research: International Symposium, volume 8, pages 203-212. MIT Press, 1998.

[10] Y. Li and D. P. Huttenlocher. Learning for stereo vision using the structured support vector machine. In CVPR, 2008.

[11] M. Planck. On the law of distribution of energy in the normal spectrum. Annalen der Physik, 4(553):1, 1901.

[12] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.

[13] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In ECCV, 1994.