
Joint Inpainting of RGB and Depth Images by Generative Adversarial Network with a Late Fusion approach

Ryo Fujii, Ryo Hachiuma, Hideo Saito
Keio University, Japan

{ryo.fujii0112, ryo-hachiuma, hs}@keio.jp

Abstract

Task: Simultaneous RGB and depth image inpainting.
Input: An RGB image and a depth image with missing regions.
Output: The inpainted RGB and depth images.

We propose the first deep neural network that jointly inpaints RGB and depth images, letting each modality leverage the other's information.

Background

Diminished reality aims to remove real objects from images and fill in the removed regions with plausible textures.

• Multi-view observations [1, 2]
  ➢ Accurate
  ➢ Cannot restore unobserved areas; requires a multi-camera setup ✘
• Inpainting using pixels in the image [3]
  ➢ No need for another camera or recorded observations
  ➢ Inpainted regions must be predicted from the surrounding regions alone ✘

If the filled pixels are plausible, the inpainting-based methods have an advantage over the multi-view based methods.

Goal: Fill in the missing regions of both RGB and depth images with plausible textures and geometries.

Contribution: The output features of the RGB and depth encoders are added and used as the input to the fusion part. This late fusion lets each decoder use the other modality's features as complementary information.
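In symbols (notation introduced here for clarity; it does not appear on the poster), the late fusion can be written as:

```latex
% Notation ours: E_c, E_d are the RGB and depth encoders, F the fusion
% part, Dec_c and Dec_d the RGB and depth decoders.
\begin{equation}
  z = E_c(x_c) + E_d(x_d), \qquad
  \hat{y}_c = \mathrm{Dec}_c\bigl(F(z)\bigr), \qquad
  \hat{y}_d = \mathrm{Dec}_d\bigl(F(z)\bigr)
\end{equation}
```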


Method

[Figure: Network overview. The completion network consists of an RGB encoder, a depth encoder, a shared fusion part, an RGB decoder, and a depth decoder; it takes the input image and mask and produces the inpainted image. A global discriminator and a local discriminator each classify the result as real or fake.]


Loss Function

To optimize the network, we combine:
• an ℓ1 reconstruction loss
• the WGAN-GP loss [4]

WGAN-GP works well when combined with the ℓ1 reconstruction loss, as both are based on ℓ1 distance metrics (the Wasserstein-1 distance in the case of WGAN-GP). Training with WGAN-GP also converges faster and more stably.
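For concreteness, a sketch of the combined objective in the standard WGAN-GP form [4]; the masking convention and the weights λ_gp and λ_adv are our assumptions, since the poster names the loss terms but gives no coefficients. Here C is the completion network, D a discriminator, M the inpainting mask, x the ground-truth RGB-D image, ŷ = C(x, M) the completed image, and x̃ a random interpolate between x and ŷ:

```latex
% Sketch only; weights \lambda_{gp}, \lambda_{adv} and the masked-l1
% convention are assumptions, not given on the poster.
\begin{align}
  \mathcal{L}_{\ell_1} &= \bigl\lVert M \odot (\hat{y} - x) \bigr\rVert_1
    && \text{masked $\ell_1$ reconstruction} \\
  \mathcal{L}_{D} &= \mathbb{E}\bigl[D(\hat{y})\bigr] - \mathbb{E}\bigl[D(x)\bigr]
    + \lambda_{gp}\,\mathbb{E}\Bigl[\bigl(\lVert\nabla_{\tilde{x}} D(\tilde{x})\rVert_2 - 1\bigr)^{2}\Bigr]
    && \text{WGAN-GP critic loss [4]} \\
  \mathcal{L}_{C} &= \mathcal{L}_{\ell_1} - \lambda_{adv}\,\mathbb{E}\bigl[D(\hat{y})\bigr]
    && \text{completion network loss}
\end{align}
```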

Network Architecture

We extend the completion network proposed by Iizuka et al. [5].

Completion Network
• RGB encoder-decoder
• Depth encoder-decoder
• Fusion part

The extracted RGB and depth features are fused, and the fusion part employs residual dilated convolutional layers (yellow in the figure).

Discriminator Network
Input: a four-channel RGB-D image.
• Global discriminator: judges the consistency of the whole scene
• Local discriminator: assesses the quality of the small completed area
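A minimal PyTorch sketch of the architecture described above. Layer counts, channel widths, dilation rates, the mask-as-input convention, and the summed discriminator scores are all our assumptions; the poster only fixes the encoder/fusion/decoder split and the global/local discriminator pair.

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """Residual block with a dilated 3x3 convolution (fusion part)."""
    def __init__(self, ch: int, dilation: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.conv(x))

def encoder(in_ch: int, ch: int = 64) -> nn.Sequential:
    """Strided-conv encoder; depth and width are illustrative assumptions."""
    return nn.Sequential(
        nn.Conv2d(in_ch, ch, 5, stride=1, padding=2), nn.ReLU(inplace=True),
        nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    )

def decoder(out_ch: int, ch: int = 64) -> nn.Sequential:
    """Transposed-conv decoder back to full resolution."""
    return nn.Sequential(
        nn.ConvTranspose2d(4 * ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(ch, out_ch, 3, padding=1),
    )

class LateFusionCompletionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs: masked RGB (3) + mask (1), masked depth (1) + mask (1).
        self.enc_rgb = encoder(in_ch=4)
        self.enc_depth = encoder(in_ch=2)
        # Fusion part: residual dilated convolutions on the summed features.
        self.fusion = nn.Sequential(*[ResidualDilatedBlock(256, d) for d in (2, 4, 8)])
        self.dec_rgb = decoder(out_ch=3)
        self.dec_depth = decoder(out_ch=1)

    def forward(self, rgb, depth, mask):
        # Late fusion: element-wise addition of the two encoder outputs.
        f = self.enc_rgb(torch.cat([rgb, mask], 1)) + \
            self.enc_depth(torch.cat([depth, mask], 1))
        h = self.fusion(f)
        # RGB constrained to [0, 1]; depth left unbounded (choice ours).
        return torch.sigmoid(self.dec_rgb(h)), self.dec_depth(h)

class GlobalLocalDiscriminator(nn.Module):
    """Global + local critics on 4-channel RGB-D input (widths assumed)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        def critic(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(ch, 2 * ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(2 * ch, 4 * ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4 * ch, 1),
            )
        self.global_d = critic(4)   # whole RGB-D image: scene consistency
        self.local_d = critic(4)    # crop around the completed area

    def forward(self, rgbd_full, rgbd_patch):
        # Summed scores (a simplification; [5] concatenates features instead).
        return self.global_d(rgbd_full) + self.local_d(rgbd_patch)
```

For 160×160 inputs (the training resolution), `forward(rgb, depth, mask)` returns the completed 3-channel RGB and 1-channel depth images at full resolution.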

Result

Dataset: SceneNet RGB-D dataset [6], consisting of RGB-D images rendered from over 15K trajectories in synthetic layouts.

Training procedure:

• Training images: about 2 million images, resized to 160×160
• Mask: 1/8 to 1/4 of the original size (see the sampling sketch after this list)
• Batch size: 96 images
• Iterations: 75,000
• Training time: about 2 days on two NVIDIA GPUs (Quadro GV100 and P6000)
• Optimizer: Adam
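The poster does not say how masks are sampled. The sketch below assumes a single axis-aligned rectangular hole whose side lengths are drawn uniformly between 1/8 and 1/4 of the image side, which is purely our reading of "1/8 to 1/4 of original size":

```python
import torch

def sample_mask(size: int = 160, lo: float = 1 / 8, hi: float = 1 / 4) -> torch.Tensor:
    """Random rectangular hole mask; 1 marks missing pixels.

    The rectangle's width and height are drawn uniformly between
    lo*size and hi*size -- an assumption, since the poster only
    states "1/8 to 1/4 of original size".
    """
    mask = torch.zeros(1, size, size)
    h = int(torch.randint(int(lo * size), int(hi * size) + 1, (1,)))
    w = int(torch.randint(int(lo * size), int(hi * size) + 1, (1,)))
    top = int(torch.randint(0, size - h + 1, (1,)))
    left = int(torch.randint(0, size - w + 1, (1,)))
    mask[:, top:top + h, left:left + w] = 1.0
    return mask
```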

Qualitative evaluation:

[Figure: Qualitative results. Columns show the input RGB or depth image, the generated result, and the ground truth.]

• The late fusion approach makes the edges of each restored region clear.
• Depth completion sometimes fails, as shown in the third column of the figure above.
  ➢ In the most common failure case, a similar repeated texture appears.

Future work:
• Quantitative evaluation
  ➢ Compare our baseline model with an early fusion approach
• Examination of the input
  ➢ Use a normal map instead of a depth map
• Application
  ➢ Integrate this model into a virtual reality application with an HMD, filling the holes that appear when foreground objects occlude the background and the viewpoint changes with the user's head movements

References

[1] N. Kawai, T. Sato, and N. Yokoya. Diminished reality based on image inpainting considering background geometry. IEEE Trans. Vis. Comput. Graphics, 22:1–1, Jan. 2015.
[2] S. Mori, J. Herling, W. Broll, N. Kawai, H. Saito, D. Schmalstieg, and D. Kalkofen. 3D PixMix: Image-inpainting in 3D environments. In Adjunct Proc. of the IEEE ISMAR, 2018.
[3] S. Meerits and H. Saito. Real-time diminished reality for dynamic scenes. In Proc. of the IEEE ISMARW, pp. 53–59, 2015.
[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
[5] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Trans. Graphics (SIGGRAPH), 36(4):107:1–107:14, 2017.
[6] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation? In ICCV, 2017.
