rgb d video enhancement and applications · lu sheng, thesis oral defense. structures in a local...

Probabilistic Approaches for RGB-D Video Enhancement and Applications

Speaker: Lu Sheng
Supervisor: Prof. King Ngi Ngan

Lu Sheng, Thesis Oral Defense

Why Is RGB-D Data Essential?

RGB: 2D visual pattern Depth: 3D geometry

An RGB image cannot explicitly tell the computer the 3D structure of each object

Depth cannot tell us the texture patterns overlaid on the geometry

RGB + Depth helps us comprehensively understand the 3D visual world



Explosive growth of 3D applications

3D reconstruction Novel view synthesis

Virtual reality / Augmented reality 3DTV & FTV Refocus

Motion sensing / gesture recognition



Explosive growth of 3D applications

Autonomous navigation & safety

Personal & industrial robots

Scene understanding Pedestrian detection Action recognition

Recent Depth Acquisition Methods

Passive methods: stereo vision (left and right views), shape-from-shading, structure-from-motion

Drawbacks

Usually computationally intensive

Mediocre quality

Require simple or artificial shooting conditions.

Recent Depth Acquisition Methods

Kinect Time-of-flight camera Laser scanner

Compared to passive methods

Standard-resolution depth frames at video frame rate

More robust to difficult shooting conditions

Drawbacks

Poor depth quality keeps depth-based tasks from reaching their full potential


High Quality Depth Data are Important

A lot of applications require high quality depth data

Spatiotemporal depth video enhancement is necessary

Depth data cannot perform structural regularization on their own

If accompanied by synchronized RGB data

multi-modal structural features shared by texture and geometry enable guidance from the texture features to regularize the depth maps


Depth is NOT Texture

Depth links to the 3D geometry of the captured scene

Learn effective methods to encode these observations

Spatial relationships between objects

Depth ordering

Occlusion reasoning

Object segmentation

Geometric structures inside each object

Piecewise smoothness

Distinctive discontinuities


Goals

Explore effective ways to achieve robust spatiotemporal RGB-D video enhancement

Learn specific treatments compatible with 3D geometry for enhancement and depth-based applications

Employ probabilistic approaches to model these tasks


Hybrid Geometric Hole Filling Strategy

for Spatial Enhancement

Spatial RGB-D Enhancement


Introduction

Input: RGB-D images, upsampled raw depth image
Low resolution, noise & outliers, missing-depth holes, structure distortions

Goal: enhanced depth image
High definition, structure optimized, complete


Introduction

Observations

Co-occurrences between depth discontinuities and image edges

Homogeneous texture patterns have similar 3D geometries


Hybrid Geometric Hole Filling Strategy

Filtering-based Depth Interpolation

Segment-based Depth Propagation

Hole Filling

Depth Map Refinement

Input RGB-D pair Output RGB-D pair


Hole Partitioning

Up-sample low-resolution depth map into sparse grid

Pixels are divided into two parts

in hole region:

with depth values:

Further partition holes into two parts

based on valid depth pixels in its neighbors


Filtering-based Depth Interpolation

Filtering-based Depth Interpolation for region

Require enough depth info. in the neighbors to infer a reliable depth value

Joint Bilateral Filtering

[Figure: the filter fills the whole image; multiplying by the region mask keeps the filled values for the target region]
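A minimal sketch of joint bilateral filtering used for hole filling, assuming a single-channel guidance image and Gaussian spatial and range kernels. The function name and the parameters `radius`, `sigma_s`, and `sigma_r` are illustrative choices, not values from the slides.

```python
import numpy as np

def joint_bilateral_fill(depth, valid, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Fill invalid depth pixels with a weighted average of valid neighbors.

    Weights combine spatial nearness and similarity in the guidance image.
    """
    h, w = depth.shape
    out = depth.astype(float).copy()
    for y, x in zip(*np.nonzero(~valid)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        yy, xx = np.mgrid[y0:y1, x0:x1]
        w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
        w_r = np.exp(-((guide[y0:y1, x0:x1] - guide[y, x]) ** 2) / (2 * sigma_r ** 2))
        wgt = w_s * w_r * valid[y0:y1, x0:x1]   # only valid depth contributes
        if wgt.sum() > 1e-8:                    # enough depth info in the neighborhood
            out[y, x] = (wgt * depth[y0:y1, x0:x1]).sum() / wgt.sum()
    return out
```

Because the range kernel is evaluated on the guidance image, a hole next to an image edge is filled from the side of the edge it belongs to.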


Depth Propagation under Segment Constraint

Depth Propagation for region

Segment constraint

Depth variation is smooth in an over-segmented RGB patch

One parametric surface model in one patch

Generate segments

Superpixel – simple linear iterative clustering (SLIC)

Hole patch

Patch with known depth

Partially filled patch

After filling



Filling the partially filled patches by surface fitting with RANSAC
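A hedged sketch of surface fitting with RANSAC for one partially filled patch. It assumes the parametric surface is a plane `d = a*x + b*y + c` in image coordinates; the slides do not state the surface model, and `n_iter`, `tol`, and the function name are illustrative.

```python
import numpy as np

def ransac_plane(xs, ys, ds, n_iter=200, tol=0.5, seed=0):
    """Fit d = a*x + b*y + c to the known depth samples of a patch, ignoring outliers."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([xs, ys, np.ones(len(xs))]).astype(float)
    ds = np.asarray(ds, dtype=float)
    best_inliers = None
    for _ in range(n_iter):
        idx = rng.choice(len(ds), size=3, replace=False)
        try:
            params = np.linalg.solve(A[idx], ds[idx])
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        inliers = np.abs(A @ params - ds) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no non-degenerate sample found")
    # Refine with least squares on the consensus set.
    params, *_ = np.linalg.lstsq(A[best_inliers], ds[best_inliers], rcond=None)
    return params, best_inliers
```

Evaluating the fitted plane at the hole pixels of the same patch then fills the rest of the patch.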

Surface propagation for patches

Assign the surface model by finding its most similar RGB patch with known surface model in the neighborhood

The cost function models the statistical texture similarity and spatial distance

A greedy algorithm is exploited



Generate segments Fill in partially filled patches Fill in hole patches

Depth map refinement

Various filtering methods can be exploited here

A standard joint bilateral filtering is utilized for simplicity


Experimental Results

Middlebury dataset

Error metric: Bad Pixel Ratio (Δ𝑑 ≥ 1 as bad pixel)
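The metric above can be sketched as follows. The optional `mask` argument, which restricts the evaluation to selected pixels, is an assumption added for illustration.

```python
import numpy as np

def bad_pixel_ratio(estimate, ground_truth, threshold=1.0, mask=None):
    """Fraction of pixels whose absolute depth error is at least `threshold`."""
    err = np.abs(np.asarray(estimate, float) - np.asarray(ground_truth, float))
    bad = err >= threshold
    if mask is not None:
        bad = bad[mask]   # evaluate only where the mask is True
    return float(bad.mean())
```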

Figure columns: RGB images, depth images, ground truth, Multi-res JBU [1], Wang et al. [2], proposed method

Bad pixel ratio (BP) for the two scenes shown:

Scene   Multi-res JBU [1]   Wang et al. [2]   Proposed method
1       8.35%               3.65%             3.33%
2       14.10%              3.10%             2.51%

[1] C. Richardt et al., Coherent spatiotemporal filtering, upsampling and rendering of RGBZ videos, CGF, 2012.
[2] L. Wang et al., Stereoscopic inpainting: Joint color and depth completion from stereo images, CVPR, 2008.


Weighted Structure Filters

Based on Parametric Structural Decomposition

Spatial RGB-D Enhancement


Introduction

A variety of popular image filters are related to the local statistics of the input image

Median filter: take the half point (0.5 level) of the cumulative local distribution

Mode filter: seek the global mode of the local distribution

Average filter: estimate the expectation of the local distribution
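The three filters above can be read off one local histogram. A small sketch, assuming an integer-valued patch; the function name and `n_bins` are illustrative.

```python
import numpy as np

def local_statistics(patch, n_bins=256):
    """Return (median, mode, mean) of an integer-valued patch from its histogram."""
    hist = np.bincount(np.asarray(patch).ravel(), minlength=n_bins).astype(float)
    pdf = hist / hist.sum()
    cdf = np.cumsum(pdf)
    median = int(np.searchsorted(cdf, 0.5))          # first level where the CDF reaches 0.5
    mode = int(np.argmax(pdf))                       # global peak of the distribution
    mean = float(np.dot(np.arange(len(pdf)), pdf))   # expectation of the distribution
    return median, mode, mean
```

For the patch `[10, 10, 10, 12, 200]` the median and mode stay at 10 while the mean is pulled to 48.4 by the outlier, which is why median and mode filters are preferred for outlier-prone data such as depth.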


Introduction

Provided with a guidance feature map

Image intensity, patches, edge maps, …

These filters can be extended to joint weighted filters

Propagate local feature statistics into the target image

Various applications

Enhancement / de-noising / style manipulation / structure decomposition ...


Introduction

Disparity enhancement

Image denoising

JPEG artifact removal

Contrast enhancement

Image stylization

Joint depth upsampling


Weighted Distribution Estimation

The weighted distribution is

encodes both the spatial nearness and range affinity

measures the data compatibility

Brute-force implementation is of high computational cost

Computational cost depends on the number of samples 𝑔𝑖

✓ Hundreds of filtering operations are required to output a satisfactory distribution

✓ How can the cost be reduced without distorting the distribution?



Structures in a Local Patch

[Figure: example image patch with labeled structures: cloud, object, tower, sky]

A patch of a natural image does not contain a large number of structures

Nearby patches share similar structures

Two pixels are similar if they both have high likelihoods to the same local structures

It is possible to construct the distribution of a local patch by a mixture model

A Probabilistic Kernel

Conventional kernel for data compatibility

Assume the image is conveyed by several (e.g. 𝐿) structures throughout the image domain

Measure the difference between 𝑓𝑥 and 𝑓𝑦


A Probabilistic Kernel

Each structure is a probabilistic model

Two pixels are similar if they both have high responses to the 𝑙-th model

Assemble all models

Gaussian distribution with noise std


Weighted Distribution Estimation

Kernel

Gaussian, Kronecker delta, etc.

Distribution Estimation

Kernel

Local structure similarity

Distribution Estimation

Conventional Distribution The Proposed Distribution

Need hundreds of filtering operations

Only 𝐿 filtering operations to get 𝜓𝐱 𝑙 , 𝑙 ∈ ℒ

A mixture model!
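A hedged sketch of the idea: each pixel is softly assigned to 𝐿 Gaussian structure models, and one spatial filtering pass per model yields the mixture weights of the local distribution. It assumes uniformly quantized centers, a shared standard deviation, and a plain box filter as the spatial weight; the slides use edge-aware weights, and all function names here are hypothetical.

```python
import numpy as np

def box_filter(img, radius):
    """Mean over a (2r+1) x (2r+1) window with edge replication."""
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=float)
    k = 2 * radius + 1
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / k ** 2

def mixture_weights(image, centers, sigma, radius=2):
    """Per-pixel mixture weights, one filtering operation per structure model."""
    resp = np.exp(-(image[..., None] - centers) ** 2 / (2 * sigma ** 2))
    resp /= resp.sum(axis=-1, keepdims=True)   # soft assignment of each pixel to the models
    return np.stack([box_filter(resp[..., l], radius) for l in range(len(centers))], axis=-1)

def approximate_mode(psi, centers):
    """Approximate mode of the mixture: center of the heaviest component."""
    return centers[np.argmax(psi, axis=-1)]
```

With 6 models this needs 6 filtering passes, instead of one pass per sampled value as in the brute-force estimation.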


Gaussian Models for the Local Structures

Gaussian distribution to define the models for the local structures

Uniformly Quantized Models (UQM)

Locally Adaptive Models (LAM)



Estimation of the Locally Adaptive Models

Hierarchical Clustering by Binary Space Partition Tree

[Figure: binary space partition tree with nodes 1 to 7; splits 𝑆1, 𝑆2, 𝑆3 send samples to their + or − side]
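A rough sketch of hierarchical clustering by binary partition for estimating locally adaptive Gaussian models. The split rule (threshold at the node mean) and the stopping rule (`max_std`, `min_size`) are assumptions for illustration; the slides only state that a binary space partition tree and a manual stopping threshold are used.

```python
import numpy as np

def bsp_gaussian_models(samples, max_std=0.05, min_size=2):
    """Recursively split samples in two until each leaf is compact.

    Returns a list of (weight, mean, std) tuples sorted by mean.
    """
    samples = np.asarray(samples, dtype=float)
    models = []
    stack = [samples]
    while stack:
        s = stack.pop()
        mean, std = s.mean(), s.std()
        if std <= max_std or len(s) < 2 * min_size:
            models.append((len(s) / len(samples), mean, std))
            continue
        # The "-" and "+" sides of the split; both are non-empty because std > 0.
        stack.extend([s[s <= mean], s[s > mean]])
    return sorted(models, key=lambda m: m[1])
```

Homogeneous inputs such as disparity maps stop after very few splits, which matches the larger speedups reported for them.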


Experimental Results & Discussions

The speedup of the proposed method

The gain is generally
2~4x faster for grayscale images
6~12x faster for color images
Even faster for disparity maps or cartoon-style images due to their high structural homogeneity
A manual threshold stops model generation

Runtime comparison

Estimate the necessary LAM models on the BSD3000 dataset



Application-I: Disparity Enhancement (error metric: RMSE)

[Figure: disparity enhancement results with runtimes ~16s, ~4s, and <1s]



Application-I: Disparity Enhancement

Cover more details & avoid staircase artifacts

Although a small number of LAM models cannot cover all the details, it is still superior to the UQM models


Raw Color frame

Spatial filter Spatiotemporal filter



Application-II: JPEG Block Artifact Removal

Produces piecewise-smooth results and reduces staircase artifacts without distorting necessary structures



Application-III: Contrast Enhancement

source image

after structure-preserving filtering

after detail enhancement



Application-IV: Joint Depth Map Upsampling


Spatiotemporal Enhancement

based on Static Structure

Spatiotemporal RGB-D Enhancement


Introduction

A raw depth video of a natural scene

Contains various complex and even unpredictable dynamic contents

Suffers from spatial and temporal artifacts

Raw Kinect video

Color-coded Raw TOF video


Introduction

After the spatial enhancement

Reduces artifacts in the spatial domain

But introduces temporal flickering

No temporal consistency

Aggravates flickering artifacts

Raw Kinect video

Spatial JBF


Introduction

After a conventional spatiotemporal enhancement

Still contains temporal flickering

Distorts depth variation on dynamic objects

Coherent spatiotemporal JBF

Spatial JBF

How can temporal flickering be eliminated without distorting the necessary depth variation along dynamic objects?


Static Structure

[Figure: Kinect or another depth camera observing a moving object, a static object, and the static background; a second panel shows the captured depth map]


Static Structure

Intrinsic structure underneath the captured scene

lies on or behind the surface of the input depth frame

A probabilistic medium to indicate whether a region is static

[Figure: a moving object, a static object, and the static structure]


Static Structure

Simple observations

Moving objects stay in front of it

Static regions or visible background area are fused into it



Static Structure Spatiotemporal Enhancement

Robust static/dynamic region detection by the static structure

Spatiotemporally enhance the static region with the static structure

Spatially optimize the dynamic foreground

Temporally coherent for static regions, with depth variation preserved for dynamic contents

How to estimate static structure?


Generative Model for Static Structure

[Figure: camera center, line of sight, and the current static structure, with the regions before and behind the structure]

A Probabilistic Generative Model




If the incoming depth belongs to

State-I: the static structure

State-II: outliers in front or moving objects

State-III: outliers rearward or revealed background

The indicator function used in States II and III equals 1 when its argument is true and 0 otherwise




The likelihood of the incoming depth w.r.t. the given static structure

Gaussian prior over the static structure

Dirichlet prior over the frequency of each state
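A hedged sketch of the three-state likelihood for one incoming depth along one line of sight. It assumes a Gaussian around the static structure for State-I and uniform densities gated by the indicator function for States II and III; the exact densities are not recoverable from the slides, and the function and parameter names are hypothetical.

```python
import numpy as np

def state_responsibilities(d, z, sigma, weights, depth_range):
    """Posterior probability of the three states for an incoming depth d.

    z: depth of the static structure on this line of sight.
    weights: (w_static, w_front, w_rear) state frequencies, summing to 1.
    depth_range: (d_min, d_max) of the sensor, with d_min < z < d_max.
    """
    d_min, d_max = depth_range
    p_static = np.exp(-0.5 * ((d - z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p_front = float(d < z) / (z - d_min)   # State-II: uniform in front of the structure
    p_rear = float(d > z) / (d_max - z)    # State-III: uniform behind the structure
    joint = np.asarray(weights) * np.array([p_static, p_front, p_rear])
    return joint / joint.sum()
```

A depth close to `z` is explained by the static structure, a much smaller depth by a front outlier or moving object, and a much larger depth by a rearward outlier or revealed background.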




Online Update Scheme


The posterior

is the set of previous depth samples

is the set of current samples


If the input frame only contains the static scene and outliers, the updated static structure will be governed by the posterior, and we have

Its probable depth is

The reliability of the model is

Variational approximation for efficiency

Updated static structure


Layer Assignment

Label the input depth frame into three layers

𝑙𝑖𝑠𝑠: agree with estimated static structure

𝑙𝑑𝑦𝑛: belong to dynamic objects

𝑙𝑜𝑐𝑐: refer to the previously occluded structure

𝑙𝑖𝑠𝑠 and 𝑙𝑜𝑐𝑐 define the current static regions

Fully Connected Conditional Random Fields with effective inference based on real-time high-dimensional filters

𝒍𝒊𝒔𝒔

𝒍𝒅𝒚𝒏

𝒍𝒐𝒄𝒄


Layer Assignment & Online Update of the Static Structure

[Figure: frames #1 to #5; rows (a) raw depth, (b) raw color, (c) layer assignment, (d) depth static structure, (e) color static structure]

Layer Assignment & Online Update of the Static Structure

[Figure: frames #1 to #5; rows (a) raw depth, (b) layer assignment, (c) depth static structure]


Spatiotemporal Depth Video Enhancement

[Diagram blocks: Input data (𝑡), Layer Assignment, Variational Approximation, Spatial Enhancement, Static Structure (𝑡 − 1), Static Structure (𝑡), Enhanced depth frame (𝑡). Layer assignment and variational approximation form the Online Static Structure Updating Scheme.]

Result Comparisons

(a) Raw RGB-D videos

(b) Proposed method (c) Lang et al. [3]

[1] C. Richardt et al., “Coherent spatiotemporal filtering, upsampling and rendering of RGBZ videos,” Computer Graphics Forum, 2012.

[2] D. Min et al., “Depth video enhancement based on weighted mode filtering,” TIP, 2012.

[3] M. Lang et al., “Practical temporal consistency for image-based graphics applications,” TOG, 2012.

Superior in static scene reconstruction and dynamic object enhancement


Result Comparisons

(a) Raw RGB-D videos (b) Proposed method

(c) CSTF [1] (d) WMF [2] (e) Lang et al. [3]

Color frames

Depth frames

CSTF [1]

WMF [2]

Lang et al. [3]

Ours

Close-ups

Result Comparisons





dyn_kinect_1 dyn_kinect_2


Result Comparisons

(a) dyn_kinect_2 (b) dyn_kinect_3

Color

Depth

Lang et al. [3]

Ours

dyn_kinect_1 dyn_kinect_2


Applications

Application-I: Background Subtraction

color image by raw depth image by the proposed method

Lang et al. [3] CSTF [1] WMF [2]

Applications

Application-II: Novel View Synthesis

(a) color image (b) raw depth image (c) enhanced depth image

(d) by raw depth image (e) by static structure (f) by enhanced depth image

A Generative Model

for Robust 3D Facial Pose Tracking

Depth-based Application


Introduction

Why is facial pose tracking interesting?

Immersive Video Communication

3DTV & Free-viewpoint TV

VR / AR, etc.

With expression added?

Image/Video Editing

Performance capture, etc.


Introduction

How to make it

Markerless

No explicit or manual markers

Realtime

Cannot afford sophisticated correspondence estimation & face shape representation

Robustness and Smoothness

Robust to illumination variations, occlusions & outliers

Robust to varying facial expressions

Temporally coherent tracking

Adaptive to any user on-the-fly without manual calibration


Introduction

RGB based facial pose tracking has been successfully performed under optimally constrained scenes

It is fragile for unconstrained capturing conditions

Illumination variations

Shadows

Large and severe occlusions

Common in numerous consumer-level applications


Introduction

Commodity real-time range sensors

Explicitly tell the spatial relationships

Unaffected by illumination variations & shading

Easier inference for occlusions

BUT new challenges arise
Noise, missing values & outliers
Complex occlusions
Varying expressions
Online user adaptation


The Proposed Method

A framework that

unifies pose tracking and face model adaptation on-the-fly

offers accurate, occlusion-aware and uninterrupted 3D facial pose tracking

A visibility constrained criterion for

correspondence-free and occlusion-aware rigid facial pose estimation

A generative multilinear face model

models both identity and expression

facilitates the online face model personalization without the interference caused by the expression variations


Probabilistic 3D Face Parameterization

Multilinear Face Model

Unifies the representations of identity and expression

Models the face dataset as a 3D tensor

Decomposes it by High-order singular value decomposition

Any face can be reconstructed as
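A minimal sketch of a multilinear face model built with a high-order SVD, assuming the dataset is arranged as a tensor of shape (vertex coordinates, identities, expressions). In this sketch the vertex mode is left uncompressed and only the identity and expression modes are decomposed; the function names and the truncation sizes `n_id` and `n_exp` are illustrative.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t along the given mode."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd_face_model(data, n_id, n_exp):
    """Return the core tensor and the identity / expression factor matrices."""
    u_id = np.linalg.svd(unfold(data, 1), full_matrices=False)[0][:, :n_id]
    u_exp = np.linalg.svd(unfold(data, 2), full_matrices=False)[0][:, :n_exp]
    # Core tensor: project the data onto the identity and expression bases.
    core = np.einsum("vie,ia,eb->vab", data, u_id, u_exp)
    return core, u_id, u_exp

def reconstruct(core, w_id, w_exp):
    """Face = core contracted with an identity weight vector and an expression weight vector."""
    return np.einsum("vab,a,b->v", core, w_id, w_exp)
```

Without truncation, `reconstruct(core, u_id[i], u_exp[j])` reproduces training face `i` with expression `j`; other weight vectors interpolate new identities and expressions.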



Generative models for face modeling

Model the uncertainties of the shape, identity, and expression

Feasible to simulate and predict the face identity and expression

Enable group-wise rigid facial pose estimation suitable for any faces

The generative face model can be learned from a training dataset

FaceWarehouse Dataset

150 identities, 47 expressions
Different ages, genders, races ...
Its diversity lets the learned face model cover most common identities and expressions



Identity and Expression Priors

Multilinear Gaussian Face Model

Learned from the FaceWarehouse dataset together with the core tensor

[Figure: (a) mean face; (b), (c), (d) variance maps, in mm]

Probabilistic Facial Pose Tracking

Rigid Pose Tracking

Identity Adaptation

Input

Output

Identity distribution

Pose Parameters Face Model


Robust Facial Pose Estimation

Transform a canonical face model to match the input point cloud

The warped face model has the distribution

[Figure: face model in canonical coordinates mapped to the input point cloud by scale, rotation, and translation]
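A small sketch of how per-point Gaussians transform under the scale, rotation, and translation named above: for `x -> s R x + t`, the mean maps to `s R mu + t` and the covariance to `s^2 R Sigma R^T`. Representing the face model as one Gaussian per point is an assumption made for illustration.

```python
import numpy as np

def warp_gaussian_points(means, covs, scale, rotation, translation):
    """Apply x -> s R x + t to per-point Gaussians N(mean, cov).

    means: (n, 3), covs: (n, 3, 3), rotation: (3, 3), translation: (3,).
    """
    new_means = scale * means @ rotation.T + translation
    new_covs = scale ** 2 * np.einsum("ij,njk,lk->nil", rotation, covs, rotation)
    return new_means, new_covs
```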


Ray Visibility Constraint

Occlusions are inevitable in uncontrolled scenarios

Occluded human faces are always behind the occluding objects, such as hair, fingers/gestures, glasses, accessories

Self-occlusion Occluded by hair

Occluded by hand/gesture Occluded by accessories




If correctly aligned

the visible face model points are those that overlap with the input point cloud

the remaining face model points should always be occluded by the input point cloud

(a) Case-I (b) Case-II (c) Case-III

Face point is visible Face point is occluded

Should be prevented



Connect point pair along a ray

their distance along the surface of the input data

The distribution of one face model point is mapped along the surface normal direction

The face model point is visible

The face model point is occluded

[Figure: camera, line of sight, and face distribution for the visible and occluded cases]



Ray Visibility Score

Measures the compatibility between the distributions of the face model and the input point cloud

Applies the Kullback-Leibler Divergence

data distribution

projected model distribution

Minimizing the ray visibility score yields the optimal compatibility between these two distributions

Solver: quasi-Newton method, further refined by particle swarm optimization

Occlusions receive constant penalties

Visible points penalize misalignment & model uncertainties

More robust than an ICP-based cost function
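A hedged sketch of a score in this spirit, assuming 1D Gaussians along each ray for both the data and the projected model. The KL direction, the occlusion test (model point behind the data by more than `occlusion_margin` standard deviations), and the penalty value are assumptions; only the closed-form Gaussian KL divergence is standard.

```python
import numpy as np

def kl_gaussian_1d(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) in closed form."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def ray_visibility_score(data_depth, data_var, model_depth, model_var,
                         occlusion_margin=3.0, occlusion_penalty=1.0):
    """Mean per-ray cost: constant for occluded points, KL divergence for visible ones."""
    data_depth = np.asarray(data_depth, float)
    model_depth = np.asarray(model_depth, float)
    sigma = np.sqrt(data_var + model_var)
    occluded = model_depth - data_depth > occlusion_margin * sigma
    kl = kl_gaussian_1d(data_depth, data_var, model_depth, model_var)
    return float(np.where(occluded, occlusion_penalty, kl).mean())
```

A model point far behind the data costs only the constant penalty, while a model point in front of the data is treated as visible and pays a large KL cost, which discourages that configuration.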



Result comparison with the generic face model

(a) Color image (b) Point cloud (c) Initial alignment

(d) ICP (e) RVC + ML (f) RVS (g) RVS + PSO



More results with the generic face model

(a) Color image (b) Point cloud (c) Initial alignment (d) Ours

no explicit correspondences

handle occlusions even with a poor initial pose

less vulnerable to bad local minima

PSO increases the robustness


Online Identity Adaptation

Variational Approximation

The face model is identified by the identity distribution

It can be estimated online through assumed density filtering (ADF)

The data likelihood
A mixture distribution encoding the model and outliers
The model fitting function is robust to quantization with a modified projection distance

The variance of identity is enlarged per frame to prevent overfitting

Online Identity Adaptation

(a)

(b)

(c)

Results of online model adaptation


Experiments on public depth-based facial pose datasets

Biwi dataset ICT-3DHP dataset

Dataset    𝑵𝒔𝒆𝒒   𝑵𝒇𝒓𝒎   𝑵𝒔𝒖𝒃𝒋   occlusions          expressions            𝝎𝒎𝒂𝒙
Biwi       24     ~15K   25      accessories, hair   neutral ~ slight       ±75 yaw, ±60 pitch
ICT-3DHP   10     ~14K   10      accessories, hair   slight ~ exaggerated   ±75 yaw, ±45 pitch



Robust to profiled faces due to large rotations and occlusions from hair and accessories.

profiled face profiled face occlusions

occlusions expressions profiled face occlusions



The proposed system is also effective under expression variations

Ray visibility constraint

efficiently infers the occlusions against the face model

optimizes the visible face area against the occlusions

Personalized face model

enables compact fitting

robust to changes in the personalized expressions



Adaptation between different users

Three different identities are presented in three adjacent frames



Comparison with the state of the art

Biwi dataset (errors)

Method   Yaw (deg)   Pitch (deg)   Roll (deg)   Trans (mm)
Ours     2.3         2.0           1.9          6.9
RF       8.9         8.5           7.9          14.0
Martin   3.6         2.5           2.6          5.8
CLM-Z    14.8        12.0          23.3         16.7
TSP      3.9         3.0           2.5          8.4
PSO      11.1        6.6           6.7          13.8
Meyer    2.1         2.1           2.4          5.9
Li*      2.2         1.7           3.2          -

ICT-3DHP dataset (errors)

Method   Yaw (deg)   Pitch (deg)   Roll (deg)
Ours     3.4         3.2           3.3
RF       7.2         9.4           7.5
CLM-Z    6.9         7.1           10.5
Li*      3.3         3.1           2.9

*This method is based on RGB-D data

Discriminative: RF
Model fitting: CLM-Z, PSO, Martin et al., Meyer et al.
Feature-based: TSP
RGB-D: Li*


Conclusion


Conclusions

Hybrid Geometric Hole Filling Strategy for Spatial Enhancement

• Hybrid hole filling merging the interpolation and parametric structure propagation

• A novel texture-constrained patch matching method for a robust structure inference

Weighted Structure Filters Based on Parametric Structural Decomposition

• An efficient distribution estimation that is adaptive to local image structure

• Accelerating joint weighted filters without structural distortions


Conclusions

Spatiotemporal Enhancement based on Static Structure

• Robust temporally consistent depth enhancement based on a probabilistic static structure of the captured scene

• The dynamic content is enhanced spatially while the static region favors a long-range spatiotemporal optimization

A Generative Model for Robust 3D Facial Pose Tracking

• A robust depth-based facial pose tracking system with an adaptive face model personalization

• The multilinear generative face model and the visibility-constrained rigid pose estimation improve the robustness


Publications

Lu Sheng, King Ngi Ngan, Chern-Loon Lim and Songnan Li, Online Temporally Consistent Indoor Depth Video Enhancement via Static Structure, TIP, 2015.

Songnan Li, King Ngi Ngan, Raveendran Paramesran and Lu Sheng, Real-time Head Pose Tracking with Online Face Template Reconstruction, TPAMI, 2016.

Lu Sheng, Tak-Wai Hui and King Ngi Ngan, Accelerating the Distribution Estimation for the Weighted Median/Mode Filters, ACCV, 2014.

Lu Sheng, Songnan Li and King Ngi Ngan, Temporal Depth Video Enhancement Based On Intrinsic Static Structure, ICIP, 2014.

Lu Sheng, King Ngi Ngan and Songnan Li, Depth Enhancement Based On Hybrid Geometric Hole Filling Strategy, ICIP, 2013.

Chi Ho Cheung, Lu Sheng and King Ngi Ngan, A disocclusion filling method using multiple sprites with depth for virtual view synthesis, ICMEW, 2015.

Songnan Li, King Ngi Ngan and Lu Sheng, Screen-camera Calibration Using a Thread, ICIP, 2014.

Songnan Li, King Ngi Ngan and Lu Sheng, A Head Pose Tracking System Using RGB-D Camera, ICVS, 2013.

Lu Sheng, Jianfei Cai and King Ngi Ngan, A Generative Model for Robust 3D Facial Pose Tracking, TIP, in preparation.

Lu Sheng and King Ngi Ngan, Weighted Structural Prior for Structure-preserving Image and Video Applications, TIP, in preparation.

Thanks to

My supervisor Prof. King Ngi Ngan

Prof. Jianfei Cai

Committee members Prof. Wai Kuen Cham, Prof. Thierry Blu, and Prof. Kwanghoon Sohn

My lovely IVP labmates

& My sweet family!



Cost function construction

Randomly select 𝑘 sub-patches in each patch

Estimate similarity between two sub-patches

Calculate the cost of the 𝑗-th sub-patch of 𝑢 with 𝑣, and find the 𝑣∗ patch with the minimum cost

Form a histogram indicating the number of sub-patches in 𝑢 that matches with 𝑣

Add spatial constraint, the cost is
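The steps on this slide can be sketched as below. The sub-patch similarity (Euclidean distance between feature vectors), the way votes and spatial distance are combined, and the weight `lam` are all assumptions, since the cost formula itself is not recoverable from the slide; the function name is hypothetical.

```python
import numpy as np

def match_patch(u_subs, candidates, u_center, cand_centers, lam=0.01):
    """Pick the candidate patch v* that best matches the hole patch u.

    u_subs: (k, f) features of k sub-patches sampled in u.
    candidates: list of (k, f) arrays, one per candidate patch v.
    """
    votes = np.zeros(len(candidates))
    for sub in u_subs:
        # Cost of this sub-patch against each v: distance to its closest sub-patch.
        costs = [np.min(np.linalg.norm(v - sub, axis=1)) for v in candidates]
        votes[int(np.argmin(costs))] += 1   # histogram of winning patches
    spatial = np.linalg.norm(
        np.asarray(cand_centers, float) - np.asarray(u_center, float), axis=1)
    cost = -votes / len(u_subs) + lam * spatial   # texture votes plus spatial constraint
    return int(np.argmin(cost)), cost
```

In a greedy propagation, the hole patch would take over the surface model of the returned candidate and then serve as a candidate for its own unfilled neighbors.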



Kernel Specification

Distribution is a mixture of Gaussian models

Constant time filter: Domain transform filter [1] Guided image filter [2]

[1] E. Gastal and M. Oliveira, ACM ToG 2011
[2] K. He et al., ECCV 2010

Noise variance



Noise std



Variational Parameter Estimation

Factorize the posterior into independent Gaussian and Dirichlet distributions

The reliability of the model

The probable depth is

The posterior can be approximated by

Recursive estimation is possible!



Moment matching to estimate the hyperparameters

Closed-form solutions!
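A numerical illustration of moment matching for the Gaussian part of the posterior. The slides use closed-form solutions; this sketch instead evaluates the product of the Gaussian prior and an assumed three-state likelihood (Gaussian inlier term plus uniform front and rear outlier terms) on a grid and matches the first two moments. The state weights are held fixed here, so the Dirichlet update is not shown, and all names are hypothetical.

```python
import numpy as np

def moment_match_update(mu, var, d, sigma, weights, depth_range, n_grid=4001):
    """One recursive update of the Gaussian over the static-structure depth."""
    d_min, d_max = depth_range
    z = np.linspace(d_min, d_max, n_grid)[1:-1]   # open interval avoids division by zero
    prior = np.exp(-0.5 * (z - mu) ** 2 / var)
    w_s, w_f, w_r = weights
    lik = (w_s * np.exp(-0.5 * ((d - z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
           + w_f * (d < z) / (z - d_min)
           + w_r * (d > z) / (d_max - z))
    post = prior * lik
    post /= post.sum()
    new_mu = float((z * post).sum())                     # first moment
    new_var = float(((z - new_mu) ** 2 * post).sum())    # second central moment
    return new_mu, new_var
```

Feeding the returned mean and variance back in as the next prior gives the recursive estimation: an inlier sample pulls the mean toward it and shrinks the variance, while an outlier leaves the estimate almost unchanged.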

