
KLIEP-based Density Ratio Estimation for Semantically Consistent Synthetic to Real Images Adaptation in Urban Traffic Scenes

Artem Savkin¹,² (TUM, BMW)
Federico Tombari¹,³ (TUM, Google)

Abstract— Synthetic data has been applied in many deep learning based computer vision tasks. The limited performance of algorithms trained solely on synthetic data has been approached with domain adaptation techniques such as the ones based on the generative adversarial framework. We demonstrate how adversarial training alone can introduce semantic inconsistencies in translated images. To tackle this issue we propose a density prematching strategy using a KLIEP-based density ratio estimation procedure. Finally, we show that the aforementioned strategy improves the quality of translated images of the underlying method and their usability for the semantic segmentation task in the context of autonomous driving.

I. INTRODUCTION

The transition of deep learning from a mere research topic to application in a wide spectrum of industrial tasks has made the availability of comprehensive training data exceptionally crucial. Certain safety critical contexts additionally have particular requirements on reliability. In deep learning based computer vision, the common approach to achieve such a capacious training corpus would be simply to acquire and label more data whenever a specific case needs to be covered. For autonomous driving systems that means driving specific scenarios and improving models on newly captured data. However, due to high costs (new scenarios have to be driven and manually labeled), corner cases (rare to capture) and near-accident scenarios (ethical issues), this strategy is not always fully applicable in autonomous driving.

In this regard synthetically generated data seems to be a natural solution to the stated problem. The straightforward approach would be to utilize rendering engines to generate data which can be used in computer vision tasks. This not only could potentially extend the variability of training data at reduced cost but also minimize the manual effort of labeling data. Thus many researchers focused on utilizing 3D rendered imagery in their approaches to computer vision tasks [45]. Although rendered training data provides an opportunity to simulate various scenarios, it reveals limited applicability in real-world environments. In machine learning one commonly considers training and validation data to be independent and identically distributed (iid). This, however, clearly does not hold for the synthetic-real setup, as even photo-realistically rendered images reveal a bias toward the underlying domain. Deep models trained solely on rendered images show poor performance when evaluated on real data [33].

¹ TU Munich, Boltzmannstr. 3, 85748 Munich (Germany); [email protected]; [email protected]
² BMW AG, Petuelring 130, 80809 Munich (Germany)
³ Google, Brandschenkestrasse 110, 8002 Zurich (Switzerland)

Fig. 1. Example of semantic inconsistency introduced by adversarial training under covariate shift (left: original synthetic, right: translated to real).

This situation is commonly referred to as domain shift and is considered to be the main reason for such performance. The particular case where the input distribution for a model changes is referred to as covariate shift and is addressed by means of domain adaptation. Recent domain adaptation techniques improve performance compared to models trained synthetically but still cannot achieve same-domain results.

State-of-the-art domain adaptation methods such as DTN [44], FCN ITW [18] or DualGAN [51] rely on generative adversarial networks [11], which employ adversarial training for translating between source and target domains [19]. During such training two networks, a generator and a classifier (discriminator), play a minimax game where the first one learns to conduct certain perturbations of the input samples from the source domain $\{x_i^s\} \in D_s$ so that the discriminator cannot distinguish them from the target domain samples $\{x_j^t\} \in D_t$. Thus the GAN indirectly imposes the target distribution upon the generated distribution [11]. Adversarial training, while very efficient in adaptation tasks, is subject to covariate shift itself, and it does not guarantee that the non-linear transformation performed by the generator keeps the underlying semantic structure of the source inputs unchanged. Regularities in


the target data learned by the discriminator are implicitly inflicted on the generated samples.

Examples where the adversarial network translates samples in a semantically inconsistent way can be observed in figure 1. Here one can see vegetation patches imposed on sky regions or road users removed from the traffic scene. As seen in figure 1, the network introduces semantically mismatching artifacts in order to reconstruct the target distribution. Such mutations in the semantic layout of an image reduce the usability of generated data for computer vision tasks, e.g. semantic segmentation or detection. Semantically inconsistent adaptation is especially critical in the area of traffic scene understanding as it produces unreliable training data.

Multiple works investigated ways to mitigate this problem and ensure that the macro-structure of translated images remains consistent. They introduced dedicated constraints such as a self-regularization loss [38], a semantic consistency loss [53], regularization by enforcing bijectivity [18], modeling a shared latent space [26], [25], or a semantic aware discriminator [24] to reduce undesired changes.

In this work we propose density ratio based distribution pre-matching in ensemble with a cyclic-consistency loss for adversarial synthetic to real domain adaptation in urban traffic scenes. For the density ratio estimation we employ the Kullback-Leibler importance estimation procedure (KLIEP) [42]. This helps to keep the semantic consistency of translated images and improves the visual quality of generated samples. Evaluated on the particular task of semantic segmentation, it reveals better average performance and better performance for the main classes. It does not affect the stability of adversarial training as it avoids additional constraints and losses.

II. RELATED WORK

Synthetic data has found its application in a variety of computer vision tasks. Hattori et al. used spatial information of a virtual scene to create a surveillance detector [15].

It also has been widely used for evaluation purposes. [20] used virtual worlds to test feature descriptors and [14] used synthetically generated environments for evaluation on tasks such as visual odometry or SLAM.

There is a plethora of research works which utilized CAD models for computer vision tasks. Sun and Saenko [43] investigated 3D models for 2D object detection and [1] established part based correspondences between 3D CAD models and real images. [30] showed the effectiveness of augmenting training data with crowd-sourced 3D models and [31] extended part models to include viewpoint and geometry information for joint object localization and viewpoint estimation.

Another vivid research area which utilizes rendered data is motion and pose estimation. For example, [37] used a realistic and highly varied training set of synthetic images to learn a model invariant to body shape, clothing and other factors. [48] presented SURREAL, a large-scale dataset with realistically generated images from 3D human motion sequences.

Synthetically generated data seems to be especially useful when labeling of real data is tedious. This is the case with pixel dense tasks such as flow and depth estimation. Dosovitskiy et al. [7] generated an unrealistic synthetic dataset called Flying Chairs and showed good generalization abilities of flow estimators. [13] focused on depth-based semantic per pixel labeling and [29] set up an on-the-fly rendering pipeline to generate cluttered rooms for indoor scene understanding, which is also a subject of investigation in [36].

Considering the almost eternal variability of traffic scenarios, it is natural that synthetic data extended its area of application to traffic scene understanding. In particular, pedestrian detection got a lot of attention. [28], [23], [49] addressed the question of transfer learning, trying to answer whether a pedestrian detector learned in a virtual environment could work with real images.

There are also certain synthetic datasets for traffic scenes. [12] provided accurate flow, depth and segmentation ground-truth for approximately 8,000 frames. Ros et al. developed one of the major datasets in this area called SYNTHIA [33]. Gaidon et al. [10] introduced a virtual-to-real clone method to create so-called "proxy virtual worlds" and released the "Virtual KITTI" dataset. The synthetic dataset with the highest variance in scenes and scenarios, counting almost 25,000 densely annotated frames, was provided in [32].

A severe disadvantage of utilizing rendered data is that models trained on synthetic data generalize rather poorly in the real world. This issue, as already mentioned, is commonly known as domain shift [41] and is addressed by domain adaptation techniques. Recent synthetic to real domain adaptation techniques roughly fall into 2 categories, and both of them commonly rely on adversarial training. One category incorporates an adversarial loss directly into the task learning procedure. These methods commonly use both synthetic and real images as input, producing segmentation maps (or any other CV task output). Such models normally do not generate additional data. Although the adversarial loss assists in bridging the gap between synthetic and real traffic scenes, it is not cut out for the target accurate classification or detection learning task. Thus, multiple approaches introduce various regularization techniques. [35] and [27] utilize a discrepancy loss to generate target features close to source ones. [47] applies an adversarial loss directly on learned segmentation feature maps. Other examples are [34], [50], [4] and [54], [5].

Another category of methods focuses on translating synthetic images to real ones and using them afterwards for target prediction learning; these are also called generative. Here the adversarial loss allows generating visually pleasing images of high resolution by minimizing the distance between the generated and target distributions. Many researchers focused their efforts on designing dedicated constraints in adversarial models to overcome the mismatch problem. CycleGAN [53] uses cyclic-consistency in addition to the adversarial loss. CyCADA [17] improves on top of [53] by integrating a segmentation loss and [9] introduced a geometry consistency loss. Some works introduced disentanglement of content and appearance in a latent space [25], [39].


Fig. 2. Examples of images adapted from synthetic to real (columns: no adaptation, CyCADA, CycleGAN, UNIT, ours).

In our approach we focus on synthetic to real image transfer and handle the domain shift issue by using an importance weighting technique based on density ratio estimation. Certain works utilize importance weights [16] or kernel density [40] to improve GAN training. We employ a technique named KLIEP [42] to pre-match distribution densities alongside the adversarial and cycle consistency losses, which allows us to perform semantically consistent synthetic to real domain adaptation in an unsupervised manner. We show that our model achieves significant performance improvement in data generation compared to state-of-the-art synthetic to real generative models (the second category). This is evaluated on the task of semantic segmentation. Our ablation study shows how KLIEP based importance pre-matching affects the adversarial training of our model.

III. APPROACH

A. Problem Definition

Our setup consists of pairs of input images x together with corresponding labels y from the synthetic dataset, $\{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, and pairs $\{(x_j^r, y_j^r)\}_{j=1}^{N_r}$ from the real one. We denote the input samples $\{x_i^s\}$ from the synthetic domain as $D_s$ and $\{x_j^r\}$ from the real (target) domain as $D_r$:

$$D_s = \{x_i^s\}_{i=1}^{N_s} \quad (1)$$

$$D_r = \{x_j^r\}_{j=1}^{N_r} \quad (2)$$

Let us consider a variable x in the input distribution space X taking values $x_i^s$, which are independent and identically distributed and follow the probability distribution $P_s(x)$:

$$x_i^s \in X_s \subset X, \quad i = 0, 1, \ldots, N_s, \qquad \{x_i^s\}_{i=1}^{N_s} \sim P_s(x) \quad (3)$$

The real samples $x_j^r$ in turn follow a different probability distribution $P_r$:

$$x_j^r \in X_r \subset X, \quad j = 0, 1, \ldots, N_r, \qquad \{x_j^r\}_{j=1}^{N_r} \sim P_r(x) \quad (4)$$

In a sim-to-real setup the marginal distributions of $\{x_i^s\}$ and $\{x_j^r\}$ are generally different: $P_s(x) \neq P_r(x)$. This situation is addressed as covariate shift [41], meaning that while the input distributions differ, the conditional probability $P(y|x)$ is the same for $x^s$ and $x^r$.

Synthetic to real domain adaptation can then be formalized as finding a mapping function which translates samples from the sub-space of the synthetic domain into that of the real one, $g: X_s \to X_r$.

Typically, such a mapping function g is approximated by a neural network whose training relies on an adversarial loss (GAN) [11] in image space. During adversarial training one model, called the discriminator, gets input samples from the real distribution $P_r$ and from the generative distribution $P(g(x^s))$. During this zero-sum game the discriminator learns to distinguish real samples from those synthesized by the generator, which in turn learns to generate samples which are harder to distinguish. When training converges, g imposes the real distribution on the transformed samples: $P(g(x^s)) = P_r$.
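For reference, the minimax objective of [11] that this zero-sum game corresponds to reads as follows (the textbook formulation, with g the generator and d the discriminator; not an equation from this paper):

$$\min_g \max_d \; \mathbb{E}_{x^r \sim P_r}[\log d(x^r)] + \mathbb{E}_{x^s \sim P_s}[\log(1 - d(g(x^s)))]$$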

An adversarial loss applied in the image space works very well in making generated images similar to target ones. The discriminator learns regularities in the real domain and imposes perturbations on the generated images. This not only makes generated images target-alike w.r.t. appearance but also introduces mismatches in content and semantic layout.

B. Importance function

To reduce semantic inconsistency in transferred samples we intend to correct the distribution bias between synthetic and real datasets.

Page 4: KLIEP-based Density Ratio Estimation for Semantically

TABLE I. Semantic segmentation results for DRN26 (top), Deeplabv3 (middle) and Deeplabv2 (bottom) trained on translated images.

Method | Accuracy | meanIoU | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorbike | bicycle

DRN26:
CS [6] | 94.3 | 67.4 | 97.3 | 79.8 | 88.6 | 32.5 | 48.2 | 46.3 | 63.6 | 73.3 | 89.0 | 58.9 | 93.0 | 78.2 | 55.2 | 92.2 | 45.0 | 67.3 | 39.6 | 49.9 | 73.6
PfD [32] | 62.5 | 21.7 | 42.7 | 26.3 | 51.7 | 5.5 | 6.8 | 13.8 | 23.6 | 6.9 | 75.5 | 11.5 | 36.8 | 49.3 | 0.9 | 46.7 | 3.4 | 5.0 | 0.0 | 5.0 | 1.4
CycleGAN [53] | 82.5 | 32.4 | 81.8 | 34.7 | 73.5 | 22.5 | 8.7 | 25.4 | 21.1 | 13.5 | 71.5 | 26.5 | 41.7 | 50.1 | 7.3 | 78.5 | 20.5 | 19.5 | 0.0 | 12.5 | 6.9
CyCADA [17] | - | 38.8 | 82.4 | 38.9 | 79.0 | 26.1 | 19.3 | 33.2 | 32.4 | 21.3 | 73.9 | 37.1 | 61.8 | 56.2 | 17.6 | 78.5 | 10.0 | 31.0 | 10.7 | 13.8 | 14.2
UNIT [25] | - | 36.1 | 79.2 | 28.5 | 75.9 | 22.1 | 13.6 | 27.0 | 29.7 | 18.8 | 75.9 | 25.8 | 56.3 | 57.5 | 21.8 | 81.1 | 18.9 | 21.6 | 1.5 | 13.7 | 17.2
Ours | - | 39.7 | 84.1 | 34.6 | 80.5 | 24.4 | 17.7 | 32.5 | 31.1 | 27.4 | 79.7 | 26.9 | 68.7 | 58.8 | 21.1 | 84.4 | 22.6 | 21.2 | 1.0 | 20.1 | 17.8

Deeplabv3:
CS [6] | 95.5 | 75.6 | 97.9 | 83.5 | 91.6 | 56.5 | 61.2 | 54.8 | 63.9 | 73.6 | 91.3 | 59.9 | 93.2 | 77.7 | 60.1 | 94.0 | 79.3 | 87.0 | 76.1 | 61.0 | 73.2
PfD [32] | 82.9 | 40.0 | 79.2 | 26.9 | 79.5 | 19.1 | 27.4 | 13.8 | 23.6 | 6.9 | 75.5 | 11.5 | 36.8 | 49.3 | 0.9 | 46.7 | 3.4 | 5.0 | 0.0 | 5.0 | 1.4
CycleGAN [53] | 87.7 | 46.0 | 85.4 | 39.0 | 85.4 | 42.3 | 26.3 | 37.8 | 40.1 | 24.8 | 81.3 | 28.8 | 79.4 | 62.5 | 27.2 | 85.5 | 32.9 | 44.3 | 0.0 | 29.1 | 17.4
CyCADA [17] | 88.5 | 48.7 | 89.4 | 45.1 | 85.3 | 42.1 | 23.0 | 39.3 | 39.1 | 25.9 | 84.4 | 42.7 | 79.9 | 63.6 | 29.7 | 86.3 | 35.3 | 44.6 | 11.7 | 30.8 | 26.6
UNIT [25] | 86.8 | 47.6 | 85.2 | 33.3 | 85.4 | 46.8 | 28.7 | 35.8 | 36.2 | 26.4 | 83.1 | 36.5 | 81.8 | 63.2 | 27.0 | 88.5 | 43.0 | 50.9 | 0.0 | 30.1 | 19.6
Ours | 88.7 | 48.1 | 89.7 | 40.9 | 85.9 | 43.2 | 21.0 | 35.7 | 37.5 | 29.8 | 84.3 | 33.3 | 87.4 | 62.0 | 26.7 | 88.0 | 43.4 | 53.6 | 0.0 | 25.4 | 20.8

Deeplabv2:
CS [6] | - | 70.8 | 97.3 | 79.5 | 90.1 | 40.1 | 50.7 | 51.3 | 56.1 | 67.0 | 90.6 | 59.0 | 92.9 | 76.7 | 54.2 | 92.9 | 68.8 | 80.6 | 68.5 | 58.0 | 71.7
ROAD [4] | - | 35.9 | 85.4 | 31.2 | 78.6 | 27.9 | 22.2 | 21.9 | 23.7 | 11.4 | 80.7 | 29.3 | 68.9 | 48.5 | 14.1 | 78.0 | 19.1 | 23.8 | 9.4 | 8.3 | 0.0
Adapt [46] | - | 41.4 | 86.5 | 25.9 | 79.8 | 22.1 | 20.0 | 23.6 | 33.1 | 21.8 | 81.8 | 25.9 | 75.9 | 57.3 | 26.2 | 76.3 | 29.8 | 32.1 | 7.2 | 29.5 | 32.5
Ours | - | 42.0 | 81.4 | 28.6 | 80.4 | 27.4 | 12.0 | 32.9 | 38.3 | 28.6 | 82.5 | 29.4 | 78.6 | 63.4 | 16.7 | 84.0 | 25.5 | 41.3 | 0.2 | 33.6 | 12.4

To achieve that, and to mitigate the impact of covariate shift on the learning procedure, we employ the importance weighting concept. The key idea of importance weighting is to weight training samples based on their informativeness. Given the density functions of both the synthetic and the real distributions, the importance function is defined as:

$$\omega(x) = \frac{p_r(x)}{p_s(x)} \quad (5)$$
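As a quick numerical illustration of (5) (a hypothetical 1-D example, not from the paper: source $p_s = \mathcal{N}(0,1)$, real $p_r = \mathcal{N}(1,1)$), importance weights let expectations under $p_s$ reproduce expectations under $p_r$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical densities: source p_s = N(0,1), real/target p_r = N(1,1).
xs = rng.normal(0.0, 1.0, size=100_000)            # samples from p_s
w = norm.pdf(xs, loc=1.0) / norm.pdf(xs, loc=0.0)  # omega(x) = p_r(x)/p_s(x), eq. (5)

# Importance-weighted mean under p_s approximates the mean under p_r (= 1.0).
print(np.mean(w * xs))  # ~1.0
print(np.mean(w))       # ~1.0: omega integrates to 1 against p_s, cf. eq. (9)
```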

C. Density ratio estimation

In unsupervised synthetic-to-real image transfer it is hard to estimate the probability densities of both the source domain and the real domain without prior information about the distributions. This can however be avoided by addressing density ratio estimation directly. In our approach we rely on an estimation technique called the Kullback-Leibler importance estimation procedure (KLIEP), which was introduced in [42]. This procedure focuses directly on estimating the density ratio between the source and target densities instead of estimating them separately.

KLIEP aims to model the importance function $\omega(x)$ as:

$$\hat{\omega}(x) = \sum_l \alpha_l \varphi_l(x), \quad (6)$$

where the parameters $\alpha_l$ are to be learned from the samples $x_i^s$ (source) and $x_j^t$ (target), and $\varphi_l(x)$ are basis functions. The estimation model $\hat{\omega}(x)$ approximates the target density as $\hat{p}_t(x) = \hat{\omega}(x) p_s(x)$. The parameters $\alpha_l$ of the model should be calculated in such a way that the Kullback-Leibler divergence from $p_t(x)$ to $\hat{p}_t(x)$ is minimized:

$$KL(p_t \| \hat{p}_t) = \mathbb{E}_{x^t}\left[\log \frac{p_t(x)}{\hat{\omega}(x)\, p_s(x)}\right] = \mathbb{E}_{x^t}\left[\log \frac{p_t(x)}{p_s(x)}\right] - \mathbb{E}_{x^t}[\log \hat{\omega}(x)] \quad (7)$$

Since the first term does not depend on α, we consider only the latter one:

$$\mathbb{E}_{x^t}[\log \hat{\omega}(x)] = \frac{1}{N_t} \sum_j^{N_t} \log \sum_l \alpha_l \varphi_l(x_j^t) \quad (8)$$

Thus, in order to minimize the KL divergence we can maximize (8) w.r.t. α under the following constraint:

$$\mathbb{E}_{x^s}[\hat{\omega}(x)] = \frac{1}{N_s} \sum_i^{N_s} \sum_l \alpha_l \varphi_l(x_i^s) = 1 \quad (9)$$

This constraint comes from the fact that $\hat{p}_t(x)$ is a probability density function itself. In this way we have defined our optimization problem:

$$\begin{aligned} \underset{\alpha_l}{\text{maximize}} \quad & \sum_j \log \sum_l \alpha_l \varphi_l(x_j^t) & (10a) \\ \text{subject to} \quad & \sum_l \alpha_l \sum_i \varphi_l(x_i^s) = N_s, & (10b) \\ & \alpha \geq 0. & (10c) \end{aligned}$$

We use an RBF kernel $K_{\sigma_r}$ centered at the target samples $x_j^t$, with width $\sigma_r$ found by grid search maximizing the objective (10a), to calculate the respective importance for the source samples. We perform analogous calculations in the reverse direction:

$$\omega(x^s) = \sum_l \alpha_l K_{\sigma_r}(x^s, x_l^t), \qquad \psi(x^t) = \sum_k \beta_k K_{\sigma_s}(x^t, x_k^s) \quad (11)$$

We employ gradient ascent with constraint satisfaction to find $\omega(x^s)$ and $\psi(x^t)$ in order to complete the pre-matching of the marginal distributions.
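A compact NumPy sketch of this estimation procedure, eq. (6)-(10), under the stated choices: RBF basis functions centered at the target samples and projected gradient ascent enforcing (10b)-(10c). All names are ours for illustration; the original implementation may differ.

```python
import numpy as np

def rbf(x, centers, sigma):
    # Basis functions phi_l(x): one RBF per target sample, cf. eq. (6)/(11).
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kliep(xs, xt, sigma, lr=1e-3, iters=2000):
    phi_t = rbf(xt, xt, sigma)  # basis evaluated on target samples, objective (10a)
    phi_s = rbf(xs, xt, sigma)  # basis evaluated on source samples, constraint (10b)
    b = phi_s.mean(axis=0)
    alpha = np.ones(len(xt)) / len(xt)
    for _ in range(iters):
        # Gradient ascent on (10a): d/d_alpha of sum_j log(phi(x_j^t) . alpha)
        alpha += lr * (phi_t / (phi_t @ alpha)[:, None]).mean(axis=0)
        # Constraint satisfaction: alpha >= 0 (10c) and E_s[omega(x)] = 1 (9)
        alpha = np.maximum(alpha, 0.0)
        alpha /= b @ alpha
    return alpha, lambda x: rbf(x, xt, sigma) @ alpha  # the estimator omega(.)

# Toy usage in the spirit of Sec. IV-A (dimensions reduced for illustration):
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 10.0, size=(500, 2))  # source: uniform samples
xt = rng.normal(7.0, 0.5, size=(100, 2))    # target: Gaussian samples
alpha, omega = kliep(xs, xt, sigma=1.0)
print(omega(xs).mean())                     # = 1.0 by construction, eq. (9)
```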


D. Weighted loss

In our adversarial domain adaptation approach we also rely on a cycle-consistency loss, as it yields robust training and produces consistent results at high resolution. Thus, our loss function is constructed from importance weighted adversarial losses [11] for both $g_r: X_s \to X_r$ and $g_s: X_r \to X_s$ and importance weighted cyclic-consistency losses [53]:

$$\begin{aligned} L &= L_{IWAdv} + L_{IWCyc} \\ &= \mathbb{E}_{x^r}[\psi(y^r) \log d_r(x^r)] \\ &+ \mathbb{E}_{x^s}[\omega(y^s) \log(1 - d_r(g_r(x^s)))] \\ &+ \mathbb{E}_{x^s}[\omega(y^s) \log d_s(x^s)] \\ &+ \mathbb{E}_{x^r}[\psi(y^r) \log(1 - d_s(g_s(x^r)))] \\ &+ \mathbb{E}_{x^r}[\psi(y^r) \| g_r(g_s(x^r)) - x^r \|] \\ &+ \mathbb{E}_{x^s}[\omega(y^s) \| g_s(g_r(x^s)) - x^s \|] \end{aligned} \quad (12)$$
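The following PyTorch-style sketch shows how per-sample KLIEP weights enter the terms of (12); `g_r`, `g_s`, `d_r`, `d_s` are placeholder generator/discriminator modules, and the discriminators are assumed to return per-sample probabilities. In practice the minimax game would alternate discriminator and generator updates on the respective parts of this loss.

```python
import torch

def weighted_loss(xs, xr, w_s, w_r, g_r, g_s, d_r, d_s):
    """Importance-weighted adversarial + cycle terms of eq. (12).

    xs, xr   -- batches of synthetic and real images, shape (B, C, H, W)
    w_s, w_r -- per-sample KLIEP weights omega(.), psi(.), shape (B,)
    """
    eps = 1e-8  # stabilizes the logarithms
    fake_r, fake_s = g_r(xs), g_s(xr)

    adv = (w_r * torch.log(d_r(xr) + eps)).mean() \
        + (w_s * torch.log(1 - d_r(fake_r) + eps)).mean() \
        + (w_s * torch.log(d_s(xs) + eps)).mean() \
        + (w_r * torch.log(1 - d_s(fake_s) + eps)).mean()

    # Weighted cycle-consistency: per-sample L1 norm, then importance weighting.
    cyc = (w_r * (g_r(fake_s) - xr).abs().flatten(1).mean(1)).mean() \
        + (w_s * (g_s(fake_r) - xs).abs().flatten(1).mean(1)).mean()

    return adv + cyc
```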

IV. EXPERIMENTS

To evaluate our approach we employ 2 experimental setups: a toy example and real data. In the first one we simulate the source and target distributions with uniform and Gaussian samplers. In the real setup we experiment with large scale datasets in simulated and real traffic scene environments.

A. Toy Example

To facilitate the toy experiment we generate source and target datasets from uniform and normal distributions respectively. In this particular example both datasets consist of 10,000 random vectors of size 300. The vectors of the target dataset have been sampled from a Gaussian distribution with mean value 7.0 and standard deviation 0.5, those of the source dataset from a uniform distribution on the interval [0, 10). Histograms of both distributions can be observed in figure 3, depicted in blue and red.

As a baseline for the toy example we train a vanilla GAN model [11] on the source and target datasets for 40 epochs with a batch size of 200. The generator in the architecture of our choice consists of a number of linear layers with scaled exponential linear units [21], while the discriminator net uses an additional sigmoid activation. We use the SGD optimizer for the generator as well as for the discriminator, with learning rates 8e−3 and 4e−3 respectively. The goal of the network is to transform the input source distribution in such a way that the generated and target distributions are similar.

We extend the aforementioned vanilla GAN with the proposed KLIEP-based importance loss (13) and train this model following the same experimental setup as before.

$$L = L_{IWAdv} = \mathbb{E}_{x^t}[\log d(x^t)] + \mathbb{E}_{x^s}\Big[\sum_l \alpha_l K_{\sigma_t}(x^s, x_l^t) \log(1 - d(g(x^s)))\Big] \quad (13)$$

In both trainings we intentionally switch off random batching. Both trained models, the vanilla GAN and the KLIEP GAN, were deployed on 10,000 source vectors for inference.

Fig. 3. Histograms for source data (blue), target data (red), data generated by the vanilla GAN (orange), and data generated by our KLIEP GAN (green).

TABLE II. Distances between generated and target distributions (less is better).

Distribution | µ | σ | Wasserstein distance | Energy distance
Target (Gauss) | 7.0 | 0.5 | - | -
Source (uniform) | 5.0 | 2.9 | 2.56 | 1.39
Vanilla GAN | 7.7 | 2.0 | 1.32 | 0.79
Ours | 6.7 | 1.8 | 1.08 | 0.67

The resulting distributions were compared in terms of their moments and their distance to the target. They are depicted in figure 3 in orange and green respectively. We evaluate the generated distributions using the Wasserstein and energy distances between them and the target distribution. The obtained results are reported in table II. From the results of the ablation study on the toy data we can tell that using the density ratio estimator for distribution pre-matching significantly improves the results of adversarial learning. The distribution generated with the importance loss is closer to the target one in terms of the moments as well as in terms of the distances. The Wasserstein distance to the target distribution improves by 20% and the energy distance by 15%.
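Both reported metrics are available in SciPy for 1-D samples; a minimal sketch of the comparison (with a stand-in Gaussian for the generated output, since the trained models are not reproduced here):

```python
import numpy as np
from scipy.stats import wasserstein_distance, energy_distance

rng = np.random.default_rng(0)
target = rng.normal(7.0, 0.5, size=10_000)     # Gaussian target of Sec. IV-A
generated = rng.normal(6.7, 1.8, size=10_000)  # stand-in for KLIEP GAN output

print(wasserstein_distance(generated, target))
print(energy_distance(generated, target))
```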

B. Real Data

Our large scale evaluation pipeline similarly consists of 2 stages. First we train our domain adaptation network with synthetic and real datasets. Then we deploy it on the synthetic images and translate them to the real domain. In the second stage we train multiple target prediction task models with the translated images and evaluate their performance on a real validation dataset.

As the real dataset we take one of the recent datasets called Cityscapes [6], as it is most commonly used in the autonomous driving community. It provides 5000 frames of urban traffic scenes at a resolution of 2048×1024 alongside fine pixel-level semantic labels. Samples are split into training, validation and test subsets. It covers 50 cities, multiple times of the year, and different daylight and weather conditions.


Fig. 4. Examples of semantic segmentation by DRN26 trained on translated synthetic images (columns: Cityscapes, ground truth, CycleGAN, CyCADA, UNIT, ours).

Fig. 5. Examples of semantic segmentation by Deeplabv3 trained on translated synthetic images (columns: Cityscapes, ground truth, CycleGAN, CyCADA, UNIT, ours).

Cityscapes provides ground-truth for semantic, instance and panoptic segmentation. Semantic segmentation covers 30 classes and also single instance annotations for dynamic objects such as car, person, rider etc. We focus our evaluation on the 19 training classes: road, building, sky, sidewalk, vegetation, car, terrain, wall, truck, pole, fence, bus, person, traffic light, traffic sign, train, motorcycle, rider, bicycle.

As the synthetic dataset we utilize the one from [32], as it is the most comprehensive synthetic dataset. It provides almost 25,000 frames acquired from a computer game engine alongside semantic labels. Every frame has a resolution of 1914 × 1052. Although it reveals some labeling bugs, it remains the main synthetic dataset for autonomous driving. It shows certain advantages in comparison with other synthetic datasets w.r.t. traffic scenes. It is by far more realistic in terms of appearance as well as in terms of traffic scene construction, and it shows a huge variance in scenery, scenarios and appearance.

First, we evaluate the results qualitatively. Results of domain adaptation in comparison with other approaches can be seen in figure 2. In this figure one can see that translation by multiple models introduces mismatching patches in place of classes such as vegetation and sky. In turn, density ratio prematching enables the translation model to preserve semantics.

Most importantly, we evaluate the quality of image transfer on the semantic segmentation task. For this evaluation we train state-of-the-art segmentation models on our generated data and evaluate them on the Cityscapes val dataset. It needs to be said that during training the segmentation model did not "see" any real images from the target dataset.

We follow the original works in our evaluation experiments. As a preprocessing step all images were down-scaled to a resolution of 1024 × 512 pixels. In our evaluation we rely on DRN [52] and Deeplabv3 [3]. DRN26 was initialized with weights pretrained on Imagenet [22] and fine-tuned for 200 epochs on random 600 × 600 crops of our translated data with momentum 0.99 and learning rate 0.001, decreased by a factor of 10 every 100 epochs. Deeplabv3 utilizes an xception65 backbone and has been trained for 90,000 steps with a batch size of 16; we keep a learning rate of 0.007 and crops of 513 × 513. The obtained metrics for the best performing snapshots of both networks are reported in table I. Additionally we train Deeplabv2 [2] and evaluate on Cityscapes val. In this evaluation we also follow the setup of the original work.


Fig. 6. Examples of image transfer by KLIEP GAN trained on synthetic images from different importance cohorts (columns: original, ground truth, low, medium, high).

Our main metric is IoU, or Jaccard index: for a particular class it is the ratio of correctly classified pixels to the sum of true positive, false positive and false negative predictions [8]. We additionally report its mean value over all 19 classes. This metric takes segmentation performance into consideration without being affected by the size of the particular class itself. We report pixel accuracy as well. The results obtained in our experiments are presented in table I. The tables show the performance of the DRN26 and Deeplab networks trained on datasets generated by translating synthetic images to real ones. Additionally we provide comparison numbers for the aforementioned nets trained on merely real (CS) and synthetic data (PfD).
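A minimal sketch of the per-class IoU and meanIoU computation via a confusion matrix (our own helper for illustration; assumes integer label maps with the 19 Cityscapes training IDs and ignore label 255):

```python
import numpy as np

NUM_CLASSES = 19

def iou_scores(pred, gt, ignore=255):
    """Per-class IoU = TP / (TP + FP + FN) and meanIoU over present classes."""
    mask = gt != ignore
    conf = np.bincount(NUM_CLASSES * gt[mask] + pred[mask],
                       minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES, NUM_CLASSES)
    tp = np.diag(conf)                      # correctly classified pixels per class
    denom = conf.sum(0) + conf.sum(1) - tp  # TP + FP + FN
    iou = tp / np.maximum(denom, 1)
    return iou, iou[denom > 0].mean()
```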

In table I one can see that pre-matching densities using KLIEP improves performance in terms of meanIoU and also for major classes such as road, building, vegetation, sky and car. The class sky has shown an improvement of almost 7%, other classes of more than 2%. For Deeplab, CyCADA remains the top performing model w.r.t. meanIoU, but it was improved upon by density pre-matching for multiple classes such as building, vegetation, sky, truck and bus.

C. Ablation Study

In addition to the toy example we perform an ablation study on the large scale datasets. The intention is to show how importance estimation in our KLIEP GAN influences the adversarial training. For that purpose we split the source dataset samples according to their importance estimates into 3 equal cohorts (see the sketch below). Each cohort consists of 8322 training pairs and represents a certain importance range: low, medium and high. The following evaluation steps largely reproduce our main evaluation pipeline. We downscale, though, all samples to a resolution of 512×256 to speed up the evaluation process. We train 3 instances of our KLIEP GAN model with the respective cohort as the source dataset and Cityscapes train as the target dataset. After that, each of the 3 models is deployed on the [32] dataset, which results in 3 adapted datasets with 24,966 transferred samples each. As a final step, we train 3 DRN models on the corresponding generated datasets and evaluate them on Cityscapes val.
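The cohort split itself reduces to sorting the source samples by their estimated importance and cutting the order into three equal parts; a sketch assuming an estimator like the one in Sec. III-C (file name and array purely hypothetical):

```python
import numpy as np

# Hypothetical array of KLIEP importance estimates omega(x_i^s), shape (N_s,)
scores = np.load("kliep_importance_scores.npy")
order = np.argsort(scores)                    # indices sorted by ascending importance
low, medium, high = np.array_split(order, 3)  # three equal cohorts of sample indices
```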

The results of the ablation study are reported in table III. Here one can see the IoU values for major classes as well as the meanIoU over all 19 original classes. The numbers reported in the ablation study confirm the intuition that higher importances estimated by KLIEP reflect similarity with the target distribution. Thus, one can say that learning from more informative samples (those with a higher importance score) improves the quality of adversarial image translation. Such qualitative improvement can be observed

TABLE III. IoU values for semantic segmentation prediction by DRN26 trained on translated synthetic to real images obtained from different importance cohorts.

Method | meanIoU | road | sidewalk | building | wall | fence | vegetation | sky | car
Cityscapes [6] | 67.4 | 97.3 | 79.8 | 88.6 | 32.5 | 48.2 | 89.0 | 93.0 | 92.2
No adapt [32] | 21.7 | 42.7 | 26.3 | 51.7 | 5.5 | 6.8 | 75.5 | 36.8 | 46.7
Low | 27.9 | 75.6 | 28.7 | 69.1 | 14.5 | 18.5 | 63.4 | 45.8 | 75.7
Medium | 28.3 | 77.8 | 24.0 | 71.6 | 10.7 | 17.1 | 69.3 | 69.6 | 73.5
High | 30.2 | 82.2 | 40.2 | 72.1 | 15.3 | 23.2 | 72.9 | 69.5 | 77.6

in figure 6. Table III also confirms a gradual improvement of translation quality: as we move from the low importance cohort to the high importance one, meanIoU rises.

V. CONCLUSION

In this paper we proposed the usage of density prematching domain adaptation based on the KLIEP density ratio estimation procedure, combined with an effective cycle-consistency loss, in order to tackle the covariate shift problem between synthetic and real datasets. We have shown in our experiments that this strategy works well for synthetic to real domain adaptation. First, we visualized the effects of the KLIEP based loss of our model on the toy example. Here we have shown that distribution pre-matching is a very helpful means in adversarial learning of a target distribution. In our large scale experiments we have shown that the KLIEP loss not only improves the visual quality of images transferred from synthetic to real (mainly in terms of semantic consistency) but also improves the performance of deep semantic segmentation networks trained on the translated images (the improvement for highly imbalanced classes such as vegetation and sky exceeded 7%). And finally, our ablation study visualized how the importance scores obtained by KLIEP affect the adversarial training of the model.

VI. ACKNOWLEDGEMENT

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project "KI Absicherung – Safe AI for Automated Driving". The authors would like to thank the consortium for the successful cooperation.


REFERENCES

[1] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: Exemplar part-based 2d-3d alignment using a large dataset of cad models. IEEE CVPR, 2014.

[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.

[3] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.

[4] Y. Chen, W. Li, and L. Van Gool. Road: Reality oriented adaptation for semantic segmentation of urban scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[5] Y.-C. Chen, Y.-Y. Lin, M.-H. Yang, and J.-B. Huang. Crdoco: Pixel-level domain transfer with cross-domain consistency. In CVPR, 2019.

[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. IEEE ICCV, 2015.

[8] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 2015.

[9] H. Fu, M. Gong, C. Wang, K. Batmanghelich, K. Zhang, and D. Tao. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In IEEE CVPR, 2019.

[10] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

[12] V. Haltakov, C. Unger, and S. Ilic. Framework for generation of synthetic ground truth data for driver assistance applications. In GCPR, 2013.

[13] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla. Understanding realworld indoor scenes with synthetic data. IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[14] A. Handa, T. Whelan, J. McDonald, and A. J. Davison. A benchmark for rgb-d visual odometry, 3d reconstruction and slam. In IEEE International Conference on Robotics and Automation (ICRA), 2014.

[15] H. Hattori, V. N. Boddeti, K. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[16] R. D. Hjelm, A. P. Jacob, T. Che, K. Cho, and Y. Bengio. Boundary-seeking generative adversarial networks. ArXiv, 2017.

[17] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.

[18] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. CoRR, abs/1612.02649, 2016.

[19] Y. Hong, U. Hwang, J. Yoo, and S. Yoon. How generative adversarial networks and their variants work: An overview. ACM Comput. Surv., 52, 2019.

[20] B. Kaneva, A. Torralba, and W. T. Freeman. Evaluation of image features using a photorealistic virtual world. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), 2011.

[21] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In NIPS, 2017.

[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[23] K. Li, X. Wang, Y. Xu, and J. Wang. Density enhancement-based long-range pedestrian detection using 3-d range data. IEEE Transactions on Intelligent Transportation Systems, 17, 2016.

[24] P. Li, X. Liang, D. Jia, and E. P. Xing. Semantic-aware grad-gan for virtual-to-real urban scene adaption. In British Machine Vision Conference (BMVC), 2018.

[25] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems 30, 2017.

[26] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems 29, 2016.

[27] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. IEEE CVPR, 2019.

[28] J. Marin, A. M. Lopez, D. Geronimo, and D. Vazquez. Learning appearance in virtual scenarios for pedestrian detection. In IEEE CVPR, 2010.

[29] J. Papon and M. Schoeler. Semantic pose using deep networks trained on synthetic rgb-d. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[30] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3d models. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[31] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Multi-view and 3d deformable part models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11), 2015.

[32] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.

[33] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In IEEE CVPR, 2016.

[34] F. Sadat Saleh, M. Sadegh Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez. Effective use of synthetic data for urban scene semantic segmentation. In Proceedings of the ECCV, 2018.

[35] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.

[36] S. Satkin, J. H. Lin, and M. Hebert. Data-driven scene understanding from 3d models. In BMVC, 2012.

[37] J. Shotton, R. B. Girshick, A. W. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE TPAMI, 35(12), 2013.

[38] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. IEEE CVPR, 2017.

[39] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations (ICLR), 2018.

[40] M. Sinn and A. Rawat. Non-parametric estimation of jensen-shannon divergence in generative adversarial network training. In AISTATS, 2017.

[41] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. The MIT Press, 2012.

[42] M. Sugiyama, S. Nakajima, H. Kashima, P. v. Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2007.

[43] B. Sun and K. Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, 2014.

[44] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. ICLR, 2017.

[45] G. R. Taylor, A. J. Chosak, and P. C. Brewer. Ovvv: Using virtual worlds to design and evaluate surveillance systems. IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[46] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[47] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. K. Chandraker. Learning to adapt structured output space for semantic segmentation. IEEE CVPR, 2018.

[48] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[49] D. Vazquez, A. M. Lopez, J. Marin, D. Ponsa, and D. Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[50] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool. Sliced wasserstein generative models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[51] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. IEEE International Conference on Computer Vision (ICCV), 2017.

[52] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.

[53] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. IEEE International Conference on Computer Vision, 2017.

[54] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In The European Conference on Computer Vision (ECCV), 2018.