TRANSCRIPT
Domain Adaptation in a deep learning context
Tinne Tuytelaars
Work in collaboration with …
• T. Tommasi, N. Patricia, B. Caputo, T. Tuytelaars, "A Deeper Look at Dataset Bias", GCPR 2015
• A. Raj, V. Namboodiri and T. Tuytelaars, "Subspace Alignment Based Domain Adaptation for RCNN Detector", BMVC 2015
Thanks
Thanks to
(a) Prof. R. Hegde (b) Prof. A. Mukerjee
Special thanks to
(a) Prof. V. Namboodiri (b) Prof. T. Tuytelaars (c) Sachin (for installing Caffe :P)
Thanks to all the committee members
Thanks to my colleagues
Anant Raj (IIT Kanpur)
Some key questions
• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?
Domain Adaptation
• Domain = a set of data defined by its marginal and conditional distributions with respect to the assigned labels
• Dataset bias
– Capture bias
– Category or label bias
– Negative bias
How relevant is DA with CNN?
• Most widely used DA setting for CV:
– Office dataset
– SURF + BoW representation
• Modern features (CNN) give better results even without DA
• “CNN features have been trained on very large datasets, to be robust to all possible types of variability. So we do not need any DA anymore.”
How relevant is DA with DNN?
• Always compare DA algorithms on top of the same representations
• Performance on target depends on:
– performance on source
– domain divergence (one way to estimate it is sketched below)
– a model that fits both source and target (e.g. negative bias, annotator bias)
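The slides list domain divergence as a driver of target performance but do not say how to measure it. A common estimate is the proxy A-distance of Ben-David et al. (cited as [2] later in this talk): train a classifier to separate source from target features; the harder that task, the smaller the divergence. A minimal sketch, not from the talk, with hypothetical feature arrays as inputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(Xs, Xt, seed=0):
    """Estimate domain divergence via the proxy A-distance:
    train a classifier to separate source from target features.
    An error near 0.5 (distance near 0) means the domains look alike."""
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)   # domain-classification error
    return 2.0 * (1.0 - 2.0 * err)      # in [-2, 2]; larger = more divergent

# Hypothetical usage with CNN features from two datasets:
# pad = proxy_a_distance(decaf_office, decaf_caltech)
```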
How relevant is DA with DNN?
• CNN features > SURF + BoW
– More robust
– Learned on larger datasets
• CNN features are data-driven
– Be careful how to collect data; avoid unwanted biases
– May make them more sensitive to domain shifts…
Some key questions
• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?
Integration of DA in DNN
• Learn representations that yield good classification AND are robust to domain changes, using deep / end-to-end learning
• “We can learn representations that are even more robust using DA”
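The slides do not commit to a concrete architecture for end-to-end DA. One well-known instantiation from the same period is the gradient reversal layer of Ganin and Lempitsky's DANN; the PyTorch sketch below is purely illustrative (layer sizes are invented), not the method referred to in the talk:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the
    backward pass, so the feature extractor learns to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANN(nn.Module):
    """Feature extractor + label classifier + adversarial domain classifier.
    Layer sizes are placeholders, not from the talk."""
    def __init__(self, in_dim=4096, hid=256, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.label_clf = nn.Linear(hid, n_classes)
        self.domain_clf = nn.Linear(hid, 2)

    def forward(self, x, lambd=1.0):
        f = self.features(x)
        return self.label_clf(f), self.domain_clf(GradReverse.apply(f, lambd))
```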
Practical applications
• Use models pretrained on ImageNet on my own photo collection
• Use an off-the-shelf pedestrian detector even under low illumination conditions
• …
-> calls for simple, light adaptation methods
• Can we still use the simple DA methods developed before? (in particular, we focus on subspace-based methods; see the sketch below)
• Are methods developed for BoW features suitable for CNN features?
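Subspace alignment (SA, Fernando et al., ICCV 2013), the method the R-CNN experiments later in this talk build on, is indeed light: learn a PCA subspace per domain and align the source basis to the target one with a single matrix product. A minimal sketch (the dimension d=100 matches the detector setup described later; everything else is a generic implementation, not the authors' code):

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(Xs, Xt, d=100):
    """Subspace alignment: learn a PCA basis per domain, align the
    source basis to the target one, and project the data accordingly."""
    S = PCA(n_components=d).fit(Xs).components_.T   # (m, d) source basis
    T = PCA(n_components=d).fit(Xt).components_.T   # (m, d) target basis
    M = S.T @ T                                     # (d, d) alignment matrix
    Xs_aligned = Xs @ (S @ M)   # source data in the target-aligned subspace
    Xt_proj = Xt @ T            # target data in its own subspace
    return Xs_aligned, Xt_proj

# A classifier trained on Xs_aligned can be applied directly to Xt_proj.
```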
Some key questions
• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?
A cross-dataset benchmark
Cross-dataset benchmark
• Sparse set
Cross-dataset benchmark
• Dense set
New evaluation criterion
• % drop
• CD measure
[Figure: name-the-dataset results on the sparse set (12 datasets). Left: recognition rate vs. number of training samples per dataset for BOWsift, DeCAF6, DeCAF7 and chance. Middle/right: confusion matrices for BOWsift and DeCAF7 over AWA, AYH, C101, ETH, MSR, OFC, RGBD, PAS, SUN, BING, C256, IMG.]
Fig. 1 Name the dataset experiment over the sparse setup with 12 datasets. We use a linear SVM classifier with C value tuned by 5-fold cross validation on the training set. We show average and standard deviation results over 10 repetitions. The title of each confusion matrix indicates the feature used for the corresponding experiments.
One way to quantitatively evaluate the cross-dataset generalization was previously proposed in [39]. It consists of measuring the percentage drop (% Drop) between Self and Mean Others. However, being a relative measure, it loses the information on the value of Self, which is important if we want to compare the effect of different learning methods or different representations. In fact, a 75% drop w.r.t. a 100% self average precision has a different meaning than a 75% drop w.r.t. a 25% self average precision. To overcome this drawback, we propose here a different Cross-Dataset (CD) measure, defined as
$$\mathrm{CD} = \frac{1}{1 + \exp\{-(\mathrm{Self} - \mathrm{Mean\;Others})/100\}}\,.$$
CD uses directly the difference (Self − Mean Others), while the sigmoid function rescales this value between 0 and 1. This allows for the comparison among the results of experiments with different setups. Specifically, CD values over 0.5 indicate the presence of a bias, which becomes more significant as CD gets close to 1. On the other hand, CD values below 0.5 correspond to cases where either Mean Others ≥ Self or the Self result is very low. Both these conditions indicate that the learned model is not reliable on the data of its own collection, and it is difficult to draw any conclusion from its cross-dataset performance.
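For concreteness, both measures are one-liners; a sketch, with the car/BOWsift row trained on ETH80 (from Table 1 later in the talk) as a worked check:

```python
import numpy as np

def percent_drop(self_ap, mean_others_ap):
    """Relative % drop between in-dataset and cross-dataset performance."""
    return 100.0 * (self_ap - mean_others_ap) / self_ap

def cd_measure(self_ap, mean_others_ap):
    """Cross-Dataset measure: sigmoid of the absolute difference.
    CD > 0.5 signals bias; CD close to 1 signals strong bias."""
    return 1.0 / (1.0 + np.exp(-(self_ap - mean_others_ap) / 100.0))

# Worked check (car, BOWsift, trained on ETH80, Table 1 below):
# self = 98.5, mean others = (14.7 + 9.1 + 25.2) / 3 ≈ 16.3
# percent_drop(98.5, 16.3) ≈ 83.4   cd_measure(98.5, 16.3) ≈ 0.69
```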
4 Studying the Sparse set
4.1 Name the Dataset
With the aim of getting an initial idea on the relation among the datasets and how different representations capture their specific content, we start our analysis by running the name the dataset test on the sparse data setup. We extract randomly 1000 images from each of the 12 collections and we train a 12-way linear SVM classifier that we then test on a disjoint set of 300 images. The experiment is repeated 10 times with different data splits and we report the obtained average results in Figure 1. The plot on the left shows the recognition rate as a function of the training set size and indicates that DeCAF allows for a much better separation among the collections than what is obtained with BOWsift. In particular, DeCAF7 shows an advantage over DeCAF6 for
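A minimal sketch of this name-the-dataset protocol (for brevity C is fixed here, whereas the paper tunes it by 5-fold cross-validation; the input layout is an assumption):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

def name_the_dataset(features_per_dataset, n_train=1000, n_test=300, seed=0):
    """Train a 12-way linear SVM to predict which collection an image
    comes from. `features_per_dataset` is a list of (n_images, dim) arrays."""
    rng = np.random.default_rng(seed)
    X_tr, y_tr, X_te, y_te = [], [], [], []
    for label, feats in enumerate(features_per_dataset):
        idx = rng.permutation(len(feats))
        X_tr.append(feats[idx[:n_train]])
        y_tr += [label] * n_train
        X_te.append(feats[idx[n_train:n_train + n_test]])
        y_te += [label] * n_test
    clf = LinearSVC(C=1.0).fit(np.vstack(X_tr), y_tr)  # C fixed for brevity
    pred = clf.predict(np.vstack(X_te))
    return clf.score(np.vstack(X_te), y_te), confusion_matrix(y_te, pred)
```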
Some experimental results
[Figure: per-class recognition rates from the multiclass cross-dataset generalization test, for BOWsift and DeCAF7, over the train-test pairs C256-C256, C256-IMG and C256-SUN; plus confusion matrices among the classes people, umbrella and laptop for C256-C256 and C256-IMG.]
Table 3 Recognition rate per class from the multiclass cross-dataset generalization test. C256, IMG and SUN stand respectively for the Caltech256, Imagenet and SUN datasets. We indicate with “train-test” the pair of datasets used in training and testing.
Table 4 Left: Imagenet images annotated as Caltech256 data with BOWsift but correctly recognized with DeCAF7. Right: Caltech256 images annotated as Imagenet by BOWsift but correctly recognized with DeCAF7.
Since all the datasets contain the same object classes, we are in fact reproducing a setup generally adopted for domain adaptation [16,12]. By simplifying the dataset bias problem and identifying each dataset with a domain, we can interpret the results of this experiment as an indication of the domain divergence [2] and deduce that a model trained on SUN will perform poorly on the object-centric collections and vice versa. On the other hand, a better cross-dataset generalization should be observed among Imagenet, Caltech256 and Bing. We verify it in the following sections.
5.2 Cross-dataset generalization test.
We consider the same setup used before with 15 samples per class from each collection in training and 5 samples per class in test. However, now we train a one-vs-all
Some experimental results
Car — AP, training on rows, testing on columns (P, S, E, M):

BOWsift        P     S     E     M   | % Drop |  CD
  P          28.6  26.4  26.7  29.2 |   3.9  | 0.50
  S          25.4  26.8  25.7  25.7 |   4.3  | 0.50
  E          14.7   9.1  98.5  25.2 |  83.4  | 0.69
  M          17.4   8.7   9.7  90.5 |  86.8  | 0.69

DeCAF6         P     S     E     M   | % Drop |  CD
  P          26.2  19.2  44.3  42.6 | -35.1  | 0.47
  S          15.4  15.6  19.9  17.9 | -13.9  | 0.49
  E          27.3  18.7 100.0  93.5 |  53.5  | 0.63
  M          41.6  12.8  95.8  99.7 |  49.8  | 0.62

DeCAF7         P     S     E     M   | % Drop |  CD
  P          14.5  11.8  28.9  28.3 | -59.1  | 0.47
  S          27.1  27.4  32.8  28.1 |  -6.8  | 0.49
  E          31.0  24.0 100.0  90.9 |  51.3  | 0.62
  M          43.8  14.9  93.8  99.7 |  49.0  | 0.62

Cow:

BOWsift        P     S     E     M   | % Drop |  CD
  P          14.1  18.8  11.9  17.9 | -15.1  | 0.49
  S          16.1  38.3  12.5  27.1 |  51.4  | 0.54
  E           2.3   2.3  29.8   2.0 |  92.6  | 0.57
  M           8.6  17.4   2.4  52.5 |  82.0  | 0.61

DeCAF6         P     S     E     M   | % Drop |  CD
  P          35.3  34.4  20.2  39.1 |  11.4  | 0.51
  S          18.6  62.7  10.7  33.5 |  66.7  | 0.60
  E           4.4   8.3  91.9   4.1 |  93.9  | 0.70
  M          13.8  46.7   9.3  97.2 |  76.0  | 0.68

DeCAF7         P     S     E     M   | % Drop |  CD
  P          46.6  44.0  31.3  47.3 |  12.3  | 0.51
  S          12.8  40.1  12.1  23.6 |  59.7  | 0.56
  E           7.2   7.7  90.1   5.3 |  92.5  | 0.70
  M          14.1  39.4  10.4  97.7 |  78.2  | 0.68

Cow — fixed negatives:

BOWsift        P     S     E     M   | % Drop |  CD
  P          14.1  16.2  12.0  18.6 | -10.7  | 0.49
  S          18.7  38.3  17.0  35.8 |  37.8  | 0.53
  E          23.5  22.2  29.8  14.0 |  33.3  | 0.52
  M           8.9   9.4   2.0  52.5 |  87.1  | 0.61

DeCAF6         P     S     E     M   | % Drop |  CD
  P          35.3  32.7  16.9  46.6 |   9.1  | 0.50
  S          38.5  62.7  21.0  69.5 |  31.4  | 0.54
  E           6.1   7.9  91.9   4.6 |  93.2  | 0.70
  M          65.2  62.6  51.5  97.2 |  38.5  | 0.59

DeCAF7         P     S     E     M   | % Drop |  CD
  P          46.6  39.1  26.4  48.9 |  18.2  | 0.52
  S          25.7  40.1  12.8  43.3 |  31.9  | 0.53
  E          11.2   9.0  90.1  11.0 |  88.4  | 0.69
  M          69.4  55.1  47.4  97.7 |  41.3  | 0.59
Table 1 Binary cross-dataset generalization for two example categories, car and cow. Each matrix contains the object classification performance (AP) when training on one dataset (rows) and testing on another (columns). The diagonal elements correspond to the self results, i.e. training and testing on the same dataset. The percentage difference between the self results and the average of the other results per row corresponds to the value indicated in the column “% Drop”. CD is our newly proposed cross-dataset measure. We report in bold the values higher than 0.5. P, S, E, M stand respectively for the datasets Pascal VOC07, SUN, ETH80, MSRCORID.
large number of training samples. From the confusion matrices (middle and right in Figure 1) we see that it is easy to distinguish the ETH80, Office and RGB-D datasets from all the others regardless of the used representation, given the specific lab-nature of these collections. DeCAF captures better than BOWsift the characteristics of A-Yahoo, MSRCORID, Pascal VOC07 and SUN, improving the recognition results on them. Finally, Bing, Caltech256 and Imagenet are the datasets with the highest confusion level, an effect mainly due to the large number of classes and images per class. Still, this confusion decreases when using DeCAF.
The information obtained in this way over the whole collections provides us only with a small part of the full picture about the bias problem. The dataset recognition performance does not give an insight on how the classes in each collection are related among each other, nor how a model defined for a class in one dataset will generalize to the others. We look into this problem in the following paragraph.
4.2 Cross-dataset generalization test
With a procedure similar to that in [39], we perform a cross-dataset generalization experiment over two specific object classes shared among multiple datasets: car and cow. For the class car we selected randomly from four collections of the sparse set two groups of 50 positive and 1000 negative examples respectively for training and
Undoing dataset bias
are kept separated as belonging to different tasks and a max-margin model is learned from the information shared over all of them.
4.3.1 Method.
Let us assume to have $i = 1, \ldots, n$ datasets $D_1, \ldots, D_n$, each consisting of $s_i$ training samples $D_i = \{(x^i_1, y^i_1), \ldots, (x^i_{s_i}, y^i_{s_i})\}$. Here $x^i_j \in \mathbb{R}^m$ represents the $m$-dimensional feature vector and $y^i_j \in \{-1, 1\}$ represents the label for the example $j$ of dataset $D_i$. Specifically, all the datasets share an object class and its images are annotated with label 1, while all the other samples in the different datasets have label -1. The Unbias algorithm [25] consists in learning a binary model per dataset $w_i = w_{vw} + \Delta_i$, where $w_{vw}$ is a model for the visual world, while $\Delta_i$ is the bias for each dataset. These two parts are obtained by solving the following optimization problem:
$$\min_{w_{vw},\,\Delta_i,\,\xi,\,\rho} \;\; \frac{1}{2}\|w_{vw}\|^2 \;+\; \frac{\lambda}{2}\sum_{i=1}^{n}\|\Delta_i\|^2 \;+\; C_1 \sum_{i=1}^{n}\sum_{j=1}^{s_i} \xi_{ij} \;+\; C_2 \sum_{i=1}^{n}\sum_{j=1}^{s_i} \rho_{ij} \qquad (1)$$

subject to

$$w_i = w_{vw} + \Delta_i \qquad (2)$$
$$y^i_j \, w_{vw} \cdot x^i_j \;\ge\; 1 - \xi_{ij} \qquad (3)$$
$$y^i_j \, w_i \cdot x^i_j \;\ge\; 1 - \rho_{ij} \qquad (4)$$
$$\xi_{ij} \ge 0, \quad \rho_{ij} \ge 0 \qquad (5)$$
$$i = 1, \ldots, n \qquad j = 1, \ldots, s_i. \qquad (6)$$
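The paper solves this as a max-margin (QP) problem; for illustration only, here is a didactic subgradient-descent sketch of the same objective, writing the slacks as hinge losses. The solver, learning rate and hyperparameter defaults are my own choices, not the paper's:

```python
import numpy as np

def unbias_fit(datasets, lam=1.0, C1=1.0, C2=1.0, lr=1e-3, epochs=100):
    """Subgradient descent on the Unbias objective: w_i = w_vw + delta_i,
    with hinge losses for both the visual-world model (slack xi) and the
    per-dataset models (slack rho). `datasets` is a list of (X, y) pairs,
    y in {-1, +1}."""
    m = datasets[0][0].shape[1]
    w_vw = np.zeros(m)
    deltas = [np.zeros(m) for _ in datasets]
    for _ in range(epochs):
        g_vw = w_vw.copy()                  # gradient of (1/2)||w_vw||^2
        g_d = [lam * d for d in deltas]     # gradient of (lam/2)||delta_i||^2
        for i, (X, y) in enumerate(datasets):
            margin_vw = y * (X @ w_vw)
            margin_i = y * (X @ (w_vw + deltas[i]))
            viol_vw = margin_vw < 1         # active hinge terms (xi_ij > 0)
            viol_i = margin_i < 1           # active hinge terms (rho_ij > 0)
            g_vw -= C1 * (y[viol_vw, None] * X[viol_vw]).sum(axis=0)
            g_vw -= C2 * (y[viol_i, None] * X[viol_i]).sum(axis=0)
            g_d[i] -= C2 * (y[viol_i, None] * X[viol_i]).sum(axis=0)
        w_vw -= lr * g_vw
        deltas = [d - lr * g for d, g in zip(deltas, g_d)]
    return w_vw, deltas
```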
[Figure: % AP difference against the baseline for Car (3 datasets), Dog, Cow, Car (5 datasets) and Chair, comparing BOWsift and DeCAF7.]

Cow    Self   Other
M      98.97  86.56
P      53.67  34.29
S      40.89  29.20
A      44.76  25.94
E      99.40  86.56
Fig. 2 Percentage difference in average precision between the results of Unbias and the baseline All over each target dataset. P, S, E, M, A, C1, C2, OF stand respectively for the datasets Pascal VOC07, SUN, ETH80, MSRCORID, AwA, Caltech101, Caltech256 and Office. With O we indicate the overall value, i.e. the average of the percentage difference over all the considered datasets (shown in black).
DA methods with DeCAF7
[Figure: recognition rate vs. number of training samples per class for the 40-class Bing-Caltech256 and Bing-SUN experiments (Bing-Caltech, Caltech-Caltech, landmark, SA, reshape+SA, DAM, reshape+DAM, self-labelling), with small plots of the recognition rate over 10 self-labelling steps for DeCAF7 and BOWsift.]
Fig. 4 Results of the Bing-Caltech256 and Bing-SUN experiments with DeCAF7. We report the performance of different domain adaptation methods (big plots) together with the recognition rate obtained in 10 subsequent steps of the self-labelling procedure (small plots). For the latter we show the performance obtained both with DeCAF7 and with BOWsift when having originally 10 samples per class from Bing.
In Figure 4 we present the results obtained using DeCAF7 with the landmark method, and the reshape approach combined respectively with SA and the Domain Adaptation Machine (DAM [8]). More in detail, we use the reshape method to divide the Bing images into two subgroups and we apply SA to adapt each of them to the target. We identify the best subgroup as the one with the highest target performance and we report its results. With DAM, the subgroups are instead considered as two source domains: we assign different weights to their relative importance and show also in this case the best obtained performance (see the footnote below). As reference we also present the performance of the SA and DAM methods without reshaping. Finally, we test a simple self-labelling strategy that was already used in [38]. Differently from the previously described techniques, this method learns a model over the full set of Bing data and progressively selects target samples to augment the training set.
The obtained results go in the same direction as what was observed previously with the Unbias method. Despite the presence of noisy data, subselecting the data (with landmark) or grouping the samples (reshape+SA, reshape+DAM) does not seem to work better than just using all the source data at once. On the other hand, self-labelling [38] consistently improves the original results, with a significant gain in performance especially when only a reduced set of training images per class is available. One well-known drawback of this strategy is that subsequent errors in the target annotations may lead to significant drift from the correct solution. However, when working with DeCAF features this risk appears highly reduced: this can be appreciated by looking at the recognition rate obtained over ten iterations of the target selection procedure, considering in particular the comparison against the corresponding performance obtained when using BOWsift (see the small plots in Figure 4).
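A minimal sketch of such a self-labelling loop (the confidence rule, step size and classifier are assumptions; the procedure in [38] may differ in details):

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_labelling(Xs, ys, Xt, steps=10, per_step=10):
    """Progressively augment the source training set with the most
    confidently predicted target samples (one self-labelling step per
    iteration). Assumes a multiclass problem (e.g. the 40-class setup)."""
    X, y = Xs.copy(), np.asarray(ys).copy()
    remaining = np.arange(len(Xt))
    clf = LinearSVC().fit(X, y)
    for _ in range(steps):
        if len(remaining) == 0:
            break
        conf = clf.decision_function(Xt[remaining]).max(axis=1)  # confidence
        pick = remaining[np.argsort(conf)[-per_step:]]           # most confident
        X = np.vstack([X, Xt[pick]])
        y = np.hstack([y, clf.predict(Xt[pick])])                # pseudo-labels
        remaining = np.setdiff1d(remaining, pick)
        clf = LinearSVC().fit(X, y)                              # retrain
    return clf
```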
6 Conclusions
In this paper we attempted at positioning the dataset bias problem in the CNN-based features arena with an extensive experimental evaluation. At the same time, we pushed the envelope in terms of the scale and complexity of the evaluation protocols.
(Footnote 4) More details on the experimental setup can be found in the supplementary material.
Conclusions
• CNN features (DeCAF7) are more powerful, but this by itself doesn't suffice to avoid domain shift
• Sometimes too specific, and even worse performance (both class- and dataset-dependent)
• Negative bias problem remains
• Standard DA methods do not always perform well on CNN features – more difficult to generalize?
• Best results with self-labelling on target data
Adapting the R-CNN detector from Pascal VOC to Microsoft COCO
RCNN Detector
Figure: Object detection system overview using RCNN. RCNN (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs.
Figure 1: Image samples taken from the COCO validation set and PASCAL VOC 2012 to show the domain shift ((a)-(c) PASCAL, (d)-(f) COCO).
5.1 Experimental Setup
The PASCAL dataset is simpler than COCO and also smaller; hence it has been considered as the source dataset and COCO as the target dataset. For target dataset (COCO) subspace generation, the examples having a classifier score greater than a threshold of 0.4 are considered positive examples for the target subspace. Non-maximum suppression is removed from the process for consistency in the number of samples for subspace generation. For the source dataset (PASCAL), we evaluate the overlap of each object proposal region with the ground truth bounding box and consider the bounding boxes with overlap greater than 0.7 with the ground truth bounding box as candidates for our source subspace generation. The dimensionality of the subspace for both the source and target dataset is kept fixed at 100. Subspace alignment is done between the source and target subspaces, and the source dataset is projected on the target-aligned source subspace for further training. Once the new detectors are trained on the transformed features, the same procedure as in RCNN is applied for detection on projected target image features.

5.2 Results
In this section we provide evidence of the performance of our method, compare our results to other baselines and analyse the results. First we consider the statistical difference between the PASCAL VOC 2012 and COCO validation set data. We run the RCNN detector on both the source and the target data and plot the histograms of the scores obtained for both datasets. It can be observed from the histograms in Figure 2 that there exists a statistical dissimilarity between these datasets; therefore, there is a need for domain adaptation. The histogram evaluation has been done in two settings, first in a category-wise setting and second on the full dataset jointly. The findings of both settings are similar and demonstrate the statistical dissimilarity between these two datasets.
Figure 2: Histograms of scores, one for the COCO dataset and one for PASCAL VOC. The score is taken along the x-axis and the number of object regions with that score along the y-axis.
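Reproducing such a comparison only takes one histogram per domain; a small sketch, assuming the detector scores have already been collected into two arrays:

```python
import matplotlib.pyplot as plt

def compare_score_histograms(coco_scores, pascal_scores, bins=50):
    """Plot detector-score histograms for the two domains; a visible shift
    between them suggests the detector does not transfer unchanged."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
    axes[0].hist(coco_scores, bins=bins)
    axes[0].set_title("COCO")
    axes[1].hist(pascal_scores, bins=bins)
    axes[1].set_title("PASCAL VOC")
    for ax in axes:
        ax.set_xlabel("detector score")
        ax.set_ylabel("# object regions")
    plt.tight_layout()
    plt.show()
```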
Now we evaluate our method on these two datasets. The first baseline to compare our method with is a simple RCNN detector trained on PASCAL VOC 2012. We evaluate the performance of the RCNN detector for all the 20 categories present in the PASCAL datasets.
• Initialize the detection on the target dataset
• Avoid non-maximum suppression while learning the target subspace
• Discard noisy detections during initialization
• Initial RCNN is trained on Pascal VOC 2012
• We generate class-specific target subspaces and apply the subspace alignment approach to each class separately (a sketch of this per-class procedure follows below)
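A sketch of this per-class adaptation step, combining the confidence-thresholded target subspace with the subspace alignment shown earlier; the array names and the retraining details are assumptions, not the paper's exact code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def adapt_rcnn_class(src_feats, src_labels, tgt_feats, tgt_scores,
                     d=100, score_thresh=0.4):
    """Per-class adaptation of an R-CNN detector via subspace alignment.
    tgt_scores: scores of the initial (source-trained) detector on target
    proposals, used to pick confident positives for the target subspace."""
    # Target subspace from confident detections (no non-maximum suppression,
    # to keep enough samples for a d-dimensional PCA).
    tgt_pos = tgt_feats[tgt_scores > score_thresh]
    T = PCA(n_components=d).fit(tgt_pos).components_.T
    # Source subspace; src_feats is assumed to contain only proposals with
    # IoU > 0.7 against the ground truth, as described above.
    S = PCA(n_components=d).fit(src_feats).components_.T
    Xs_aligned = src_feats @ (S @ (S.T @ T))  # target-aligned source features
    det = LinearSVC().fit(Xs_aligned, src_labels)  # retrain detector SVM
    return det, T  # at test time: score det on tgt_feats @ T
```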
Adapting RCNN Detector: Results

No. | class     | RCNN - No Transform | RCNN - Full Transform | Proposed | DPM
 1  | plane     | 36.72 | 35.44 | 40.1  | 35.1
 2  | bicycle   | 21.26 | 18.95 | 23.28 |  1.9
 3  | bird      | 12.50 | 12.37 | 13.63 |  3.7
 4  | boat      | 10.45 |  8.8  | 10.61 |  2.3
 5  | bottle    |  8.75 | 11.46 |  8.11 |  7
 6  | bus       | 37.47 | 38.12 | 40.64 | 45.4
 7  | car       | 20.6  | 20.4  | 22.5  | 18.3
 8  | cat       | 42.4  | 43.6  | 45.6  |  8.6
 9  | chair     |  9.6  |  6.3  |  8.8  |  6.3
10  | cow       | 23.28 | 20.40 | 25.3  | 17
11  | table     | 15.9  | 14.9  | 17.3  |  4.8
12  | dog       | 28.42 | 32.72 | 31.3  |  5.8
13  | horse     | 30.7  | 31.11 | 32.9  | 35.3
14  | motorbike | 31.2  | 29.05 | 34.6  | 25.4
15  | person    | 27.8  | 28.8  | 30.9  | 17.5
16  | plant     | 12.65 |  7.34 | 13.7  |  4.1
17  | sheep     | 19.99 | 21.04 | 22.4  | 14.5
18  | sofa      | 14.6  |  8.4  | 15.5  |  9.6
19  | train     | 39.2  | 38.4  | 41.64 | 31.7
20  | tv        | 28.6  | 26.4  | 29.9  | 27.9
    | Mean AP   | 23.60 | 22.7  | 25.43 | 16.9

Table: Domain adaptation detection results on the validation set of the COCO dataset.
Questions?