TRANSCRIPT
Domain Adaptation in a deep learning context
Tinne Tuytelaars
Work in collaboration with …
• T. Tommasi, N. Patricia, B. Caputo, T. Tuytelaars, "A Deeper Look at Dataset Bias", GCPR 2015
• A. Raj, V. Namboodiri and T. Tuytelaars, "Subspace Alignment Based Domain Adaptation for RCNN Detector", BMVC 2015
Thanks
Thanks to
(a) Prof. R. Hegde (b) Prof. A. Mukerjee
Special thanks to
(a) Prof. V. Namboodiri (b) Prof. T. Tuytelaars (c) Sachin (for installing Caffe :P)
Thanks to all the committee members
Thanks to my colleagues
Anant Raj (IIT Kanpur)
Some key questions
• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?
Domain Adaptation
• Domain = a set of data defined by its marginal and conditional distributions with respect to the assigned labels
• Dataset bias
– Capture bias
– Category or label bias
– Negative bias
How relevant is DA with CNN?
• Most widely used DA setting for CV:
– Office dataset
– SURF + BoW representation
• Modern features (CNN) give better results even without DA
• “CNN features have been trained on very large datasets, to be robust to all possible types of variability. So we do not need any DA anymore.”
How relevant is DA with DNN?
• Always compare DA algorithms on top of the same representations
• Performance on target depends on:
– performance on source
– domain divergence (one way to estimate it is sketched below)
– a model that fits both source and target (e.g. negative bias, annotator bias)
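The slides list domain divergence as a driver of target performance but do not say how to measure it. A common estimate is the proxy A-distance of Ben-David et al. (cited as [2] later in this talk): train a classifier to separate source from target features; the harder that task, the smaller the divergence. A minimal sketch, not from the talk, with hypothetical feature arrays as inputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(Xs, Xt, seed=0):
    """Estimate domain divergence via the proxy A-distance:
    train a classifier to separate source from target features.
    An error near 0.5 (distance near 0) means the domains look alike."""
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)   # domain-classification error
    return 2.0 * (1.0 - 2.0 * err)      # in [-2, 2]; larger = more divergent

# Hypothetical usage with CNN features from two datasets:
# pad = proxy_a_distance(decaf_office, decaf_caltech)
```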
How relevant is DA with DNN?
• CNN features > SURF + BoW
– More robust
– Learned on larger datasets
• CNN features are data-driven
– Be careful how to collect data; avoid unwanted biases
– May make them more sensitive to domain shifts…
Some key questions
• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?
Integration of DA in DNN
• Learn representations that yield good classification AND are robust to domain changes, using deep / end-to-end learning
• “We can learn representations that are even more robust using DA”
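The slides do not commit to a concrete architecture for end-to-end DA. One well-known instantiation from the same period is the gradient reversal layer of Ganin and Lempitsky's DANN; the PyTorch sketch below is purely illustrative (layer sizes are invented), not the method referred to in the talk:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the
    backward pass, so the feature extractor learns to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANN(nn.Module):
    """Feature extractor + label classifier + adversarial domain classifier.
    Layer sizes are placeholders, not from the talk."""
    def __init__(self, in_dim=4096, hid=256, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.label_clf = nn.Linear(hid, n_classes)
        self.domain_clf = nn.Linear(hid, 2)

    def forward(self, x, lambd=1.0):
        f = self.features(x)
        return self.label_clf(f), self.domain_clf(GradReverse.apply(f, lambd))
```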
Practical applications
• Use models pretrained on ImageNet on my own photo collection
• Use an off-the-shelf pedestrian detector even under low illumination conditions
• …
-> calls for simple, light adaptation methods
• Can we still use the simple DA methods developed before? (in particular, we focus on subspace-based methods; see the sketch below)
• Are methods developed for BoW features suitable for CNN features?
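Subspace alignment (SA, Fernando et al., ICCV 2013), the method the R-CNN experiments later in this talk build on, is indeed light: learn a PCA subspace per domain and align the source basis to the target one with a single matrix product. A minimal sketch (the dimension d=100 matches the detector setup described later; everything else is a generic implementation, not the authors' code):

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(Xs, Xt, d=100):
    """Subspace alignment: learn a PCA basis per domain, align the
    source basis to the target one, and project the data accordingly."""
    S = PCA(n_components=d).fit(Xs).components_.T   # (m, d) source basis
    T = PCA(n_components=d).fit(Xt).components_.T   # (m, d) target basis
    M = S.T @ T                                     # (d, d) alignment matrix
    Xs_aligned = Xs @ (S @ M)   # source data in the target-aligned subspace
    Xt_proj = Xt @ T            # target data in its own subspace
    return Xs_aligned, Xt_proj

# A classifier trained on Xs_aligned can be applied directly to Xt_proj.
```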
Some key questions
• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?
A cross-dataset benchmark
Cross-dataset benchmark
• Sparse set
Cross-dataset benchmark
• Dense set
New evaluation criterion
• % drop
• CD measure
[Figure: name-the-dataset results on the sparse set (12 datasets). Left: recognition rate vs. number of training samples per dataset for BOWsift, DeCAF6, DeCAF7 and chance. Middle/right: confusion matrices for BOWsift and DeCAF7 over AWA, AYH, C101, ETH, MSR, OFC, RGBD, PAS, SUN, BING, C256, IMG.]
Fig. 1 Name the dataset experiment over the sparse setup with 12 datasets. We use a linear SVM classifier with C value tuned by 5-fold cross validation on the training set. We show average and standard deviation results over 10 repetitions. The title of each confusion matrix indicates the feature used for the corresponding experiments.
One way to quantitatively evaluate the cross-dataset generalization was previously proposed in [39]. It consists of measuring the percentage drop (% Drop) between Self and Mean Others. However, being a relative measure, it loses the information on the value of Self, which is important if we want to compare the effect of different learning methods or different representations. In fact, a 75% drop w.r.t. a 100% self average precision has a different meaning than a 75% drop w.r.t. a 25% self average precision. To overcome this drawback, we propose here a different Cross-Dataset (CD) measure, defined as
$$\mathrm{CD} = \frac{1}{1 + \exp\{-(\mathrm{Self} - \mathrm{Mean\;Others})/100\}}\,.$$
CD uses directly the difference (Self − Mean Others), while the sigmoid function rescales this value between 0 and 1. This allows for the comparison among the results of experiments with different setups. Specifically, CD values over 0.5 indicate the presence of a bias, which becomes more significant as CD gets close to 1. On the other hand, CD values below 0.5 correspond to cases where either Mean Others ≥ Self or the Self result is very low. Both these conditions indicate that the learned model is not reliable on the data of its own collection, and it is difficult to draw any conclusion from its cross-dataset performance.
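For concreteness, both measures are one-liners; a sketch, with the car/BOWsift row trained on ETH80 (from Table 1 later in the talk) as a worked check:

```python
import numpy as np

def percent_drop(self_ap, mean_others_ap):
    """Relative % drop between in-dataset and cross-dataset performance."""
    return 100.0 * (self_ap - mean_others_ap) / self_ap

def cd_measure(self_ap, mean_others_ap):
    """Cross-Dataset measure: sigmoid of the absolute difference.
    CD > 0.5 signals bias; CD close to 1 signals strong bias."""
    return 1.0 / (1.0 + np.exp(-(self_ap - mean_others_ap) / 100.0))

# Worked check (car, BOWsift, trained on ETH80, Table 1 below):
# self = 98.5, mean others = (14.7 + 9.1 + 25.2) / 3 ≈ 16.3
# percent_drop(98.5, 16.3) ≈ 83.4   cd_measure(98.5, 16.3) ≈ 0.69
```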
4 Studying the Sparse set
4.1 Name the Dataset
With the aim of getting an initial idea on the relation among the datasets and how different representations capture their specific content, we start our analysis by running the name the dataset test on the sparse data setup. We extract randomly 1000 images from each of the 12 collections and we train a 12-way linear SVM classifier that we then test on a disjoint set of 300 images. The experiment is repeated 10 times with different data splits and we report the obtained average results in Figure 1. The plot on the left shows the recognition rate as a function of the training set size and indicates that DeCAF allows for a much better separation among the collections than what is obtained with BOWsift. In particular, DeCAF7 shows an advantage over DeCAF6 for
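A minimal sketch of this name-the-dataset protocol (for brevity C is fixed here, whereas the paper tunes it by 5-fold cross-validation; the input layout is an assumption):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

def name_the_dataset(features_per_dataset, n_train=1000, n_test=300, seed=0):
    """Train a 12-way linear SVM to predict which collection an image
    comes from. `features_per_dataset` is a list of (n_images, dim) arrays."""
    rng = np.random.default_rng(seed)
    X_tr, y_tr, X_te, y_te = [], [], [], []
    for label, feats in enumerate(features_per_dataset):
        idx = rng.permutation(len(feats))
        X_tr.append(feats[idx[:n_train]])
        y_tr += [label] * n_train
        X_te.append(feats[idx[n_train:n_train + n_test]])
        y_te += [label] * n_test
    clf = LinearSVC(C=1.0).fit(np.vstack(X_tr), y_tr)  # C fixed for brevity
    pred = clf.predict(np.vstack(X_te))
    return clf.score(np.vstack(X_te), y_te), confusion_matrix(y_te, pred)
```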
Some experimental results
[Figure: per-class recognition rates from the multiclass cross-dataset generalization test, for BOWsift and DeCAF7, over the train-test pairs C256-C256, C256-IMG and C256-SUN; plus confusion matrices among the classes people, umbrella and laptop for C256-C256 and C256-IMG.]
Table 3 Recognition rate per class from the multiclass cross-dataset generalization test. C256, IMG and SUN stand respectively for the Caltech256, Imagenet and SUN datasets. We indicate with “train-test” the pair of datasets used in training and testing.
Table 4 Left: Imagenet images annotated as Caltech256 data with BOWsift but correctly recognized with DeCAF7. Right: Caltech256 images annotated as Imagenet by BOWsift but correctly recognized with DeCAF7.
Since all the datasets contain the same object classes, we are in fact reproducing a setup generally adopted for domain adaptation [16,12]. By simplifying the dataset bias problem and identifying each dataset with a domain, we can interpret the results of this experiment as an indication of the domain divergence [2] and deduce that a model trained on SUN will perform poorly on the object-centric collections and vice versa. On the other hand, a better cross-dataset generalization should be observed among Imagenet, Caltech256 and Bing. We verify it in the following sections.
5.2 Cross-dataset generalization test.
We consider the same setup used before with 15 samples per class from each collection in training and 5 samples per class in test. However, now we train a one-vs-all
Some experimental results
Car — AP, training on rows, testing on columns (P, S, E, M):

BOWsift        P     S     E     M   | % Drop |  CD
  P          28.6  26.4  26.7  29.2 |   3.9  | 0.50
  S          25.4  26.8  25.7  25.7 |   4.3  | 0.50
  E          14.7   9.1  98.5  25.2 |  83.4  | 0.69
  M          17.4   8.7   9.7  90.5 |  86.8  | 0.69

DeCAF6         P     S     E     M   | % Drop |  CD
  P          26.2  19.2  44.3  42.6 | -35.1  | 0.47
  S          15.4  15.6  19.9  17.9 | -13.9  | 0.49
  E          27.3  18.7 100.0  93.5 |  53.5  | 0.63
  M          41.6  12.8  95.8  99.7 |  49.8  | 0.62

DeCAF7         P     S     E     M   | % Drop |  CD
  P          14.5  11.8  28.9  28.3 | -59.1  | 0.47
  S          27.1  27.4  32.8  28.1 |  -6.8  | 0.49
  E          31.0  24.0 100.0  90.9 |  51.3  | 0.62
  M          43.8  14.9  93.8  99.7 |  49.0  | 0.62

Cow:

BOWsift        P     S     E     M   | % Drop |  CD
  P          14.1  18.8  11.9  17.9 | -15.1  | 0.49
  S          16.1  38.3  12.5  27.1 |  51.4  | 0.54
  E           2.3   2.3  29.8   2.0 |  92.6  | 0.57
  M           8.6  17.4   2.4  52.5 |  82.0  | 0.61

DeCAF6         P     S     E     M   | % Drop |  CD
  P          35.3  34.4  20.2  39.1 |  11.4  | 0.51
  S          18.6  62.7  10.7  33.5 |  66.7  | 0.60
  E           4.4   8.3  91.9   4.1 |  93.9  | 0.70
  M          13.8  46.7   9.3  97.2 |  76.0  | 0.68

DeCAF7         P     S     E     M   | % Drop |  CD
  P          46.6  44.0  31.3  47.3 |  12.3  | 0.51
  S          12.8  40.1  12.1  23.6 |  59.7  | 0.56
  E           7.2   7.7  90.1   5.3 |  92.5  | 0.70
  M          14.1  39.4  10.4  97.7 |  78.2  | 0.68

Cow — fixed negatives:

BOWsift        P     S     E     M   | % Drop |  CD
  P          14.1  16.2  12.0  18.6 | -10.7  | 0.49
  S          18.7  38.3  17.0  35.8 |  37.8  | 0.53
  E          23.5  22.2  29.8  14.0 |  33.3  | 0.52
  M           8.9   9.4   2.0  52.5 |  87.1  | 0.61

DeCAF6         P     S     E     M   | % Drop |  CD
  P          35.3  32.7  16.9  46.6 |   9.1  | 0.50
  S          38.5  62.7  21.0  69.5 |  31.4  | 0.54
  E           6.1   7.9  91.9   4.6 |  93.2  | 0.70
  M          65.2  62.6  51.5  97.2 |  38.5  | 0.59

DeCAF7         P     S     E     M   | % Drop |  CD
  P          46.6  39.1  26.4  48.9 |  18.2  | 0.52
  S          25.7  40.1  12.8  43.3 |  31.9  | 0.53
  E          11.2   9.0  90.1  11.0 |  88.4  | 0.69
  M          69.4  55.1  47.4  97.7 |  41.3  | 0.59
Table 1 Binary cross-dataset generalization for two example categories, car and cow. Each matrix contains the object classification performance (AP) when training on one dataset (rows) and testing on another (columns). The diagonal elements correspond to the self results, i.e. training and testing on the same dataset. The percentage difference between the self results and the average of the other results per row corresponds to the value indicated in the column “% Drop”. CD is our newly proposed cross-dataset measure. We report in bold the values higher than 0.5. P, S, E, M stand respectively for the datasets Pascal VOC07, SUN, ETH80, MSRCORID.
large number of training samples. From the confusion matrices (middle and right in Figure 1) we see that it is easy to distinguish the ETH80, Office and RGB-D datasets from all the others regardless of the used representation, given the specific lab-nature of these collections. DeCAF captures better than BOWsift the characteristics of A-Yahoo, MSRCORID, Pascal VOC07 and SUN, improving the recognition results on them. Finally, Bing, Caltech256 and Imagenet are the datasets with the highest confusion level, an effect mainly due to the large number of classes and images per class. Still, this confusion decreases when using DeCAF.
The information obtained in this way over the whole collections provides us only with a small part of the full picture about the bias problem. The dataset recognition performance does not give an insight on how the classes in each collection are related among each other, nor how a model defined for a class in one dataset will generalize to the others. We look into this problem in the following paragraph.
4.2 Cross-dataset generalization test
With a procedure similar to that in [39], we perform a cross-dataset generalization experiment over two specific object classes shared among multiple datasets: car and cow. For the class car we selected randomly from four collections of the sparse set two groups of 50 positive and 1000 negative examples respectively for training and
Undoing dataset bias
are kept separated as belonging to different tasks and a max-margin model is learned from the information shared over all of them.
4.3.1 Method.
Let us assume to have $i = 1, \ldots, n$ datasets $D_1, \ldots, D_n$, each consisting of $s_i$ training samples $D_i = \{(x^i_1, y^i_1), \ldots, (x^i_{s_i}, y^i_{s_i})\}$. Here $x^i_j \in \mathbb{R}^m$ represents the $m$-dimensional feature vector and $y^i_j \in \{-1, 1\}$ represents the label for the example $j$ of dataset $D_i$. Specifically, all the datasets share an object class and its images are annotated with label 1, while all the other samples in the different datasets have label -1. The Unbias algorithm [25] consists in learning a binary model per dataset $w_i = w_{vw} + \Delta_i$, where $w_{vw}$ is a model for the visual world, while $\Delta_i$ is the bias for each dataset. These two parts are obtained by solving the following optimization problem:
$$\min_{w_{vw},\,\Delta_i,\,\xi,\,\rho} \;\; \frac{1}{2}\|w_{vw}\|^2 \;+\; \frac{\lambda}{2}\sum_{i=1}^{n}\|\Delta_i\|^2 \;+\; C_1 \sum_{i=1}^{n}\sum_{j=1}^{s_i} \xi_{ij} \;+\; C_2 \sum_{i=1}^{n}\sum_{j=1}^{s_i} \rho_{ij} \qquad (1)$$

subject to

$$w_i = w_{vw} + \Delta_i \qquad (2)$$
$$y^i_j \, w_{vw} \cdot x^i_j \;\ge\; 1 - \xi_{ij} \qquad (3)$$
$$y^i_j \, w_i \cdot x^i_j \;\ge\; 1 - \rho_{ij} \qquad (4)$$
$$\xi_{ij} \ge 0, \quad \rho_{ij} \ge 0 \qquad (5)$$
$$i = 1, \ldots, n \qquad j = 1, \ldots, s_i. \qquad (6)$$
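The paper solves this as a max-margin (QP) problem; for illustration only, here is a didactic subgradient-descent sketch of the same objective, writing the slacks as hinge losses. The solver, learning rate and hyperparameter defaults are my own choices, not the paper's:

```python
import numpy as np

def unbias_fit(datasets, lam=1.0, C1=1.0, C2=1.0, lr=1e-3, epochs=100):
    """Subgradient descent on the Unbias objective: w_i = w_vw + delta_i,
    with hinge losses for both the visual-world model (slack xi) and the
    per-dataset models (slack rho). `datasets` is a list of (X, y) pairs,
    y in {-1, +1}."""
    m = datasets[0][0].shape[1]
    w_vw = np.zeros(m)
    deltas = [np.zeros(m) for _ in datasets]
    for _ in range(epochs):
        g_vw = w_vw.copy()                  # gradient of (1/2)||w_vw||^2
        g_d = [lam * d for d in deltas]     # gradient of (lam/2)||delta_i||^2
        for i, (X, y) in enumerate(datasets):
            margin_vw = y * (X @ w_vw)
            margin_i = y * (X @ (w_vw + deltas[i]))
            viol_vw = margin_vw < 1         # active hinge terms (xi_ij > 0)
            viol_i = margin_i < 1           # active hinge terms (rho_ij > 0)
            g_vw -= C1 * (y[viol_vw, None] * X[viol_vw]).sum(axis=0)
            g_vw -= C2 * (y[viol_i, None] * X[viol_i]).sum(axis=0)
            g_d[i] -= C2 * (y[viol_i, None] * X[viol_i]).sum(axis=0)
        w_vw -= lr * g_vw
        deltas = [d - lr * g for d, g in zip(deltas, g_d)]
    return w_vw, deltas
```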
[Figure: % AP difference against the baseline for Car (3 datasets), Dog, Cow, Car (5 datasets) and Chair, comparing BOWsift and DeCAF7.]

Cow    Self   Other
M      98.97  86.56
P      53.67  34.29
S      40.89  29.20
A      44.76  25.94
E      99.40  86.56
Fig. 2 Percentage difference in average precision between the results of Unbias and the baseline All over each target dataset. P, S, E, M, A, C1, C2, OF stand respectively for the datasets Pascal VOC07, SUN, ETH80, MSRCORID, AwA, Caltech101, Caltech256 and Office. With O we indicate the overall value, i.e. the average of the percentage difference over all the considered datasets (shown in black).
DA methods with DeCAF7
[Figure: recognition rate vs. number of training samples per class for the 40-class Bing-Caltech256 and Bing-SUN experiments (Bing-Caltech, Caltech-Caltech, landmark, SA, reshape+SA, DAM, reshape+DAM, self-labelling), with small plots of the recognition rate over 10 self-labelling steps for DeCAF7 and BOWsift.]
Fig. 4 Results of the Bing-Caltech256 and Bing-SUN experiments with DeCAF7. We report the performance of different domain adaptation methods (big plots) together with the recognition rate obtained in 10 subsequent steps of the self-labelling procedure (small plots). For the latter we show the performance obtained both with DeCAF7 and with BOWsift when having originally 10 samples per class from Bing.
In Figure 4 we present the results obtained using DeCAF7 with the landmark method, and the reshape approach combined respectively with SA and the Domain Adaptation Machine (DAM [8]). More in detail, we use the reshape method to divide the Bing images into two subgroups and we apply SA to adapt each of them to the target. We identify the best subgroup as the one with the highest target performance and we report its results. With DAM, the subgroups are instead considered as two source domains: we assign different weights to their relative importance and show also in this case the best obtained performance (see the footnote below). As reference we also present the performance of the SA and DAM methods without reshaping. Finally, we test a simple self-labelling strategy that was already used in [38]. Differently from the previously described techniques, this method learns a model over the full set of Bing data and progressively selects target samples to augment the training set.
The obtained results go in the same direction as what was observed previously with the Unbias method. Despite the presence of noisy data, subselecting the data (with landmark) or grouping the samples (reshape+SA, reshape+DAM) does not seem to work better than just using all the source data at once. On the other hand, self-labelling [38] consistently improves the original results, with a significant gain in performance especially when only a reduced set of training images per class is available. One well-known drawback of this strategy is that subsequent errors in the target annotations may lead to significant drift from the correct solution. However, when working with DeCAF features this risk appears highly reduced: this can be appreciated by looking at the recognition rate obtained over ten iterations of the target selection procedure, considering in particular the comparison against the corresponding performance obtained when using BOWsift (see the small plots in Figure 4).
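A minimal sketch of such a self-labelling loop (the confidence rule, step size and classifier are assumptions; the procedure in [38] may differ in details):

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_labelling(Xs, ys, Xt, steps=10, per_step=10):
    """Progressively augment the source training set with the most
    confidently predicted target samples (one self-labelling step per
    iteration). Assumes a multiclass problem (e.g. the 40-class setup)."""
    X, y = Xs.copy(), np.asarray(ys).copy()
    remaining = np.arange(len(Xt))
    clf = LinearSVC().fit(X, y)
    for _ in range(steps):
        if len(remaining) == 0:
            break
        conf = clf.decision_function(Xt[remaining]).max(axis=1)  # confidence
        pick = remaining[np.argsort(conf)[-per_step:]]           # most confident
        X = np.vstack([X, Xt[pick]])
        y = np.hstack([y, clf.predict(Xt[pick])])                # pseudo-labels
        remaining = np.setdiff1d(remaining, pick)
        clf = LinearSVC().fit(X, y)                              # retrain
    return clf
```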
6 Conclusions
In this paper we attempted at positioning the dataset bias problem in the CNN-based features arena with an extensive experimental evaluation. At the same time, we pushed the envelope in terms of the scale and complexity of the evaluation protocols.
(Footnote 4) More details on the experimental setup can be found in the supplementary material.
Conclusions
• CNN features (DeCAF7) are more powerful, but this by itself doesn't suffice to avoid domain shift
• Sometimes too specific, and even worse performance (both class- and dataset-dependent)
• Negative bias problem remains
• Standard DA methods do not always perform well on CNN features – more difficult to generalize?
• Best results with self-labelling on target data
Adapting the R-CNN detector from Pascal VOC to Microsoft COCO
RCNN Detector
Figure: Object detection system overview using RCNN. RCNN (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs.
Figure 1: Image samples taken from the COCO validation set and PASCAL VOC 2012 to show the domain shift ((a)-(c) PASCAL, (d)-(f) COCO).
5.1 Experimental Setup
The PASCAL dataset is simpler than COCO and also smaller; hence it has been considered as the source dataset and COCO as the target dataset. For target dataset (COCO) subspace generation, the examples having a classifier score greater than a threshold of 0.4 are considered positive examples for the target subspace. Non-maximum suppression is removed from the process for consistency in the number of samples for subspace generation. For the source dataset (PASCAL), we evaluate the overlap of each object proposal region with the ground truth bounding box and consider the bounding boxes with overlap greater than 0.7 with the ground truth bounding box as candidates for our source subspace generation. The dimensionality of the subspace for both the source and target dataset is kept fixed at 100. Subspace alignment is done between the source and target subspaces, and the source dataset is projected on the target-aligned source subspace for further training. Once the new detectors are trained on the transformed features, the same procedure as in RCNN is applied for detection on projected target image features.

5.2 Results
In this section we provide evidence of the performance of our method, compare our results to other baselines and analyse the results. First we consider the statistical difference between the PASCAL VOC 2012 and COCO validation set data. We run the RCNN detector on both the source and the target data and plot the histograms of the scores obtained for both datasets. It can be observed from the histograms in Figure 2 that there exists a statistical dissimilarity between these datasets; therefore, there is a need for domain adaptation. The histogram evaluation has been done in two settings, first in a category-wise setting and second on the full dataset jointly. The findings of both settings are similar and demonstrate the statistical dissimilarity between these two datasets.
Figure 2: Histograms of scores, one for the COCO dataset and one for PASCAL VOC. The score is taken along the x-axis and the number of object regions with that score along the y-axis.
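Reproducing such a comparison only takes one histogram per domain; a small sketch, assuming the detector scores have already been collected into two arrays:

```python
import matplotlib.pyplot as plt

def compare_score_histograms(coco_scores, pascal_scores, bins=50):
    """Plot detector-score histograms for the two domains; a visible shift
    between them suggests the detector does not transfer unchanged."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
    axes[0].hist(coco_scores, bins=bins)
    axes[0].set_title("COCO")
    axes[1].hist(pascal_scores, bins=bins)
    axes[1].set_title("PASCAL VOC")
    for ax in axes:
        ax.set_xlabel("detector score")
        ax.set_ylabel("# object regions")
    plt.tight_layout()
    plt.show()
```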
Now we evaluate our method on these two datasets. The first baseline to compare our method with is a simple RCNN detector trained on PASCAL VOC 2012. We evaluate the performance of the RCNN detector for all the 20 categories present in the PASCAL datasets.
• Initialize the detection on the target dataset
• Avoid non-maximum suppression while learning the target subspace
• Discard noisy detections during initialization
• Initial RCNN is trained on Pascal VOC 2012
• We generate class-specific target subspaces and apply the subspace alignment approach to each class separately (a sketch of this per-class procedure follows below)
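A sketch of this per-class adaptation step, combining the confidence-thresholded target subspace with the subspace alignment shown earlier; the array names and the retraining details are assumptions, not the paper's exact code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def adapt_rcnn_class(src_feats, src_labels, tgt_feats, tgt_scores,
                     d=100, score_thresh=0.4):
    """Per-class adaptation of an R-CNN detector via subspace alignment.
    tgt_scores: scores of the initial (source-trained) detector on target
    proposals, used to pick confident positives for the target subspace."""
    # Target subspace from confident detections (no non-maximum suppression,
    # to keep enough samples for a d-dimensional PCA).
    tgt_pos = tgt_feats[tgt_scores > score_thresh]
    T = PCA(n_components=d).fit(tgt_pos).components_.T
    # Source subspace; src_feats is assumed to contain only proposals with
    # IoU > 0.7 against the ground truth, as described above.
    S = PCA(n_components=d).fit(src_feats).components_.T
    Xs_aligned = src_feats @ (S @ (S.T @ T))  # target-aligned source features
    det = LinearSVC().fit(Xs_aligned, src_labels)  # retrain detector SVM
    return det, T  # at test time: score det on tgt_feats @ T
```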
Adapting RCNN Detector: Results

No. | class     | RCNN - No Transform | RCNN - Full Transform | Proposed | DPM
 1  | plane     | 36.72 | 35.44 | 40.1  | 35.1
 2  | bicycle   | 21.26 | 18.95 | 23.28 |  1.9
 3  | bird      | 12.50 | 12.37 | 13.63 |  3.7
 4  | boat      | 10.45 |  8.8  | 10.61 |  2.3
 5  | bottle    |  8.75 | 11.46 |  8.11 |  7
 6  | bus       | 37.47 | 38.12 | 40.64 | 45.4
 7  | car       | 20.6  | 20.4  | 22.5  | 18.3
 8  | cat       | 42.4  | 43.6  | 45.6  |  8.6
 9  | chair     |  9.6  |  6.3  |  8.8  |  6.3
10  | cow       | 23.28 | 20.40 | 25.3  | 17
11  | table     | 15.9  | 14.9  | 17.3  |  4.8
12  | dog       | 28.42 | 32.72 | 31.3  |  5.8
13  | horse     | 30.7  | 31.11 | 32.9  | 35.3
14  | motorbike | 31.2  | 29.05 | 34.6  | 25.4
15  | person    | 27.8  | 28.8  | 30.9  | 17.5
16  | plant     | 12.65 |  7.34 | 13.7  |  4.1
17  | sheep     | 19.99 | 21.04 | 22.4  | 14.5
18  | sofa      | 14.6  |  8.4  | 15.5  |  9.6
19  | train     | 39.2  | 38.4  | 41.64 | 31.7
20  | tv        | 28.6  | 26.4  | 29.9  | 27.9
    | Mean AP   | 23.60 | 22.7  | 25.43 | 16.9

Table: Domain adaptation detection results on the validation set of the COCO dataset.
Questions?