Domain Adaptation in a deep learning context
Tinne Tuytelaars


Page 1:

Domain Adaptation in a deep learning context

Tinne Tuytelaars

Page 2:

Work in collaboration with …

• T. Tommasi, N. Patricia, B. Caputo, T. Tuytelaars, "A Deeper Look at Dataset Bias", GCPR 2015
• A. Raj, V. Namboodiri and T. Tuytelaars, "Subspace Alignment Based Domain Adaptation for RCNN Detector", BMVC 2015


Page 3:

Some key questions

• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?


Page 5:

Domain Adaptation

• Domain = a set of data defined by its marginal and conditional distributions with respect to the assigned labels
• Dataset bias:
  – Capture bias
  – Category or label bias
  – Negative bias

Page 6:

How relevant is DA with CNN?

• Most widely used DA setting for CV:
  – Office Dataset
  – SURF + BoW representation
• Modern features (CNN) give better results even without DA
• "CNN features have been trained on very large datasets, to be robust to all possible types of variability. So we do not need any DA anymore."

Page 7:

How relevant is DA with DNN?

• Always compare DA algorithms on top of the same representations
• Performance on target depends on (cf. the bound sketched below):
  – performance on source
  – domain divergence
  – a model that fits both source and target (e.g. negative bias, annotator bias)
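These three factors echo the classical domain adaptation bound of Ben-David et al., which the slide does not state explicitly; a sketch for reference:

$$\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda$$

Here $\epsilon_S(h)$ and $\epsilon_T(h)$ are the source and target errors of a hypothesis $h$, $d_{\mathcal{H}\Delta\mathcal{H}}$ measures the divergence between the two domains, and $\lambda$ is the error of the best hypothesis that fits source and target jointly, matching the three bullets above.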

Page 8:

How relevant is DA with DNN?

• CNN features > SURF + BoW
  – More robust
  – Learned on larger datasets
• CNN features are data-driven
  – Be careful how to collect data; avoid unwanted biases
  – May make them more sensitive to domain shifts…

Page 9:

Some key questions

• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?

Page 10:

Integration of DA in DNN

• Learn representations that yield good classification AND are robust to domain changes, using deep / end-to-end learning
• "We can learn representations that are even more robust using DA"

Page 11:

Practical applications

• Use models pretrained on ImageNet on my own photo collection
• Use an off-the-shelf pedestrian detector even under low-illumination conditions
• …

-> calls for simple, light adaptation methods

Page 12:

• Can we still use the simple DA methods developed before? (In particular, we focus on subspace-based methods.)
• Are methods developed for BoW features suitable for CNN features?

Page 13:

Some key questions

• How relevant is DA in combination with CNN features?
• Is integration of DA in end-to-end learning the ultimate answer?
• How do we properly evaluate DA methods?
• Beyond classification?

Page 14:

A cross-dataset benchmark

Page 15:

Cross-dataset benchmark

• Sparse set

Page 16:

Cross-dataset benchmark

• Dense set

Page 17:

New evaluation criterion

• % drop
• CD measure

[Figure residue from "A Deeper Look at Dataset Bias", p. 5: a plot of % recognition rate vs. number of training samples per dataset (sparse set, 12 datasets; curves for BOWsift, DeCAF6, DeCAF7 and chance), and confusion matrices (assigned vs. correct) for BOWsift and DeCAF7 over the datasets AWA, AYH, C101, ETH, MSR, OFC, RGBD, PAS, SUN, BING, C256, IMG.]

Fig. 1 Name the dataset experiment over the sparse setup with 12 datasets. We use a linear SVM classifier with the C value tuned by 5-fold cross-validation on the training set. We show average and standard deviation results over 10 repetitions. The title of each confusion matrix indicates the feature used for the corresponding experiments.

One way to quantitatively evaluate cross-dataset generalization was previously proposed in [39]. It consists of measuring the percentage drop (% Drop) between Self and Mean Others. However, being a relative measure, it loses the information on the value of Self, which is important if we want to compare the effect of different learning methods or different representations. In fact, a 75% drop w.r.t. a 100% self average precision has a different meaning than a 75% drop w.r.t. a 25% self average precision. To overcome this drawback, we propose here a different Cross-Dataset (CD) measure defined as

$$\mathrm{CD} = \frac{1}{1 + \exp\{-(\text{Self} - \text{Mean Others})/100\}}\,.$$

CD directly uses the difference (Self − Mean Others), while the sigmoid function rescales this value between 0 and 1. This allows for comparison among the results of experiments with different setups. Specifically, CD values over 0.5 indicate the presence of a bias, which becomes more significant as CD gets close to 1. On the other hand, CD values below 0.5 correspond to cases where either Mean Others ≥ Self or the Self result is very low. Both these conditions indicate that the learned model is not reliable on the data of its own collection, and it is difficult to draw any conclusion from its cross-dataset performance.
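The CD measure is straightforward to compute from the definition above; a minimal Python sketch (function and variable names are ours, not from the paper):

```python
import math

def cd_measure(self_perf: float, mean_others: float) -> float:
    """Cross-Dataset (CD) measure: a sigmoid of the Self - Mean Others gap.

    Both arguments are percentages (e.g. average precision in [0, 100]).
    Values above 0.5 indicate dataset bias; values near 1 indicate strong bias.
    """
    return 1.0 / (1.0 + math.exp(-(self_perf - mean_others) / 100.0))

# A 75% relative drop from Self = 100 vs. from Self = 25 gives different CD
# values -- exactly the information the relative % Drop measure throws away.
print(cd_measure(100.0, 25.0))   # ~0.68 -> clear bias
print(cd_measure(25.0, 6.25))    # ~0.55 -> much weaker evidence of bias
```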

4 Studying the Sparse set

4.1 Name the Dataset

With the aim of getting an initial idea of the relations among the datasets and how different representations capture their specific content, we start our analysis by running the name the dataset test on the sparse data setup. We randomly extract 1000 images from each of the 12 collections and train a 12-way linear SVM classifier that we then test on a disjoint set of 300 images. The experiment is repeated 10 times with different data splits, and we report the obtained average results in Figure 1. The plot on the left shows the recognition rate as a function of the training set size and indicates that DeCAF allows for a much better separation among the collections than what is obtained with BOWsift. In particular, DeCAF7 shows an advantage over DeCAF6 for
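The protocol just described (a 12-way linear SVM over randomly drawn per-dataset samples) is easy to reproduce; a minimal scikit-learn sketch, where all names and the candidate C grid are ours:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def name_the_dataset(features_per_dataset, n_train=1000, n_test=300, seed=0):
    """'Name the dataset' test: guess which collection an image comes from.

    features_per_dataset: list of (n_images, dim) arrays, one per dataset,
    each assumed to hold at least n_train + n_test rows.
    """
    rng = np.random.default_rng(seed)
    Xtr, ytr, Xte, yte = [], [], [], []
    for label, X in enumerate(features_per_dataset):
        idx = rng.permutation(len(X))
        Xtr.append(X[idx[:n_train]])
        ytr.extend([label] * n_train)
        Xte.append(X[idx[n_train:n_train + n_test]])   # disjoint test split
        yte.extend([label] * n_test)
    # C tuned by 5-fold cross-validation on the training set, as in the paper
    clf = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    clf.fit(np.vstack(Xtr), np.array(ytr))
    return clf.score(np.vstack(Xte), np.array(yte))    # dataset recognition rate
```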

Page 18:

Some experimental results

[Figure residue from "A Deeper Look at Dataset Bias", p. 11: per-class recognition-rate bar charts (BOWsift vs. DeCAF7) for the train-test pairs C256-C256, C256-IMG and C256-SUN over ~40 shared object classes, and confusion matrices (assigned vs. correct) for the classes people, umbrella and laptop.]

Table 3 Recognition rate per class from the multiclass cross-dataset generalization test. C256, IMG and SUN stand respectively for the Caltech256, Imagenet and SUN datasets. We indicate with "train-test" the pair of datasets used in training and testing.

Table 4 Left: Imagenet images annotated as Caltech256 data with BOWsift but correctly recognized with DeCAF7. Right: Caltech256 images annotated as Imagenet by BOWsift but correctly recognized with DeCAF7.

Since all the datasets contain the same object classes, we are in fact reproducing a setup generally adopted for domain adaptation [16,12]. By simplifying the dataset bias problem and identifying each dataset with a domain, we can interpret the results of this experiment as an indication of the domain divergence [2] and deduce that a model trained on SUN will perform poorly on the object-centric collections, and vice versa. On the other hand, a better cross-dataset generalization should be observed among Imagenet, Caltech256 and Bing. We verify this in the following sections.

5.2 Cross-dataset generalization test.

We consider the same setup used before, with 15 samples per class from each collection in training and 5 samples per class in test. However, now we train a one-vs-all

Page 19:

Some experimental results

[Table residue from "A Deeper Look at Dataset Bias", p. 6: the train × test AP matrices for the categories car and cow (for BOWsift, DeCAF6 and DeCAF7, each with its % Drop and CD columns, plus a cow-with-fixed-negatives variant) are not reliably recoverable from the extraction.]

Table 1 Binary cross-dataset generalization for two example categories, car and cow. Each matrix contains the object classification performance (AP) when training on one dataset (rows) and testing on another (columns). The diagonal elements correspond to the self results, i.e. training and testing on the same dataset. The percentage difference between the self results and the average of the other results per row corresponds to the value indicated in the column "% Drop". CD is our newly proposed cross-dataset measure. We report in bold the values higher than 0.5. P, S, E, M stand respectively for the datasets Pascal VOC07, SUN, ETH80, MSRCORID.

large number of training samples. From the confusion matrices (middle and right in Figure 1) we see that it is easy to distinguish the ETH80, Office and RGB-D datasets from all the others regardless of the representation used, given the specific lab nature of these collections. DeCAF captures the characteristics of A-Yahoo, MSRCORID, Pascal VOC07 and SUN better than BOWsift, improving the recognition results on them. Finally, Bing, Caltech256 and Imagenet are the datasets with the highest confusion level, an effect mainly due to the large number of classes and images per class. Still, this confusion decreases when using DeCAF.

The information obtained in this way over the whole collections provides us with only a small part of the full picture of the bias problem. The dataset recognition performance does not give insight into how the classes in each collection are related to each other, nor how a model defined for a class in one dataset will generalize to the others. We look into this problem in the following paragraph.

4.2 Cross-dataset generalization test

With a procedure similar to that in [39], we perform a cross-dataset generalization experiment over two specific object classes shared among multiple datasets: car and cow. For the class car, we randomly selected from four collections of the sparse set two groups of 50 positive and 1000 negative examples, respectively for training and

Page 20:

Undoing dataset bias

are kept separated as belonging to different tasks, and a max-margin model is learned from the information shared over all of them.

4.3.1 Method.

Let us assume to have $i = 1, \ldots, n$ datasets $D_1, \ldots, D_n$, each consisting of $s_i$ training samples $D_i = \{(x_1^i, y_1^i), \ldots, (x_{s_i}^i, y_{s_i}^i)\}$. Here $x_j^i \in \mathbb{R}^m$ represents the $m$-dimensional feature vector and $y_j^i \in \{-1, 1\}$ represents the label for example $j$ of dataset $D_i$. Specifically, all the datasets share an object class whose images are annotated with label 1, while all the other samples in the different datasets have label -1. The Unbias algorithm [25] consists in learning a binary model per dataset, $w_i = w_{vw} + \Delta_i$, where $w_{vw}$ is a model for the visual world, while $\Delta_i$ is the bias for each dataset. These two parts are obtained by solving the following optimization problem:

$$\min_{w_{vw}, \Delta_i, \xi, \rho} \;\; \frac{1}{2}\|w_{vw}\|^2 + \frac{\lambda}{2}\sum_{i=1}^{n}\|\Delta_i\|^2 + C_1 \sum_{i=1}^{n}\sum_{j=1}^{s_i} \xi_j^i + C_2 \sum_{i=1}^{n}\sum_{j=1}^{s_i} \rho_j^i \quad (1)$$

subject to

$$w_i = w_{vw} + \Delta_i \quad (2)$$
$$y_j^i\, w_{vw} \cdot x_j^i \ge 1 - \rho_j^i \quad (3)$$
$$y_j^i\, w_i \cdot x_j^i \ge 1 - \xi_j^i \quad (4)$$
$$\rho_j^i \ge 0, \quad \xi_j^i \ge 0 \quad (5)$$
$$i = 1, \ldots, n \qquad j = 1, \ldots, s_i \,. \quad (6)$$
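For intuition, here is a minimal NumPy sketch that minimizes an unconstrained hinge-loss version of objective (1)-(6) by subgradient descent; this is our illustrative reformulation (slacks replaced by hinge losses), not the authors' max-margin solver:

```python
import numpy as np

def unbias_fit(datasets, lam=1.0, C1=1.0, C2=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on a hinge-loss version of the Unbias objective.

    datasets: list of (X_i, y_i) pairs, X_i of shape (s_i, m), y_i in {-1, +1}.
    Returns the visual-world model w_vw and the per-dataset biases Delta_i.
    """
    m = datasets[0][0].shape[1]
    w_vw = np.zeros(m)
    deltas = [np.zeros(m) for _ in datasets]
    for _ in range(epochs):
        g_vw = w_vw.copy()               # gradient of (1/2) ||w_vw||^2
        g_d = [lam * d for d in deltas]  # gradient of (lam/2) sum ||Delta_i||^2
        for i, (X, y) in enumerate(datasets):
            w_i = w_vw + deltas[i]
            viol_i = y * (X @ w_i) < 1    # active hinges for the dataset model
            viol_vw = y * (X @ w_vw) < 1  # active hinges for the visual world
            g_hinge_i = -C1 * (y[viol_i][:, None] * X[viol_i]).sum(axis=0)
            g_hinge_vw = -C2 * (y[viol_vw][:, None] * X[viol_vw]).sum(axis=0)
            g_vw += g_hinge_i + g_hinge_vw  # w_vw appears in both constraints
            g_d[i] += g_hinge_i             # Delta_i only in the w_i constraint
        w_vw -= lr * g_vw
        deltas = [d - lr * g for d, g in zip(deltas, g_d)]
    return w_vw, deltas
```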

[Figure residue: bar charts of the % AP difference against the baseline (BOWsift vs. DeCAF7) for the panels "Car - 3 datasets", "Dog", "Cow", "Car - 5 datasets" and "Chair", plus a side table of Self / Other AP for Cow: M 98.97 / 86.56, P 53.67 / 34.29, S 40.89 / 29.20, A 44.76 / 25.94, E 99.40 / 86.56.]

Fig. 2 Percentage difference in average precision between the results of Unbias and the baseline All over each target dataset. P, S, E, M, A, C1, C2, OF stand respectively for the datasets Pascal VOC07, SUN, ETH80, MSRCORID, AwA, Caltech101, Caltech256 and Office. With O we indicate the overall value, i.e. the average of the percentage difference over all the considered datasets (shown in black).

Page 21:

DA methods with DeCAF7

[Figure residue from p. 16: recognition rate vs. number of training samples per class for the 40-class Bing-Caltech256 and Bing-SUN experiments (curves: Bing-Caltech, Caltech-Caltech, landmark, SA, reshape+SA, DAM, reshape+DAM, self-labelling), with small inset plots of the recognition rate over 10 self-labelling steps for DeCAF7 and BOWsift.]

Fig. 4 Results of the Bing-Caltech256 and Bing-SUN experiments with DeCAF7. We report the performance of different domain adaptation methods (big plots) together with the recognition rate obtained in 10 subsequent steps of the self-labelling procedure (small plots). For the latter we show the performance obtained both with DeCAF7 and with BOWsift when having originally 10 samples per class from Bing.

In Figure 4 we present the results obtained using DeCAF7 with the landmark method, and the reshape approach combined respectively with SA and the Domain Adaptation Machine (DAM [8]). In more detail, we use the reshape method to divide the Bing images into two subgroups and apply SA to adapt each of them to the target. We identify the best subgroup as the one with the highest target performance and report its results. With DAM, the subgroups are instead considered as two source domains: we assign different weights to their relative importance and also show in this case the best obtained performance (footnote 4). As a reference we also present the performance of the SA and DAM methods without reshaping. Finally, we test a simple self-labelling strategy that was already used in [38]. Differently from the previously described techniques, this method learns a model over the full set of Bing data and progressively selects target samples to augment the training set.

The obtained results go in the same direction as what was observed previously with the Unbias method. Despite the presence of noisy data, subselecting samples (with landmark) or grouping them (reshape+SA, reshape+DAM) does not seem to work better than just using all the source data at once. On the other hand, self-labelling [38] consistently improves the original results, with a significant gain in performance especially when only a reduced set of training images per class is available. One well-known drawback of this strategy is that subsequent errors in the target annotations may lead to significant drift from the correct solution. However, when working with DeCAF features this risk appears highly reduced: this can be appreciated by looking at the recognition rate obtained over ten iterations of the target selection procedure, considering in particular the comparison against the corresponding performance obtained when using BOWsift (see the small plots in Figure 4).
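As a rough illustration of the self-labelling loop just described (train, pseudo-label the most confident target samples, retrain), here is a sketch; the confidence heuristic and all names are ours, not the exact procedure of [38]:

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_labelling(X_src, y_src, X_tgt, steps=10, per_step=50):
    """Train on the labelled pool, then repeatedly absorb the target samples
    the classifier is most confident about, with their predicted labels.
    Assumes a multi-class problem (decision_function returns one margin per class).
    """
    X_pool, y_pool = X_src, np.asarray(y_src)
    remaining = X_tgt
    clf = LinearSVC(C=1.0).fit(X_pool, y_pool)
    for _ in range(steps):
        if len(remaining) == 0:
            break
        margins = clf.decision_function(remaining)   # (n, n_classes)
        conf = margins.max(axis=1)                   # confidence of best class
        pick = np.argsort(conf)[-per_step:]          # most confident targets
        X_pool = np.vstack([X_pool, remaining[pick]])
        y_pool = np.concatenate([y_pool, clf.predict(remaining[pick])])
        remaining = np.delete(remaining, pick, axis=0)
        clf = LinearSVC(C=1.0).fit(X_pool, y_pool)   # retrain on augmented pool
    return clf
```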

6 Conclusions

In this paper we attempted to position the dataset bias problem in the CNN-based-features arena with an extensive experimental evaluation. At the same time, we pushed the envelope in terms of the scale and complexity of the evaluation protocol.

4. More details on the experimental setup can be found in the supplementary material.

Page 22:

Conclusions

• CNN features (DeCAF7) are more powerful, but this by itself doesn't suffice to avoid domain shift
• Sometimes too specific, with even worse performance (both class- and dataset-dependent)
• The negative bias problem remains
• Standard DA methods do not always perform well on CNN features - more difficult to generalize?
• Best results with self-labeling on target data

Page 23:

Adapting R-CNN detector from Pascal VOC to Microsoft COCO

RCNN Detector

Figure: Object detection system overview using RCNN. RCNN (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs.



[Figure residue: six image samples, (a)-(c) from PASCAL and (d)-(f) from COCO.]

Figure 1: Image samples taken from the COCO validation set and PASCAL VOC 2012 to show the domain shift.

5.1 Experimental Setup

The PASCAL dataset is simpler than COCO and also smaller; hence it has been considered as the source dataset and COCO as the target dataset. For target-dataset (COCO) subspace generation, the examples having a classifier score greater than a threshold of 0.4 are considered positive examples for the target subspace. Non-maximum suppression is removed from the process for consistency in the number of samples for subspace generation. For the source dataset (PASCAL), we evaluate the overlap of each object proposal region with the ground-truth bounding box and consider the proposals with an overlap greater than 0.7 with the ground-truth bounding box as candidates for our source subspace generation. The dimensionality of the subspace for both the source and target datasets is kept fixed at 100. Subspace alignment is done between the source and target subspaces, and the source data are projected onto the target-aligned source subspace for further training. Once the new detectors are trained on the transformed features, the same procedure as RCNN is applied for detection on the projected target image features.
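The subspace alignment step mentioned above follows the standard construction of Fernando et al. (align the source PCA basis to the target one via M = P_s^T P_t); a minimal sketch, with illustrative names and d = 100 as in the setup:

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(X_src, X_tgt, d=100):
    """Learn d-dimensional PCA subspaces P_s and P_t, align the source basis
    with M = P_s^T P_t, and project both domains accordingly."""
    Ps = PCA(n_components=d).fit(X_src).components_.T   # (m, d) source basis
    Pt = PCA(n_components=d).fit(X_tgt).components_.T   # (m, d) target basis
    M = Ps.T @ Pt                                       # (d, d) alignment matrix
    Xs_aligned = X_src @ Ps @ M   # source in the target-aligned subspace
    Xt_proj = X_tgt @ Pt          # target in its own subspace
    return Xs_aligned, Xt_proj
```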

5.2 Results

In this section we provide evidence of the performance of our method, compare our results to other baselines, and analyse the results. First we consider the statistical difference between the PASCAL VOC 2012 and COCO validation set data. We run the RCNN detector on both source and target data and plot the histograms of the scores obtained for both datasets. It can be observed from the histograms in Figure 2 that there exists a statistical dissimilarity between these datasets; therefore, there is a need for domain adaptation. The histogram evaluation has been done in two settings: first category-wise, and second on the full dataset jointly. The findings of both settings are similar and demonstrate the statistical dissimilarity between these two datasets.

Figure 2: Histograms of detector scores, the first for the COCO dataset and the second for PASCAL VOC. The score is taken along the x-axis and the number of object regions with that score along the y-axis.
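The histogram check is simple to reproduce; a sketch with matplotlib, assuming 1-D arrays of detector scores (names are illustrative):

```python
import matplotlib.pyplot as plt

def plot_score_histograms(scores_coco, scores_pascal, bins=50):
    """Compare detector-score distributions on the two datasets, as in Fig. 2."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    ax1.hist(scores_coco, bins=bins)
    ax1.set_title("COCO")
    ax2.hist(scores_pascal, bins=bins)
    ax2.set_title("PASCAL VOC")
    for ax in (ax1, ax2):
        ax.set_xlabel("detector score")
        ax.set_ylabel("# object regions")
    plt.tight_layout()
    plt.show()
```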

Now we evaluate our method on these two datasets. The first baseline to compare our method with is a simple RCNN detector trained on PASCAL VOC 2012. We evaluate the performance of the RCNN detector for all 20 categories present in the PASCAL dataset.

Page 24:

• Initialize the detection on the target dataset
• Avoid non-maximum suppression while learning the target subspace
• Discard noisy detections during initialization
• The initial RCNN is trained on Pascal VOC 2012
• We generate class-specific target subspaces and apply the subspace alignment approach to each class separately (see the sketch below)
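Building on the subspace_alignment sketch above, the class-specific procedure in this list might look like the following; the data layout (dicts keyed by class) and the helper itself are hypothetical, while the 0.4 score threshold and d = 100 come from the setup:

```python
def adapt_per_class(src_feats, tgt_feats, tgt_scores, classes, thr=0.4, d=100):
    """For each class, build a target subspace from confident detections
    (score > thr, NMS disabled) and align the source subspace to it."""
    aligned = {}
    for c in classes:
        X_src = src_feats[c]                       # proposals with IoU > 0.7
        X_tgt = tgt_feats[c][tgt_scores[c] > thr]  # confident target detections
        aligned[c] = subspace_alignment(X_src, X_tgt, d=d)
    return aligned
```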

Page 25:

Adapting RCNN Detector: Results

No.  Class     RCNN (No Transform)  RCNN (Full Transform)  Proposed  DPM
1    plane     36.72                35.44                  40.1      35.1
2    bicycle   21.26                18.95                  23.28     1.9
3    bird      12.50                12.37                  13.63     3.7
4    boat      10.45                8.8                    10.61     2.3
5    bottle    8.75                 11.46                  8.11      7
6    bus       37.47                38.12                  40.64     45.4
7    car       20.6                 20.4                   22.5      18.3
8    cat       42.4                 43.6                   45.6      8.6
9    chair     9.6                  6.3                    8.8       6.3
10   cow       23.28                20.40                  25.3      17
11   table     15.9                 14.9                   17.3      4.8
12   dog       28.42                32.72                  31.3      5.8
13   horse     30.7                 31.11                  32.9      35.3


Page 26:

Adapting RCNN Detector: Results (continued)

No.  Class      RCNN (No Transform)  RCNN (Full Transform)  Proposed  DPM
14   motorbike  31.2                 29.05                  34.6      25.4
15   person     27.8                 28.8                   30.9      17.5
16   plant      12.65                7.34                   13.7      4.1
17   sheep      19.99                21.04                  22.4      14.5
18   sofa       14.6                 8.4                    15.5      9.6
19   train      39.2                 38.4                   41.64     31.7
20   tv         28.6                 26.4                   29.9      27.9

Mean AP         23.60                22.7                   25.43     16.9

Table: Domain adaptation detection results on the validation set of the COCO dataset.


Page 27:

Questions?