

Joint cross-domain classification and subspace learning for unsupervised adaptation

Basura Fernando¹, Tatiana Tommasi¹, and Tinne Tuytelaars¹

¹ESAT-PSI/VISICS - iMinds, KU Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium

Abstract

Domain adaptation aims at adapting the knowledge acquired on a source domain to a new different but related target domain. Several approaches have been proposed for classification tasks in the unsupervised scenario, where no labeled target data are available. Most of the attention has been dedicated to searching a new domain-invariant representation, leaving the definition of the prediction function to a second stage. Here we propose to learn both jointly. Specifically we learn the source subspace that best matches the target subspace while at the same time minimizing a regularized misclassification loss. We provide an alternating optimization technique based on stochastic sub-gradient descent to solve the learning problem and we demonstrate its performance on several domain adaptation tasks.

¹ Paper is under consideration at Pattern Recognition Letters.

1 Introduction

In real world applications, having a probability distribution mismatch between the training and the test data is more often the rule than an exception. Think about part of speech tagging across different text corpora [5], localization over time with wifi signal distributions that get easily outdated [42], or biological models to be used across different subjects [38]. Computer vision methods are also particularly challenged in this respect: real world conditions may alter the image statistics in many complex ways (lighting, pose, background, motion blur etc.), not to mention the difference in quality of the acquisition device (e.g. resolution), or the high number of possible artificial modifications obtained by post-processing (e.g. filtering). Due to this large variability, any learning algorithm trained on a source set regardless of the final target data will most likely produce poor, unsatisfactory results.

Domain adaptation techniques propose to overcome these issues and make use of information coming from both source and target domains during the learning process. In the unsupervised case, where no labeled samples are provided for the target, the most extensively studied paradigm consists in assuming the existence of a domain-invariant feature space and searching for it. In general all the techniques based on this idea focus on transforming the representation of the source and target samples to maximize some notion of similarity between them [14, 15, 12]. However in this way the classification task is left aside and the prediction model is learned only in a second stage. As thoroughly discussed in [2, 29], the choice of the feature representation able to reduce the domain divergence is indeed a crucial factor for the adaptation. Nevertheless it is not the only one. If several representations induce similar marginal distributions for the two domains, would a classifier perform equally well in all of them? Is it enough to encode the labeling information in the used feature space or is it better to learn a cross-domain classification model together with the optimal domain-invariant representation? Here we answer these questions by focusing on unsupervised domain adaptation subspace solutions. We present an algorithm that learns jointly both a low dimensional representation and a reliable classifier by optimizing a trade-off between the source-target similarity and the source training error.

arXiv:1411.4491v3 [cs.CV] 29 Apr 2015

2 Related Work

For classification tasks, the goal of domain adaptation is to learn a function from the source domain that predicts the class label of a novel test sample from the target domain [33]. In the literature there are two main scenarios depending on the availability of data annotations: the semi-supervised and the unsupervised setting.

In the semi-supervised setting a few labeled samples are provided for the target domain besides a large amount of annotated source data. Existing solutions can be divided into classifier-based and representation-based methods. The former modify the original formulation of Support Vector Machines (SVM) [41, 10] and other statistical classifiers [6]: they adapt a pre-trained model to the target, or learn the source and target classifiers simultaneously. The latter exploit the correspondence between source and target labeled data to impose constraints over the samples through metric learning [25, 34], or consider feature augmentation strategies [23] and manifold alignment [40]. Some approaches have also tackled the cases with more than two available domains [9, 19] and the unlabeled part of the target has been used for co-regularization [26]. Recently, two methods proposed to combine classifier-based and representation-based solutions. [20] introduced an approach to learn jointly a cross-domain classifier and a transformation that maps the target points into the source domain. Several kernel maps are used to encode the representation in [8], which proposed a domain transfer multiple kernel learning algorithm.

In the more challenging unsupervised setting, all the available target samples are unlabeled. Many unsupervised domain adaptive approaches resort to estimating the data distributions and minimizing a distance measure between them, while re-weighting/selecting the samples [38, 13]. The Maximum Mean Discrepancy (MMD) [16] maps two sets of data to a Reproducing Kernel Hilbert Space and it has been largely used as a distance measure between two domain distributions. Although MMD is endowed with nice properties, the choice of the kernel and of its parameters is critical and, if non-optimal, can lead to a very poor estimate of the distribution distance [17]. Dictionary learning methods have also been used with the goal of defining new representations that overcome the domain shift [30, 35]. A reconstruction approach was proposed in [24]: the source samples are mapped into an intermediate space where each of them can be represented as a linear combination of the target domain samples.

Another promising direction for unsupervised domain adaptation is that of subspace modeling. This is based on the idea that source and target share a latent subspace where the domain shift is removed or reduced. As for dictionary learning, the approaches presented in this framework are mostly linear, but can be easily extended to non-linear spaces through explicit feature mappings [39]. In [4] Canonical Correlation Analysis (CCA) has been applied to find a coupled domain-invariant subspace. Principal Component Analysis (PCA) and other eigenvalue methods are also widely used for subspace generation. For instance, Transfer Component Analysis (TCA, [32]) is a dimensionality reduction approach that searches a latent space where the variance of the data is preserved as much as possible and the distance between the distributions is reduced. Transfer Subspace Learning (TSL, [37]) couples PCA and other subspace learning methods with a Bregman divergence-based regularization which measures the distance between the distributions in the projected space. Alternatively, the algorithm introduced in [1] uses MMD in the subspace to search for a domain invariant projection matrix. Other methods exploited multiple intermediate subspaces to link the source and the target data. This idea was introduced in [15] where the path across the domains is defined as a geodesic curve over a Grassmann manifold. This strategy has been further extended in [14] where all the intermediate subspaces are integrated to define a cross-domain similarity measure. Despite the intuitive characterization of the problem, it is not clear why all the subspaces along this path should yield meaningful representations. Recently the Subspace Alignment (SA) method [12] demonstrated that it is possible to map directly the source to the target subspace without necessarily passing through intermediate steps.

Overall, the main focus of the unsupervised methods proposed in the literature is on the domain invariance of the final data representation and less attention has been dedicated to its discriminative power. First attempts in this direction have been made in [14] by substituting the use of PCA over the source subspace with Partial Least Squares (PLS), and in [32] where SSTCA chooses the representation by maximizing its dependence on the data labels. Our work fits in this context. We aim at extending the integration of classifier-based with representation-based solutions in the unsupervised setting where no access to the target labels is available, not even for hyperparameter cross validation. Differently from all the described unsupervised approaches we go beyond searching only a domain invariant feature space and we want to optimize also a cross-domain classification model. We propose an algorithm that combines effectively subspace and max-margin learning and exploits the source discriminative information better than just encoding it in the representation. Our approach does not need an estimate of the source and target data distributions and relies on a simple measure of domain shift. Finally, in previous work the performance of the adaptive methods has often been evaluated by tuning the model parameters on the target data [12, 24] or by fixing them to default values [21]. Here we choose a fairer setup for unsupervised domain adaptation and we show that our approach outperforms different existing subspace adaptive methods by exploiting exclusively the source annotations. We name our algorithm Joint cross-domain Classification and Subspace Learning (JCSL).

In the following sections we define the notation that will be used in the rest of the paper (section 3) and we briefly review the theory of learning from different domains together with the subspace domain shift measure used in [12] from which we took inspiration. We then introduce our approach (section 4) followed by an extensive experimental analysis that shows its effectiveness on several domain adaptation tasks (section 5). We conclude with a final discussion and sketch possible directions for future research (section 6).

3 Problem Setup and Background

Let us consider a classification problem where the data instances are in the form $(x_i, y_i)$. Here $x_i \in \mathbb{R}^D$ is the feature vector for the i-th sample and $y_i \in \{1, \dots, K\}$ is the corresponding label. We assume that $n_s$ labeled training samples are drawn from a source distribution $\mathcal{D}_s = P(x^s, y^s)$, while a set of $n_t$ unlabeled test samples come from a different target distribution $\mathcal{D}_t = P(x^t, y^t)$, such that it holds $\mathcal{D}_s \neq \mathcal{D}_t$. In particular, the source and the target distributions satisfy the covariate shift property [36] if they have the same labeling function with $P(y^s|x^s) = P(y^t|x^t)$, while the marginal distributions differ $P(x^s) \neq P(x^t)$. We operate under this hypothesis.


A bound on the target domain error. Theoretical studies on domain adaptation have established the conditions under which a classifier trained on the source data can be expected to perform well on the target data. The following generalization bound on the target error $\epsilon_t$ has been demonstrated in [2]:

$$\epsilon_t(h) \le \epsilon_s(h) + d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \lambda . \tag{1}$$

Here $h$ indicates the predictor function, while $\mathcal{H}$ is the hypothesis class from which the predictor has been chosen. In words, the bound states that a low target error can be guaranteed if the source error $\epsilon_s(h)$, a measure of the domain distribution divergence $d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t)$, and the error $\lambda$ of the ideal joint hypothesis on the two domains are small. The joint error can be written as $\lambda = \epsilon_t(h^*) + \epsilon_s(h^*)$ where $h^* = \arg\min_{h \in \mathcal{H}} (\epsilon_t(h) + \epsilon_s(h))$. The value of $\lambda$ is supposed to be low under the covariate shift assumption.

A subspace measure of domain shift. The low-dimensional intrinsic structure of the source and target domains can be specified by their corresponding orthonormal basis sets, indicated respectively as $S \in \mathbb{R}^{D \times d}$ and $T \in \mathbb{R}^{D \times d}$. These are two full rank matrices, and $d$ is the subspace dimensionality. In [12], a transformation matrix $M$ is introduced to modify the source subspace. The domain shift of the transformed source basis with respect to the target is simply measured by the following function:

$$F(M) = \|SM - T\|_F^2 , \tag{2}$$

where $\|\cdot\|_F$ is the Frobenius norm. The Subspace Alignment (SA) method proposed to minimize this measure, obtaining the optimal transformation matrix in closed form: $M = S^\top T \in \mathbb{R}^{d \times d}$. The matrix $U = SM = SS^\top T \in \mathbb{R}^{D \times d}$ is finally used to represent the source data. The original domain basis sets can be obtained through different strategies, both unsupervised (PCA) and supervised (PLS, LDA), as extensively studied in [11].

SA has shown promising results for visual cross-domain classification tasks outperforming other subspace adaptive methods. However, on par with its competitors [15, 14], it keeps the domain adaptive process (learning $M$) and the classification process (e.g. learning an SVM model) separated, focusing only on the distribution divergence term $d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t)$ of the bound in (1).
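The closed-form alignment reviewed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names (`pca_basis`, `subspace_shift`, `subspace_alignment`) and the random toy data are ours, and PCA is used for both bases although the text notes that supervised choices (PLS, LDA) are also possible.

```python
import numpy as np

def pca_basis(X, d):
    """Return a D x d orthonormal basis: the top-d principal directions of X (n x D)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered data are the PCA directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def subspace_shift(S, T, M):
    """F(M) = ||S M - T||_F^2, the domain shift measure of Eq. (2)."""
    return np.linalg.norm(S @ M - T, ord="fro") ** 2

def subspace_alignment(S, T):
    """Closed-form solution M = S^T T and aligned source basis U = S M."""
    M = S.T @ T
    return M, S @ M

# Toy data (an assumption for illustration only): the target rescales the features.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 20))
Xt = rng.normal(size=(150, 20)) * np.linspace(1.0, 3.0, 20)

S, T = pca_basis(Xs, 5), pca_basis(Xt, 5)
M, U = subspace_alignment(S, T)
print("shift without alignment:", subspace_shift(S, T, np.eye(5)))
print("shift after alignment:  ", subspace_shift(S, T, M))
```

Because $S$ has orthonormal columns, $M = S^\top T$ is the least-squares minimizer of (2), so the second printed value can never exceed the first.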

4 Proposed Approach

With the aim of minimizing both the domain divergence and the source error in (1), we propose an algorithm that learns a domain-invariant representation and an optimal cross-domain classification model. For the representation we concentrate on subspace methods and we take inspiration from the SA approach. For the classification we rely on a standard max-margin formulation. The details of our Joint cross-domain Classification and Subspace Learning (JCSL) algorithm are described below.

Given a fixed target subspace basis $T \in \mathbb{R}^{D \times d}$ we minimize the following regularized risk functional

$$G(V, \mathbf{w}) = \|\mathbf{w}\|_2^2 + \alpha \|V - T\|_F^2 + \beta \sum_{i=1}^{n_s} L(x_i^s, y_i^s, \mathbf{w}, V) . \tag{3}$$

Here the regularization terms aim at optimizing separately the linear source classification model $\mathbf{w} \in \mathbb{R}^d$, and the source representation matrix $V \in \mathbb{R}^{D \times d}$, while the loss function $L$ depends on their combination. For our analysis we choose the hinge loss: $L(x_i^s, y_i^s, \mathbf{w}, V) = \max\{0, 1 - x_i^{s\top} V \mathbf{w} \, y_i^s\}$, but other loss functions can be used for different cross-domain applications. The parameters $\alpha$ and $\beta$ define a trade-off between the importance of the terms in the objective function. In particular a high $\alpha$ value pushes $V$ towards $T$ giving more importance to the distribution divergence term, while a high $\beta$ value focuses the attention on the training error term to improve the classification performance in the new space.
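As a minimal sketch, the functional in (3) with the hinge loss could be evaluated as follows. The function names are hypothetical, and the labels are assumed to be encoded as ±1 (the binary case that a one-vs-all scheme would reduce to), which is our assumption rather than something fixed by the notation above.

```python
import numpy as np

def hinge_losses(X, y, V, w):
    """Per-sample hinge loss max(0, 1 - y_i * x_i^T V w), labels y_i in {-1, +1}."""
    margins = y * (X @ V @ w)
    return np.maximum(0.0, 1.0 - margins)

def jcsl_objective(X, y, V, w, T, alpha, beta):
    """G(V, w) = ||w||_2^2 + alpha * ||V - T||_F^2 + beta * sum_i L_i, as in Eq. (3)."""
    reg_w = w @ w
    reg_V = alpha * np.sum((V - T) ** 2)
    return reg_w + reg_V + beta * hinge_losses(X, y, V, w).sum()

# Tiny hand-made example (illustrative values only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
T = np.eye(2)
print(jcsl_objective(X, y, V=T, w=np.zeros(2), T=T, alpha=1.0, beta=0.5))
```

With $\mathbf{w} = 0$ and $V = T$ both regularizers vanish and every hinge term equals 1, so the value reduces to $\beta n_s$.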

The matrix $V$ has a role analogous to that of $U$ in SA, however in our case it is not necessary to specify a priori the source subspace $S$ which is now optimized together with the alignment transformation matrix $M$ in a single step. Note that, if the source and target data can be considered as belonging to the same domain (no domain shift), our method will automatically provide $V = T$ boiling down to standard learning in the shared subspace. We follow previous literature and propose the use of PCA to define the target subspace $T$ [12, 14]. Besides having demonstrated good results empirically, the theoretical meaning of this choice can be identified by writing the mutual information between the target and the source as

$$MI(\text{source}; \text{target}) = H(\text{target}) - KL(\text{source} \,\|\, \text{target}) . \tag{4}$$

Projecting the target data to the subspace $T$ maximizes the entropy $H(\text{target})$, while our objective function minimizes the domain shift, which is related to the Kullback-Leibler divergence $KL(\cdot \,\|\, \cdot)$. Hence, we expect to increase the mutual information between source and target.

Minimizing (3) jointly over $(V, \mathbf{w})$ is a non-convex problem and finding a global optimum is generally intractable. However we can apply alternating minimization for $V$ and $\mathbf{w}$, resulting in two interconnected convex problems that can be efficiently solved by stochastic subgradient descent. For this procedure we need the partial derivatives of (3), which can be easily calculated as:

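$$\frac{\partial G(V, \mathbf{w})}{\partial V} = 2\alpha (V - T) - \beta \sum_{i=1}^{n_s} \Gamma_i , \qquad \frac{\partial G(V, \mathbf{w})}{\partial \mathbf{w}} = 2\mathbf{w} - \beta \sum_{i=1}^{n_s} \Theta_i , \tag{5}$$

where $\Gamma_i$ and $\Theta_i$ are the derivatives of $L(x_i^s, y_i^s, \mathbf{w}, V)$ with respect to $V$ and $\mathbf{w}$. When using the hinge loss we get

$$\Gamma_i = \begin{cases} x_i^{s\top} \mathbf{w} \, y_i^s & \text{if } (x_i^{s\top} V \mathbf{w} \, y_i^s) < 1 \\ 0 & \text{otherwise} \end{cases} \qquad \Theta_i = \begin{cases} x_i^{s\top} V \, y_i^s & \text{if } (x_i^{s\top} V \mathbf{w} \, y_i^s) < 1 \\ 0 & \text{otherwise .} \end{cases}$$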
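The sub-gradients in (5) could be computed as in the sketch below. Two assumptions are ours: labels are encoded as ±1, and $\Gamma_i$ and $\Theta_i$ are arranged in the shapes of $V$ and $\mathbf{w}$, i.e. $\Gamma_i = y_i^s x_i^s \mathbf{w}^\top \in \mathbb{R}^{D \times d}$ and $\Theta_i = y_i^s V^\top x_i^s \in \mathbb{R}^d$, which is our reading of the expressions above. The function name is hypothetical.

```python
import numpy as np

def jcsl_subgradients(X, y, V, w, T, alpha, beta):
    """Sub-gradients of G w.r.t. V (D x d) and w (d,) for the hinge loss, y in {-1, +1}."""
    active = y * (X @ V @ w) < 1.0            # samples with non-zero hinge loss
    # sum_i y_i x_i over the active samples, shape (D,); zeros if none is active
    g = (y[active][:, None] * X[active]).sum(axis=0)
    gamma_sum = np.outer(g, w)                # sum_i Gamma_i = sum_i y_i x_i w^T
    theta_sum = V.T @ g                       # sum_i Theta_i = sum_i y_i V^T x_i
    grad_V = 2.0 * alpha * (V - T) - beta * gamma_sum
    grad_w = 2.0 * w - beta * theta_sum
    return grad_V, grad_w

# Shape check on random inputs (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = np.array([1.0, -1.0] * 4)
V, T, w = rng.normal(size=(4, 2)), rng.normal(size=(4, 2)), rng.normal(size=2)
grad_V, grad_w = jcsl_subgradients(X, y, V, w, T, alpha=1.0, beta=1.0)
print(grad_V.shape, grad_w.shape)
```

Away from the kink of the hinge loss (margin exactly 1) these expressions coincide with the ordinary gradient, so a finite-difference comparison against the objective is a convenient sanity check.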

The iterative subgradient descent procedure terminates when the algorithm converges, showing a negligible change of either $V$ or $\mathbf{w}$ between two consecutive iterations. The formulation holds for a binary classifier but can easily be used in its one-vs-all multiclass extension that highly benefits from the choice of the stochastic variant of the optimization process.

At test time, we indicate the classification score of class $y$ for the target sample $x_i^t$ as $s(x_i^t, \mathbf{w}_y) = x_i^{t\top} T \mathbf{w}_y$. The multiclass final prediction is then obtained by maximizing over the scores: $y_i^* = \arg\max_y \, s(x_i^t, \mathbf{w}_y)$. Note that the source representation matrix $V$ is not involved at this stage, and the target subspace basis $T$ appears instead. Differently from the pre-existing unsupervised domain adaptation methods that encode the discriminative information in the representation, JCSL learns directly a domain invariant classification model able to generalize from source to target. The JCSL learning strategy is summarized in Algorithm 1.
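The test-time rule above amounts to one matrix product followed by an argmax. The sketch below assumes the $K$ one-vs-all weight vectors $\mathbf{w}_y$ are stacked as the columns of a $d \times K$ matrix `W`; that layout and the function name are our choices for illustration.

```python
import numpy as np

def predict(X_target, T, W):
    """Return argmax_y x^T T w_y for each row of X_target.

    X_target: (n_t, D) target samples, T: (D, d) target basis,
    W: (d, K) matrix whose column y holds the classifier w_y.
    """
    scores = X_target @ T @ W     # (n_t, K) matrix of scores s(x_i^t, w_y)
    return scores.argmax(axis=1)

# Tiny example with D = d = K = 2 (illustrative values only).
T = np.eye(2)
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
X_target = np.array([[2.0, 1.0],
                     [0.0, 3.0]])
print(predict(X_target, T, W))
```

Note that, as in the text, only $T$ and the learned classifiers enter the prediction; the source matrix $V$ is not needed.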

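Algorithm 1 JCSL

Input: step size $\eta$ and batch size $\gamma$ for stochastic sub-gradient descent
Output: $V^*$, $\mathbf{w}^*$

1: Initialize $V \leftarrow S$, $\mathbf{w} \leftarrow 0$, $k \leftarrow 0$
2: while not converged do
3:   $k \leftarrow k + 1$
4:   calculate the partial derivatives:
       $\frac{\partial G(V, \mathbf{w})}{\partial V} = 2\alpha (V - T) - \beta \sum_{i=1}^{\gamma} \Gamma_i$
       with $\Gamma_i = x_i^{s\top} \mathbf{w} \, y_i^s$ if $(x_i^{s\top} V \mathbf{w} \, y_i^s) < 1$, and $\Gamma_i = 0$ otherwise
       $\frac{\partial G(V, \mathbf{w})}{\partial \mathbf{w}} = 2\mathbf{w} - \beta \sum_{i=1}^{\gamma} \Theta_i$
       with $\Theta_i = x_i^{s\top} V \, y_i^s$ if $(x_i^{s\top} V \mathbf{w} \, y_i^s) < 1$, and $\Theta_i = 0$ otherwise
5:   Fix $V$, identify the optimal $\mathbf{w}$:
       $\mathbf{w}_k \leftarrow \mathbf{w}_{k-1} - \eta \left( \frac{\partial G(V, \mathbf{w})}{\partial \mathbf{w}} \right)_{\mathbf{w}_{k-1}}$
6:   Fix $\mathbf{w}$, identify the optimal $V$:
       $V_k \leftarrow V_{k-1} - \eta \left( \frac{\partial G(V, \mathbf{w})}{\partial V} \right)_{V_{k-1}}$
7: end while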
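A compact NumPy sketch in the spirit of Algorithm 1 is given below for the binary case. It is not the authors' MATLAB code. The assumptions are ours: labels in ±1, both sub-gradients evaluated on the same mini-batch at the previous iterate, the shape-consistent forms $y_i x_i \mathbf{w}^\top$ and $y_i V^\top x_i$ for $\Gamma_i$ and $\Theta_i$, and a stopping rule that requires both $V$ and $\mathbf{w}$ to stop changing (a stricter variant of the criterion in the text, chosen to avoid stopping at initialization when $S = T$). The name `jcsl_fit` and its default values are hypothetical.

```python
import numpy as np

def jcsl_fit(X, y, S, T, alpha=1.0, beta=1.0, eta=0.1, batch=10,
             max_iter=100, tol=1e-6, seed=0):
    """Alternating stochastic sub-gradient descent, binary case.

    X: (n, D) source samples, y: (n,) labels in {-1, +1},
    S, T: (D, d) source and target bases. Returns (V, w).
    """
    rng = np.random.default_rng(seed)
    V = S.astype(float).copy()        # step 1: V <- S
    w = np.zeros(S.shape[1])          # step 1: w <- 0
    for _ in range(max_iter):
        idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
        Xb, yb = X[idx], y[idx]
        # Step 4: mini-batch sub-gradients at the current (V, w).
        active = yb * (Xb @ V @ w) < 1.0
        g = (yb[active][:, None] * Xb[active]).sum(axis=0)   # sum_i y_i x_i
        grad_w = 2.0 * w - beta * (V.T @ g)
        grad_V = 2.0 * alpha * (V - T) - beta * np.outer(g, w)
        # Steps 5 and 6: one descent step on w (V fixed) and on V (w fixed).
        w_new = w - eta * grad_w
        V_new = V - eta * grad_V
        done = (np.linalg.norm(w_new - w) < tol and
                np.linalg.norm(V_new - V) < tol)
        V, w = V_new, w_new
        if done:
            break
    return V, w
```

A one-vs-all multiclass model would call this routine once per class with the labels recoded as ±1. Note that with a fixed step size the product $2\alpha\eta$ should stay below 1 for the $V$ update to contract towards $T$; a decaying step size is a common alternative.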

5 Experiments

We validate our approach over several domain adaptation tasks. In the following we first describe our experimental setting (section 5.1) and then we report on the obtained results (sections 5.2, 5.3, 5.4). Moreover we present a detailed analysis on the role of the learning parameters and on the domain-shift reduction effect of JCSL (section 5.2).


Figure 1: Top line: examples from the Office+Caltech dataset and the MNIST+USPS dataset. Bottom line: weakly labeled images from the Bing dataset.

5.1 Datasets, baselines and implementation details

We choose three image datasets (see Figure 1) and a wifi signal dataset.

Office + Caltech [14]. This dataset was created by combining the Office dataset [34] with Caltech256 [18] and it contains images of 10 object classes over four domains: Amazon, Dslr, Webcam and Caltech. Amazon consists of images from online merchants' catalogues, while the Dslr and Webcam domains are composed of high and low resolution images, respectively. Finally, Caltech corresponds to a subset of the original Caltech256. We use the features provided by Gong et al. [14]: SURF descriptors quantized into histograms of 800 bag-of-visual words and standardized by z-score normalization. All the 12 possible source-target domain pairs are considered. We use the data splits provided by Hoffman et al. [20].

MNIST [27] + USPS [22]. This dataset combines two existing image collections of digits presenting different gray scale data distributions. Specifically they share 10 classes of digits. We randomly selected 1800 images from USPS and 2000 images from MNIST. By following [28] we uniformly re-scale all images to size 16 × 16 and we use the L2-normalized gray-scale pixel values as feature vectors. Both domains are alternatively used as source and target.
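The two feature normalizations mentioned for these datasets (z-score standardization of the bag-of-words histograms and L2 normalization of the pixel vectors) can be sketched as follows. The helper names are ours, and whether the statistics are computed per domain or jointly is not specified in the text, so the sketch simply normalizes whatever matrix it is given.

```python
import numpy as np

def zscore_columns(X, eps=1e-12):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True)
    # Constant features have sigma = 0; the floor eps maps them to 0 instead of NaN.
    return (X - mu) / np.maximum(sigma, eps)

def l2_normalize_rows(X, eps=1e-12):
    """Scale each sample (row) to unit Euclidean norm; all-zero rows stay zero."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

# A 16 x 16 image flattened to a 256-dimensional vector (random stand-in for real pixels).
rng = np.random.default_rng(0)
pixels = rng.random((3, 16 * 16))
features = l2_normalize_rows(pixels)
print(np.linalg.norm(features, axis=1))
```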

Bing+Caltech [3]. In this dataset, weakly annotated images from the Bing search engine define the source domain while images of Caltech256 are used as target. We run experiments varying the number of categories (5, 10, 15, 20, 25 and 30) and the number of source examples per category (5 and 10) using the same train/test split adopted in [3]. As typically done for this dataset, Classemes features are used as image representation [3].

WiFi [42]. This dataset was used in the 2007 IEEE ICDM contest for domain adaptation. The goal is to estimate the location of mobile devices based on the received signal strength (RSS) values from different access points. The domains correspond to two different time periods during which the collected RSS values present different distributions. The dataset contains 621 labeled examples collected during time period A (source) and 3128 unlabeled examples collected during time period B (target). The location recognition performance is generally evaluated by measuring the average error distance between the predicted and the correct space position of the mobile devices. We slightly modify the task to define a classification rather than a regression problem. We consider 247 locations and we evaluate the classification performance between the sets A and B with and without domain adaptation. We repeat the experiments both testing over all the target data and considering 10 random target splits, each with 400 random samples.

Table 1: Recognition rate (%) results over the Office+Caltech and MNIST+USPS datasets.

DA Problem | NA | PCA_T | SA(LDA−PCA) | GFK(LDA−PCA) | TCA | SSTCA | TSL | JCSL
--- | --- | --- | --- | --- | --- | --- | --- | ---
A→C | 38.1 ± 2.6 | 41.1 ± 1.7 | 43.4 ± 3.2 | 43.2 ± 3.7 | 43.5 ± 3.2 | 38.8 ± 2.4 | 40.4 ± 0.9 | 42.6 ± 0.9
A→D | 32.9 ± 2.8 | 37.5 ± 1.4 | 44.7 ± 2.6 | 43.7 ± 2.8 | 38.8 ± 2.1 | 34.1 ± 6.9 | 40.8 ± 1.7 | 42.5 ± 3.2
A→W | 36.8 ± 2.9 | 39.1 ± 2.8 | 40.3 ± 2.9 | 41.3 ± 2.0 | 41.0 ± 1.0 | 34.1 ± 3.8 | 41.1 ± 2.3 | 47.6 ± 2.1
C→A | 39.5 ± 1.0 | 40.6 ± 2.8 | 39.3 ± 1.8 | 39.9 ± 1.5 | 42.5 ± 1.2 | 39.1 ± 3.5 | 43.0 ± 4.2 | 44.3 ± 1.2
C→D | 38.8 ± 2.4 | 40.9 ± 1.3 | 44.0 ± 2.0 | 42.0 ± 2.5 | 42.1 ± 2.5 | 38.3 ± 4.1 | 40.4 ± 3.4 | 46.5 ± 1.5
C→W | 37.8 ± 1.4 | 35.9 ± 3.2 | 37.3 ± 3.3 | 41.8 ± 3.8 | 39.4 ± 2.5 | 31.6 ± 4.9 | 40.9 ± 2.4 | 46.5 ± 2.0
D→A | 24.4 ± 1.7 | 30.6 ± 2.7 | 35.7 ± 2.3 | 31.0 ± 2.7 | 30.4 ± 2.1 | 38.0 ± 2.6 | 39.6 ± 1.2 | 41.3 ± 0.9
D→C | 30.5 ± 2.1 | 37.9 ± 1.3 | 41.5 ± 1.4 | 40.9 ± 2.8 | 36.7 ± 2.6 | 32.9 ± 1.8 | 33.4 ± 2.1 | 35.1 ± 0.9
D→W | 60.9 ± 3.1 | 67.9 ± 2.8 | 58.6 ± 2.1 | 60.5 ± 3.8 | 64.5 ± 3.2 | 76.2 ± 3.0 | 73.7 ± 1.5 | 74.2 ± 3.6
W→A | 29.7 ± 1.7 | 34.1 ± 1.7 | 34.7 ± 0.9 | 33.1 ± 1.2 | 34.6 ± 1.4 | 35.1 ± 2.6 | 38.0 ± 1.6 | 43.1 ± 1.0
W→C | 34.1 ± 1.6 | 38.1 ± 1.3 | 36.7 ± 1.2 | 37.7 ± 1.8 | 39.6 ± 2.2 | 29.7 ± 2.5 | 30.4 ± 1.3 | 36.1 ± 2.0
W→D | 71.1 ± 2.6 | 74.6 ± 0.9 | 69.5 ± 2.4 | 75.3 ± 2.6 | 77.3 ± 2.7 | 69.9 ± 3.4 | 66.9 ± 1.6 | 66.2 ± 2.9
AVG. | 39.6 | 43.2 | 43.8 | 44.2 | 44.2 | 41.4 | 44.0 | 47.2
MNIST→USPS | 45.4 | 45.1 | 48.6 | 34.6 | 40.8 | 40.6 | 43.5 | 46.7
USPS→MNIST | 33.3 | 33.4 | 22.2 | 22.6 | 27.4 | 22.2 | 34.1 | 35.5
AVG. | 39.4 | 39.2 | 35.4 | 28.6 | 34.1 | 31.4 | 38.8 | 41.1
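For the WiFi protocol described above, the ten random target splits of 400 samples could be drawn as in the following sketch. The function name, the fixed seed and the choice of sampling without replacement inside each split are our assumptions; the text only states the number and size of the splits.

```python
import numpy as np

def random_target_splits(n_target=3128, n_splits=10, split_size=400, seed=0):
    """Draw index sets for the random target splits (no repeated index within a split)."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n_target, size=split_size, replace=False)
            for _ in range(n_splits)]

splits = random_target_splits()
print(len(splits), splits[0][:5])
```

Each index set would then select the rows of the target matrix on which the classifier is evaluated, and the reported figure would be the mean over the ten splits.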

We benchmark JCSL¹ against the following subspace-based domain adaptation methods:

TCA, SSTCA: Transfer Component Analysis and its semi-supervised extension [32]. We implemented TCA and SSTCA by following the original paper description. For SSTCA we turned off the locality preserving option to have a fair comparison with all the other considered methods, none of which exploits local geometry².

TSL: Transfer Subspace Learning [37]. We used the code made publicly available by the authors³ which implements TSL by adding a Bregman-divergence based regularization to the Fisher's Linear Discriminant Analysis (FLDA).

GFK(LDA−PCA), SA(LDA−PCA): for both the Geodesic Flow Kernel [14] and the Subspace Alignment [12] methods we slightly modified the original implementation provided by the authors⁴ to integrate the available discriminative information in the source domain. As preliminary evaluation we compared the results of GFK and SA when the basis of the source subspace were obtained with PLS and LDA. Although performing similarly on average, PLS showed less stability than LDA with large changes in the outcome for small variations of the subspace dimensionality d. This can be explained by considering the difficulty of finding the best d that jointly maximizes the source data/label coherence and minimizes the source/target shift. Thus, for our experiments we rely on the more stable LDA for the source which fixes d equal to K − 1. On the other hand, the target subspace is always obtained by applying PCA and selecting the first K − 1 eigenvectors.

As further baselines we also consider the source classifier learned with no adaptation (NA) in the original feature space and in the target subspace. The last one is obtained by applying PCA on the target domain and using the eigenvectors as basis to represent both the source and the target data (PCA_T).

¹ We implemented our algorithm in MATLAB. The code is submitted with the paper as supplementary material.
² Following the original paper notation we fixed µ = 0.1, λ = 0, γ = 0.5.
³ http://www.cs.utexas.edu/~ssi/TrFLDA.tar.gz
⁴ http://www-scf.usc.edu/~boqinggo/domain_adaptation/GFK_v1.zip, http://homes.esat.kuleuven.be/~bfernand/DA_SA/downloadit.php?fn=DA_SA.zip

For all the methods the final classifier is a linear SVM with the C parameter tuned by two-fold cross-validation on the source over the range {0.001, 0.01, 0.1, 1.0, 10}. Our JCSL has three main parameters (α, β, d) that are also chosen by two-fold cross validation on the source. We remark that the target data are not annotated, thus tuning the parameters on the source is the only feasible option. We searched for α, β in the same range indicated before for C. The parameter d was tuned in {10, 20, . . . , 100} both for JCSL and for the baselines PCA_T, TSL, TCA and SSTCA.
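The source-only model selection described above can be sketched as a plain two-fold grid search. The routine is generic: `fit` and `score` are placeholder callables supplied by the caller (for instance a JCSL trainer and a source accuracy function), and their names and signatures are assumptions made for this illustration.

```python
import itertools
import numpy as np

def two_fold_grid_search(X, y, fit, score, grid, seed=0):
    """Pick the parameter combination with the best mean two-fold validation score.

    fit(X_train, y_train, **params) -> model; score(model, X_val, y_val) -> float.
    Only labeled source data (X, y) are used, as target labels are unavailable.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 2)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        fold_scores = []
        for k in range(2):
            train, val = folds[1 - k], folds[k]
            model = fit(X[train], y[train], **params)
            fold_scores.append(score(model, X[val], y[val]))
        mean_score = float(np.mean(fold_scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Search ranges stated in the text for alpha, beta and d.
grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0, 10],
    "beta": [0.001, 0.01, 0.1, 1.0, 10],
    "d": list(range(10, 101, 10)),
}
print(len(list(itertools.product(*grid.values()))), "parameter combinations")
```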

We implemented the stochastic sub-gradient descent using a step size of η = 0.1 and a batch size of γ = 10. The alternating optimization converges in fewer than 100 iterations and we can obtain the results for any of the source-target domain pairs of the Office+Caltech dataset (excluding feature extraction) in 2 minutes using a modern desktop computer (2.8 GHz CPU, 4 GB of RAM, 1 core). With respect to the considered competing methods, the training phase of JCSL is slower (e.g. 60 times slower with respect to SA and GFK), but we remark that JCSL provides an optimized cross-domain classifier besides reducing the data distribution shift. The test phase runtime is comparable for all the considered approaches. In practical applications domain adaptation models are usually learned offline, thus the training time is a minor issue.

5.2 Results - Office+Caltech and MNIST+USPS

The obtained results over the Office+Caltech and MNIST+USPS datasets are presented in Table 1. Overall JCSL outperforms the considered baselines in 7 source-target pairs out of 14 and shows the best average results over the two datasets. Thus, we can state that minimizing a trade-off between source-target similarity and the source classification error pays off compared to only reducing the cross-domain representation divergence. Still SA shows an advantage with respect to JCSL in a few of the considered cases, most probably because it can exploit the discriminative LDA subspace. With respect to JCSL, TCA and SSTCA seem to work particularly well when the domain shift is small (e.g. Amazon → Caltech, Dslr → Webcam). Interestingly JCSL is the only method that consistently outperforms NA over MNIST+USPS.

Parameter analysis. To better understand the performance of JCSL we analyze how the target accuracy varies with respect to the source accuracy while changing the learning parameters α, β and d. The plots in Figure 2 consider four domain adaptation problems, namely (Amazon → Caltech), (Amazon → Webcam), (MNIST → USPS) and (USPS → MNIST)⁵. All of them present two main clusters. On the left, when the source accuracy is low, the target accuracy is uniformly distributed. This behavior mostly appears when β is very small and α has a high value: this indicates that minimizing only $\|V - T\|_F^2$ does not guarantee stable results on the target task. On the other hand, in the second cluster the source accuracy is highly correlated with the target accuracy. On average, for the points in this region, both the domain divergence term and the misclassification loss obtain low values. The final JCSL result with the optimal $(V^*, w^*)$ always appears in this area and the dimensionality of the subspace d seems to have only a moderate influence on the final results. The red line reported on the plots is obtained by least-square fitting over the source and target accuracies and presents an analogous trend for all the considered source-target pairs. This is an indication that when domains are adaptable (negligible λ in (1)) our method is able to find a good source representation as well as a classifier that generalizes to the target domain.
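The least-squares line and the source-target correlation mentioned in this analysis can be computed as in the short sketch below. The accuracy values are made-up placeholders chosen to lie on a line; they are not results from the paper.

```python
import numpy as np

# Hypothetical (source accuracy, target accuracy) pairs, one per parameter setting.
source_acc = np.array([40.0, 50.0, 60.0, 70.0])
target_acc = np.array([25.0, 30.0, 35.0, 40.0])

# Degree-1 least-squares fit: target_acc ~ slope * source_acc + intercept.
slope, intercept = np.polyfit(source_acc, target_acc, deg=1)

# Pearson correlation between the two accuracy series.
correlation = np.corrcoef(source_acc, target_acc)[0, 1]
```

A positive slope together with a correlation close to 1 corresponds to the second cluster described above, where source accuracy is a good proxy for target accuracy.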

Measuring the domain shift. For the same domain pairs considered above we also evaluate empirically the H∆H divergence measure defined in [2]. This is obtained by learning a linear SVM that discriminates between the source and target instances,

⁵ Analogous results are obtained for all the remaining source-target pairs.


[Figure 2 panels: four scatter plots of Target accuracy (%) (y-axis) against Source accuracy (%) (x-axis).]

Figure 2: Target accuracy vs source accuracy over the domain adaptation problems (Amazon → Caltech), (Amazon → Webcam), (MNIST → USPS) and (USPS → MNIST), obtained by using JCSL and changing the parameters α, β and d. In all the cases the top right point cluster shows the high correlation between the source and target accuracy. By comparing the x- and y-axis values of this cluster, the source-to-target performance drop with respect to the source-to-source result is also evident in each experiment. The red square indicates the result selected by our method for the considered split. The red line is obtained by least-square fitting and makes evident the trend in the results shared by all the source-target pairs.

[Figure 3 panels: two line plots of Target accuracy (%) (y-axis) against Number of classes (x-axis, 5 to 30). Legend: No Adaptation, PCA_T, SA, GFK, TCA, TSL, JCSL.]

Figure 3: Experimental results on Bing+Caltech obtained when using 5 (left) and 10 (right) samples per class in the source. SSTCA has shown similar or worse results than TCA, so we did not include it in this evaluation to avoid further clutter in the plot.

respectively pseudo-labeled with +1 and −1. We separated each domain into two halves and used them for training and test when learning a linear SVM model. A high final accuracy indicates high domain divergence. We perform this analysis by comparing the domain shift before and after the application of SA and JCSL, according to their standard settings. SA presents a single step and learns one subspace representation U. JCSL exploits a one-vs-all procedure, learning as many V_y as the number of classes: each step involves all the data (no per-class sample selection). The final domain shift for JCSL is the average over the obtained separate shift values. The results in Table 2 indicate that SA and JCSL produce comparable results in terms of domain-shift reduction, suggesting that the main advantage of JCSL comes from the learned classifier.

Table 2: H∆H analysis. Lower values indicate lower cross-domain distribution discrepancy.

Space               A→C     A→W     MNIST→USPS   USPS→MNIST
Original features   74.82   90.18   100.00       100.00
SA(LDA-PCA)         65.96   56.56   55.78        55.74
JCSL                65.76   54.97   57.03        53.28
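The domain-shift measurement described in this analysis can be approximated with a domain classifier: source samples are labeled +1, target samples −1, each domain is split into two halves, and the held-out accuracy of a linear classifier is reported. The sketch below is an assumption-laden stand-in rather than the authors' code: it trains the linear SVM with a simple full-batch sub-gradient loop in NumPy, and the values of `reg`, `eta` and `epochs` are illustrative choices.

```python
import numpy as np

def train_linear_svm(X, y, reg=0.01, eta=0.1, epochs=200):
    """Full-batch sub-gradient descent on the L2-regularized hinge loss (bias as extra column)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        active = y * (Xb @ w) < 1.0              # margin violators
        grad = reg * w - (y[active][:, None] * Xb[active]).sum(axis=0) / len(y)
        w -= eta * grad
    return w

def domain_separability(Xs, Xt, seed=0):
    """Held-out accuracy (%) of a source-vs-target classifier; higher means larger shift."""
    rng = np.random.default_rng(seed)
    Xs = Xs[rng.permutation(len(Xs))]
    Xt = Xt[rng.permutation(len(Xt))]
    hs, ht = len(Xs) // 2, len(Xt) // 2
    X_train = np.vstack([Xs[:hs], Xt[:ht]])
    y_train = np.concatenate([np.ones(hs), -np.ones(ht)])
    X_test = np.vstack([Xs[hs:], Xt[ht:]])
    y_test = np.concatenate([np.ones(len(Xs) - hs), -np.ones(len(Xt) - ht)])
    w = train_linear_svm(X_train, y_train)
    scores = np.hstack([X_test, np.ones((len(X_test), 1))]) @ w
    return 100.0 * np.mean(np.where(scores >= 0, 1.0, -1.0) == y_test)

# Toy domains (assumption): one strongly shifted target and one drawn like the source.
rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, size=(400, 10))
shifted_target = rng.normal(3.0, 1.0, size=(400, 10))
similar_target = rng.normal(0.0, 1.0, size=(400, 10))

high = domain_separability(source, shifted_target)
low = domain_separability(source, similar_target)
```

For a one-vs-all method with one subspace per class, the same function could be applied once per projected space and the resulting values averaged, mirroring the averaging described above.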


5.3 Results - Bing+Caltech

Due to the way in which it was defined, Bing+Caltech can be considered a much more challenging testbed for unsupervised domain adaptation than the other datasets used (see also Figure 1). At the same time it also corresponds to one of the most realistic scenarios where domain adaptation is needed: we have access to only a limited number of noisy labeled source images obtained from the web and we want to use them to classify over a curated collection of object images. For this problem, making the best use of all the available information is crucial. Specifically, since the source is not fully reliable, coding its discriminative information in the representation (e.g. through LDA or PLS) may be misleading. On the other hand, using the subspace of the non-noisy target data to guide the learning process can be much more beneficial.

As shown in Figure 3, JCSL is the only method that consistently improves over the non-adaptive approach independently of the number of considered classes. TSL is always equivalent to NA, while the other subspace methods, although initially helpful for problems with few classes, lose their advantage over NA when the number of classes increases. This behavior is almost the same with 5 and with 10 source samples per class.

5.4 Results - WiFi Localization

To demonstrate the generality of the proposed algorithm, we evaluate JCSL also on non-visual data. Since the WiFi vector dimensionality (100) is lower than the number of classes (247), we do not exploit LDA here but we simply apply PCA to define the subspace dimensionality for both the source and target domains. The results on the WiFi-localization task are reported in Table 3 and show that domain adaptation is clearly beneficial. TCA and SSTCA are the state-of-the-art linear methods on the WiFi dataset and they confirm their value even in the considered classification setting by outperforming SA and GFK. Still, JCSL presents the best results. The obtained classification accuracy confirms the value of our method over the other subspace-based techniques.
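The two evaluation protocols used for the WiFi data (accuracy on the full target set, and mean ± standard deviation over 10 random splits of 400 target samples) can be sketched as follows. The labels and predictions are synthetic placeholders, the target-set size of 3000 is an arbitrary assumption, and sampling without replacement within each split is also an assumption, since the text only says the samples are randomly extracted.

```python
import numpy as np

def accuracy_over_splits(y_true, y_pred, n_splits=10, split_size=400, seed=0):
    """Mean and standard deviation of accuracy (%) over random subsets of the target set."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        idx = rng.choice(len(y_true), size=split_size, replace=False)
        accs.append(100.0 * np.mean(y_true[idx] == y_pred[idx]))
    accs = np.array(accs)
    return accs.mean(), accs.std()

# Synthetic labels over 247 classes; roughly 80% of predictions are made wrong on purpose.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 247, size=3000)
y_pred = y_true.copy()
wrong = rng.random(3000) < 0.8
y_pred[wrong] = (y_true[wrong] + 1) % 247

full_acc = 100.0 * np.mean(y_true == y_pred)            # "full" protocol
mean_acc, std_acc = accuracy_over_splits(y_true, y_pred)  # "splits" protocol
```

Reporting both numbers shows whether the full-set accuracy is representative or sensitive to which target samples are drawn.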

6 Conclusions

Motivated by the theoretical results of Ben-David et al. [2], in this paper we proposed to integrate the learning process of the source prediction function with the optimization of the invariant subspace for unsupervised domain adaptation. Specifically, JCSL learns a representation that minimizes the divergence between the source subspace and the target subspace, while optimizing the classification model. Extensive experimental results have shown that, by taking advantage of the described principled combination and without the need of passing through the evaluation of the data distributions, JCSL outperforms other subspace domain adaptation methods that focus only on the representation part.

Recently several works have demonstrated that Convolutional Neural Network classifiers are robust to domain shift [7, 31]. Reasoning at a high level, we can identify the cause of such robustness in the same idea that underlies JCSL: deep architectures jointly learn a discriminative representation and the prediction function. The highly non-linear transformation of the original data coded into the CNN activation values can also be used as input data descriptors for JCSL with the aim of obtaining a combined effect. As future work we plan to evaluate principled ways to automatically find the best subspace dimensionality d using low-rank optimization methods.

Acknowledgments

The authors acknowledge the support of the FP7 EC project AXES and of the FP7 ERC Starting Grant 240530 COGNIMUND.

Table 3: Classification accuracy obtained over the WiFi localization dataset [42]. The "full" row contains the results over the whole target set. In the "splits" row we present the results obtained over 10 splits of the target, each containing 400 randomly extracted samples.

         NA           SA(LDA-PCA)   GFK(LDA-PCA)   TCA          SSTCA        JCSL
full     16.6         17.3          17.3           19.0         18.5         20.2
splits   16.9 ± 2.1   17.6 ± 2.2    17.4 ± 2.1     19.2 ± 2.1   18.0 ± 2.2   20.5 ± 2.4

References

[1] Mahsa Baktashmotlagh, Mehrtash Harandi, Brian Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In International Conference on Computer Vision (ICCV), 2013.

[2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Neural Information Processing Systems (NIPS), 2007.

[3] Alessandro Bergamo and Lorenzo Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Neural Information Processing Systems (NIPS), 2010.

[4] John Blitzer, Dean Foster, and Sham Kakade. Domain adaptation with coupled subspaces. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[5] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Empirical Methods in Natural Language Processing (EMNLP), 2006.

[6] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. J. Artif. Int. Res., 26(1):101–126, May 2006.

[7] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.

[8] Lixin Duan, Ivor W. Tsang, and Dong Xu. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 34(3):465–479, March 2012.

[9] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In International Conference on Machine Learning (ICML), 2009.

[10] Lixin Duan, Ivor W. Tsang, Dong Xu, and Stephen J. Maybank. Domain transfer SVM for video concept detection. In Computer Vision and Pattern Recognition (CVPR), 2009.

[11] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Subspace alignment for domain adaptation. CoRR, abs/1409.5241, 2014.

[12] Basura Fernando, Amaury Habrard, Marc Sebban, Tinne Tuytelaars, et al. Unsupervised visual domain adaptation using subspace alignment. In International Conference on Computer Vision (ICCV), 2013.

[13] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), 2013.

[14] Boqing Gong, Yuan Shi, Fei Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012.

[15] R. Gopalan, Ruonan Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In International Conference on Computer Vision (ICCV), 2011.

[16] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel method for the two sample problem. In Neural Information Processing Systems (NIPS), 2007.


[17] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13(1):723–773, 2012.

[18] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.

[19] Judy Hoffman, Brian Kulis, Trevor Darrell, and Kate Saenko. Discovering latent domains for multisource domain adaptation. In European Conference on Computer Vision (ECCV), 2012.

[20] Judy Hoffman, Erik Rodner, Jeff Donahue, Kate Saenko, and Trevor Darrell. Efficient learning of domain-invariant image representations. In International Conference on Learning Representations (ICLR), 2013.

[21] Judy Hoffman, Eric Tzeng, Jeff Donahue, Yangqing Jia, Kate Saenko, and Trevor Darrell. One-shot adaptation of supervised deep convolutional models. arXiv preprint arXiv:1312.6204, 2013.

[22] J.J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 16(5):550–554, May 1994.

[23] Hal Daumé III. Frustratingly easy domain adaptation. In Association for Computational Linguistics (ACL), 2007.

[24] I-Hong Jhuo, Dong Liu, D. T. Lee, and Shih-Fu Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), pages 2168–2175, 2012.

[25] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011.

[26] Abhishek Kumar, Avishek Saha, and Hal Daumé. Co-regularization based semi-supervised domain adaptation. In Neural Information Processing Systems (NIPS), pages 478–486, 2010.

[27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.

[28] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and P.S. Yu. Transfer feature learning with joint distribution adaptation. In International Conference on Computer Vision (ICCV), 2013.

[29] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.

[30] Jie Ni, Qiang Qiu, and Rama Chellappa. Subspace interpolation via dictionary learning for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), June 2013.

[31] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014.

[32] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.

[33] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 2014.

[34] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.

[35] Sumit Shekhar, Vishal M. Patel, Hien V. Nguyen, and Rama Chellappa. Generalized domain-adaptive dictionaries. In Computer Vision and Pattern Recognition (CVPR), 2013.

[36] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.

[37] Si Si, Dacheng Tao, and Bo Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering (T-KDE), 22:929–942, 2010.

[38] Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, and Jieping Ye. A two-stage weighting framework for multi-source domain adaptation. In Neural Information Processing Systems (NIPS), 2011.

[39] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 34(3), 2011.

[40] Chang Wang and Sridhar Mahadevan. Heterogeneous domain adaptation using manifold alignment. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.

[41] Jun Yang, Rong Yan, and Alexander G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In International Conference on Multimedia (ACM-MULTIMEDIA), pages 188–197, 2007.

[42] Qiang Yang, Sinno Jialin Pan, and Vincent Wenchen Zheng. Estimating location using Wi-Fi. IEEE Intelligent Systems, 23(1):8–13, 2008.
