
Adaptive Margin Diversity Regularizer for handling Data Imbalance in Zero-Shot SBIR

Titir Dutta, Anurag Singh, and Soma Biswas

Indian Institute of Science, Bangalore, India
{titird, anuragsingh2, somabiswas}@iisc.ac.in

Abstract. Data from new categories are continuously being discovered, which has sparked a significant amount of research into approaches that generalize to previously unseen categories, i.e., the zero-shot setting. Zero-shot sketch-based image retrieval (ZS-SBIR) is one such problem in the context of cross-domain retrieval, which has received a lot of attention due to its various real-life applications. Since most real-world training data contain a fair amount of imbalance, in this work, for the first time in the literature, we extensively study the effect of training data imbalance on generalization to unseen categories, with ZS-SBIR as the application area. We evaluate several state-of-the-art data imbalance-mitigating techniques and analyze their results. Furthermore, we propose a novel framework, AMDReg (Adaptive Margin Diversity Regularizer), which ensures that the embeddings of sketches and images in the latent space are not only semantically meaningful but also separated according to their class representations in the training set. The proposed approach is model-independent and can be incorporated seamlessly with several state-of-the-art ZS-SBIR methods to improve their performance under imbalanced conditions. Extensive experiments and analysis justify the effectiveness of the proposed AMDReg for mitigating the effect of data imbalance on generalization to unseen classes in ZS-SBIR.

1 Introduction

Sketch-based image retrieval (SBIR) [15][35], which deals with retrieving natural images given a hand-drawn sketch query, has gained significant traction because of its potential applications in e-commerce, forensics, etc. Since new categories of data are continuously being added to the system, it is important for algorithms to generalize well to unseen classes, which is termed Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) [6][5][16][7]. The majority of ZS-SBIR approaches learn a shared latent-space representation for both sketch and image, where sketches and images from the same category come closer to each other, and also incorporate additional techniques to facilitate generalization to unseen classes.

One important factor that has been largely overlooked in this task of generalization to unseen classes is the distribution of the training data. Real-world data used to train the model is not always class-wise or domain-wise well-balanced. When training and test categories are the same, as expected, class imbalance in the training data results in severe degradation of test performance, especially for the minority classes. Many seminal approaches have been proposed to mitigate this effect for the task of image classification [14][11][2][4], but the effect of data imbalance on generalization to unseen classes is relatively unexplored, both for single- and cross-domain applications. In fact, both of the large-scale datasets widely used for SBIR / ZS-SBIR, namely Sketchy Extended [25] and TU-Berlin Extended [8], exhibit data imbalance. In cross-domain data, there can be two types of imbalance: 1) domain imbalance, where the number of data samples in one domain is significantly different from that in the other domain; 2) class imbalance, where there is a significant difference in the number of data samples per class. TU-Berlin Ext. exhibits both types of imbalance. Although a recent paper [5] has attributed the poor retrieval performance on TU-Berlin Ext. to data imbalance, no measures have been proposed to handle it.

Here, we aim to study the effect of class imbalance in the training data on the retrieval performance for unseen classes in the context of ZS-SBIR; interestingly, we observe that the proposed framework works well even when both types of imbalance are present. We analyze several state-of-the-art approaches for mitigating the effect of training data imbalance on the final retrieval performance. To this end, we propose a novel regularizer, termed AMDReg (Adaptive Margin Diversity Regularizer), which ensures that the embeddings of the data samples in the latent space account for the distribution of classes in the training set. To facilitate generalization to unseen classes, the majority of ZS-SBIR approaches impose a direct or indirect semantic constraint on the latent space, which ensures that sketch and image samples from unseen classes encountered during testing are embedded in the neighborhood of their related seen classes. But merely imposing a semantic constraint does not account for the training class imbalance. The proposed AMDReg, computed from the class-wise training data distribution in the sketch and image domains, helps to appropriately position the semantic embeddings. It enforces a broader margin / spread for classes with fewer training samples compared to classes with a larger number of samples. Extensive analysis and evaluation on two benchmark datasets validate the effectiveness of the proposed approach. The contributions of this paper are summarized below.

1. We analyze the effect of class imbalance on generalization to unseen classes for the ZS-SBIR task. To the best of our knowledge, this is the first work in the literature which addresses the data-imbalance problem in the context of cross-domain retrieval.

2. We analyze the performance of several state-of-the-art techniques for handling the data imbalance problem for this task.

3. We propose a novel regularizer, termed AMDReg, which can seamlessly be used with several ZS-SBIR methods to improve their performance. We observe significant improvement in the performance of three state-of-the-art ZS-SBIR methods.

4. We obtain state-of-the-art performance for ZS-SBIR and generalized ZS-SBIR on two large-scale benchmark datasets.


2 Related Work

Here, we discuss relevant work in the literature for this study. We include recent papers on sketch-based image retrieval (SBIR) and zero-shot sketch-based image retrieval (ZS-SBIR), as well as on the class-imbalance problem in classification.

Sketch-based Image Retrieval (SBIR): The primary goal of these approaches is to bridge the domain gap between natural images and hand-drawn sketches. Early methods for SBIR, such as HOG [12] and LKS [24], extract hand-crafted features from the sketches as well as from the edge-maps obtained from natural images, which are then directly used for retrieval. The advent of deep networks has advanced the state-of-the-art significantly. A Siamese network [22] with triplet or contrastive loss and GoogLeNet [25] with triplet loss are some of the initial architectures. Recently, a number of hashing-based methods, such as [15][35], have achieved significant success. [15] uses a heterogeneous network, which employs the edge maps from images, along with the sketch-image training data, to learn a shared representation space. In contrast, GDH [35] exploits a generative model to learn the equivalent image representation from a given sketch and performs the final retrieval in the image space.

Zero-shot Sketch-based Image Retrieval (ZS-SBIR): The knowledge gap encountered by the retrieval model when a sketch query or database image comes from a previously unseen class makes ZS-SBIR extremely challenging. ZSIH [26] and generative-model based ZS-SBIR [32] are some of the pioneering works in this direction. However, as identified by [6], ZSIH [26] requires a fusion layer for learning the model, which increases the learning cost, and [32] requires strictly paired sketch-image data for training. Some recent works [5][6][7][16] have reported improved performance for ZS-SBIR over the early techniques. [6] introduces a further generalization of the evaluation protocol for ZS-SBIR, termed generalized ZS-SBIR, where the search set contains images from both the seen and unseen classes. This poses an even greater challenge to the algorithm, and performance degrades significantly under this protocol [6][7]. A few of the ZS-SBIR approaches are discussed in more detail later.

Handling Data Imbalance for Classification: Since real-world training data are often imbalanced, a number of works [14][11][2][4] have recently been proposed to address this problem. [14] mitigates the foreground-background class imbalance problem in the context of object detection and proposes a modification to the traditional cross-entropy based classification loss. [4] introduces an additional cost-sensitive term, designed on the basis of the effective number of samples in a particular class, that can be included with any classification loss. [2] and [11] both propose modifications to the margin of the class boundary, learned by minimizing intra-class variation and maximizing the inter-class margin. [17] discusses a dynamic meta-embedding technique to address classification under a long-tailed training data scenario.

Equipped with knowledge of recent algorithms for both ZS-SBIR and single-domain class-imbalance mitigation, we now move on to discuss the problem of imbalanced training data for cross-domain retrieval.


3 Does Imbalanced Training Data Affect ZS-SBIR?

First, we analyze the effect of training data imbalance on generalization to unseen classes in the context of ZS-SBIR. Here, for ease of analysis, we consider only class imbalance, but our approach is effective for mixed imbalance too, as the experimental results later justify. Since both standard datasets for this task, namely Sketchy Ext. [25] and TU-Berlin Ext. [8], are already imbalanced, to systematically study the effect of imbalance we create a smaller balanced dataset, a subset of the Sketchy Ext. dataset. This is termed the mini-Sketchy dataset and contains sketches and images from 60 classes, with 500 images and sketches per class. Among them, 10 randomly selected classes are used as unseen classes and the remaining 50 classes are used for training.

To study the effect of imbalance, motivated by the class-imbalance literature in image classification [14][11], we introduce two different types of class imbalance: 1) Step imbalance, where a few of the classes in the training set contain far fewer samples than the other classes; 2) Long-tailed imbalance, where the number of samples across the classes decreases gradually following the rule

\[
n^{lt}_k = n_k \, \mu^{\frac{k}{C_{seen}-1}}
\]

where n^{lt}_k is the number of available samples for the k-th class under the long-tailed distribution and n_k is the number of original samples of that class (= 500 here). Here, k ∈ {1, 2, ..., C_seen}, i.e., C_seen is the number of training classes, and µ = 1/p. We define the imbalance factor p of a particular data distribution as the ratio of the highest number of samples in any class to the lowest number of samples in any class in that data; a higher value of p implies more severe training class imbalance. Since this analysis deals with class imbalance, we assume that the numbers of data samples in the image and sketch domains are the same.
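For illustration, a minimal Python sketch of the long-tailed rule is shown below, assuming the exponent convention µ^(k/(C_seen − 1)) reconstructed above; the function name and defaults are illustrative, not taken from any released code.

```python
# Sketch of the long-tailed sampling rule above, assuming mu^(k / (C_seen - 1))
# with mu = 1/p; names and defaults are illustrative.
def long_tailed_counts(n_per_class=500, num_classes=50, p=10.0):
    """Return per-class sample counts n_k^lt for k = 1..num_classes."""
    mu = 1.0 / p
    return [max(1, round(n_per_class * mu ** (k / (num_classes - 1))))
            for k in range(1, num_classes + 1)]

counts = long_tailed_counts()
# The realized imbalance factor (max count / min count) should be close to p.
print(counts[0], counts[-1], counts[0] / counts[-1])
```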

As mentioned earlier, the proposed regularizer is generic and can be used with several baseline approaches to improve their performance in the presence of data imbalance. For this analysis, we choose a recent auto-encoder based approach [7], which we term the Baseline Model for this discussion, since the analysis is equally applicable to other approaches. We systematically introduce both step and long-tailed imbalance for two different values of p and observe the performance for each. The results are reported in Table 1.

As compared to the balanced setting, we observe significant degradation in the performance of the baseline whenever any kind of imbalance is present in the training data. This implies that training data imbalance not only affects test performance when the classes remain the same, it also significantly hurts generalization. This is because unseen classes are recognized by embedding them close to their semantically relevant seen classes, and data imbalance results in (1) a latent embedding space that is not sufficiently discriminative and (2) improperly learnt embedding functions, both of which negatively affect the embeddings of the unseen classes. The goal of the proposed AMDReg is to mitigate these limitations, which in turn helps in better generalization to unseen classes (Table 1, bottom row). Thus we see that if the imbalance is handled properly, it may reduce the need for collecting large-scale balanced training samples.


Table 1. Evaluation (MAP@200) of the Baseline Model [7] for ZS-SBIR on the mini-Sketchy dataset. Results for long-tailed and step imbalance with different imbalance factors are reported. The final performance using the proposed AMDReg is also compared.

                              Balanced      Long-tailed           Step
Experimental Protocol           data      p = 10   p = 100   p = 10   p = 100
Baseline [7]                   0.395      0.234    0.185     0.241    0.156
Baseline [7] + AMDReg            -        0.332    0.240     0.315    0.218

4 Proposed Approach

Here, we describe the proposed Adaptive Margin Diversity Regularizer (AMDReg), which, when used with existing ZS-SBIR approaches, helps mitigate the adverse effect of training data imbalance. We observe that the majority of state-of-the-art ZS-SBIR approaches [6][16][7] have two objectives: (1) projecting the sketches and images to a common discriminative latent space, where retrieval can be performed; and (2) ensuring that the latent space is semantically meaningful so that the approach generalizes to unseen classes. For the first objective, a classification loss is used while learning the shared latent space, which constrains the latent-space embeddings of sketches and images from the same class to cluster together and samples from different classes to be well-separated. For the second objective, different direct or indirect techniques are utilized to make the embeddings semantically meaningful and thus ensure better generalization.

Semantically Meaningful Class Prototypes: Without loss of generality, we again choose the same baseline [7] to explain how to incorporate the proposed AMDReg into an existing ZS-SBIR approach. Let us consider that there are C_seen classes present in the dataset, and let d be the latent-space dimension. The baseline model has two parallel branches, F_im(θ_im) and F_sk(θ_sk), for extracting features {f^(m)}, m ∈ {im, sk}, from images and sketches, respectively. These features are then passed through corresponding content encoder networks to obtain the shared latent-space embeddings, i.e., z^(m) = E_m(f^(m)). In [7], a distance-based cross-entropy loss is used to learn these latent embeddings such that they are close to the semantic information. As is widely done, the class-name embeddings h(y) of the seen-class labels y ∈ {1, 2, ..., C_seen} are used as the semantic information. These embeddings are extracted from a pre-trained language model such as word2vec [18] or GloVe [20]. Please refer to Fig. 1 for an illustration of the proposed AMDReg with respect to this baseline model.

The last fully connected (fc) layer of the encoders is essentially the classification layer, and the weights of this layer, P = [p_1, p_2, ..., p_{C_seen}], p_i ∈ R^d, can be considered as the shared class prototypes, i.e., the representatives of the corresponding classes [21]. To ensure a semantically meaningful latent representation, one can learn the prototypes (the p_i's) such that they are close to the class-name embeddings, or the prototypes can themselves be set equal to the semantic embeddings, i.e., p_i = h(y), and kept fixed. If the training data is imbalanced, merely ensuring that the prototypes are semantically meaningful is not sufficient; we should also ensure that they take into account the label distribution of the training data. In our modification, to be able to adjust the prototypes properly, instead of fixing them as the class embeddings, we initialize them with these attributes. The output of this fc layer is z^(m) = [z^(m)_1, z^(m)_2, ..., z^(m)_{C_seen}]; the encoder with the prototypes is learnt using the standard cross-entropy loss,

\[
\mathcal{L}_{CE}\big(z^{(m)}, y\big) = -\log \frac{\exp\big(z^{(m)}_y\big)}{\sum_{j=1}^{C_{seen}} \exp\big(z^{(m)}_j\big)} \tag{1}
\]
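As a concrete illustration, the following PyTorch sketch shows one plausible reading of Eq. (1), with the prototypes as the last fc layer initialized from the class-name embeddings as described above. The dot-product logits and all names (latent_dim, word_vecs, etc.) are our assumptions, not the exact implementation of [7].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, num_seen = 300, 50                 # d and C_seen
word_vecs = torch.randn(num_seen, latent_dim)  # stand-in for word2vec/GloVe h(y)

# Prototypes P = [p_1, ..., p_Cseen]: weights of the last fc layer,
# initialized from the class-name embeddings instead of being kept fixed.
prototypes = nn.Parameter(word_vecs.clone())

def ce_loss(z, y):
    """Eq. (1): cross-entropy over the fc outputs z^(m) = z P^T."""
    logits = z @ prototypes.t()                # shape: (batch, C_seen)
    return F.cross_entropy(logits, y)
```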

Now, with this background, we describe the proposed regularizer, AMDReg, which ensures that the prototypes are modified in such a way that they are spread out according to their class representation in the training set.

Adaptive Margin Diversity Regularizer: Our proposed AMDReg is inspired by the recently proposed Diversity Regularizer [11], which addresses data imbalance in image classification by adjusting the classifier weights (here, the prototypes) so that they are uniformly spread out in the feature space. In our context, this can be enforced by the following regularizer:

\[
R(P) = \frac{1}{C_{seen}} \sum_{i<j} \big[ \|p_i - p_j\|_2^2 - d_{mean} \big]^2, \quad \forall j \in \{1, 2, \ldots, C_{seen}\} \tag{2}
\]

Here, d_mean is the mean distance between all the class prototypes and is computed as

\[
d_{mean} = \frac{2}{C_{seen}^2 - C_{seen}} \sum_{i<j} \|p_i - p_j\|_2^2, \quad \forall j \in \{1, 2, \ldots, C_{seen}\} \tag{3}
\]

The above regularizer tries to spread out all the class prototypes without considering the amount of imbalance present in the training data. As observed in many recent works [2], due to an insufficient number of samples in the minority classes, their test samples are more likely to have a wider spread instead of being clustered around the class prototype during testing. For our problem, this implies greater uncertainty for samples of unseen classes that are semantically similar to the minority classes in the training set.

To this end, we propose to adjust the class prototypes adaptively, taking the data imbalance into account. Since there can be both class and domain imbalance in the cross-domain retrieval problem, we use the total number of sketch and image samples per class in the training set, and we refer to this combined number for the k-th class as the effective number of samples, n^{eff}_k, in this work. We then define the imbalance-based margin for the k-th class as

\[
\Delta_k = \frac{K}{n^{eff}_k} \tag{4}
\]


Fig. 1. Illustration of the proposed Adaptive Margin Diversity Regularizer (AMDReg). The AMDReg ensures that the embeddings of the shared prototypes of the images and sketches are not only placed away from each other, but also account for the increased uncertainty when the training class distribution is imbalanced. This results in better generalization to unseen classes.

This is similar to the inverse frequency of occurrence, except for the experimental hyper-parameter K. The final AMDReg is thus given by

\[
R_{\Delta}(P) = \frac{1}{C_{seen}} \sum_{i<j} \big[ \|p_i - p_j\|_2^2 - (d_{mean} + \Delta_j) \big]^2, \quad \forall j \in \{1, 2, \ldots, C_{seen}\} \tag{5}
\]

Thus, we adjust the relative distances between the p_i's such that each pair is separated by at least the mean distance plus the class-imbalance margin. This ensures that the prototypes of the minority classes have more margin around them, which reduces the chance of confusion with semantically similar unseen classes during testing. Finally, the encoder with the prototypes is learnt using the CE loss along with the AMDReg as

\[
\mathcal{L}^{AMDReg}_{CE} = \mathcal{L}_{CE} + \lambda R_{\Delta} \tag{6}
\]

where λ is an experimental hyper-parameter that controls the contribution of the regularizer to the learning.
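A minimal PyTorch sketch of Eqs. (2)-(6) is given below. It assumes that ∆_j is attached to the second (j) index of each prototype pair and that n_eff holds the combined sketch-plus-image counts per training class; all variable names are illustrative.

```python
import torch

def amdreg(prototypes, n_eff, K=1.0):
    """prototypes: (C_seen, d) matrix P; n_eff: (C_seen,) per-class counts."""
    C = prototypes.shape[0]
    sq_dist = torch.cdist(prototypes, prototypes) ** 2   # ||p_i - p_j||_2^2
    i, j = torch.triu_indices(C, C, offset=1)            # all pairs with i < j
    pair_d = sq_dist[i, j]
    d_mean = pair_d.mean()                               # Eq. (3)
    delta = K / n_eff                                    # Eq. (4)
    # Eq. (5): each pair should sit at least d_mean + Delta_j apart.
    return ((pair_d - (d_mean + delta[j])) ** 2).sum() / C

# Eq. (6): total_loss = ce_loss + lam * amdreg(prototypes, n_eff)
```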

Difference with Related Works: Even though the proposed AMDReg is inspired by [11], there are significant differences: (1) [11] addresses the imbalanced classification task for a single domain, while our work addresses generalization to unseen classes in the context of cross-domain retrieval (ZS-SBIR); (2) while [11] ensures that the weight vectors are equally spread out, AMDReg accounts for the training data distribution while designing the relative distances between the semantic embeddings; (3) finally, [11] works with the max-margin loss, whereas AMDReg is used with the standard CE loss while learning the semantic class prototypes.

The proposed approach also differs from another closely related work, LDAM [2]. The LDAM loss is a modification of the standard cross-entropy or hinge loss to incorporate class-wise margins. In contrast, the proposed AMDReg is a margin-based regularizer with adaptive margins between class prototypes, based on the corresponding representation of the classes in the training set. Thus, while [2] is inspired by a margin-based generalization bound, the proposed AMDReg is inspired by the widely used inverse frequency of occurrence.

4.1 Analysis with standard & SOTA imbalance-aware approaches

Here, we analyze how the proposed AMDReg compares with several existing state-of-the-art techniques for addressing training data imbalance, developed mainly for the task of image classification. These techniques can be broadly classified into two categories: (1) re-sampling techniques to balance the existing imbalanced dataset, and (2) cost-sensitive learning or modification of the classifier. For this analysis also, we use the same retrieval backbone [7]. In this context, we first compute the average number of samples per class in the dataset; any class with fewer samples than the average is considered a minority class, and the remaining are considered majority classes.

1) Re-balancing the dataset: Re-sampling is a standard and effective technique for balancing out a biased dataset distribution. The most common methods are under-sampling of the majority classes [1] and over-sampling of the minority classes [3]. We systematically apply such data-sampling techniques to the training data to address the class imbalance for ZS-SBIR, as discussed below. The re-sampled / balanced dataset created by each re-sampling operation is used for training the baseline network and reporting the retrieval performance.

1. Naive under-sampling: Here, we randomly select 1/p-th of the total samples per class for the majority classes and discard the remaining samples. Naturally, we lose a significant number of informative samples with such a random sampling technique.

2. Selective Decontamination [1]: This technique intelligently under-samples the majority classes instead of randomly throwing away excess samples. Following [1], we modify the Euclidean distance function d_E(x_i, x_j) between two samples x_i and x_j of the c-th class as

\[
d_{modified}(x_i, x_j) = \Big( \frac{n_c}{N} \Big)^{1/m} d_E(x_i, x_j) \tag{7}
\]

where n_c and N are the number of samples in the c-th class and in all classes, respectively, and m is the dimension of the feature space. We retain only those samples in the majority classes for which the class labels of the top-K nearest neighbors agree completely (a code sketch of this weighted distance follows the list below).

Page 9: Adaptive Margin Diversity Regularizer for handling Data ... · reported improved performance for ZS-SBIR over the early techniques. [6] intro-duces a further generalization in the

AMDReg for ZS-SBIR 9

3. Naive over-sampling: Here, the minority classes are augmented by repeating instances (as in [35]) and using standard image augmentation techniques (such as rotation, translation, flipping, etc.).

4. SMOTE [3]: In this intelligent over-sampling technique, instead of simply repeating samples, the minority classes are augmented by generating synthetic features along the line segments joining each minority-class sample with its K nearest neighbors.

5. GAN-based Augmentation [29]: Finally, we propose to augment the minority classes by generating features with the help of generative models, which have been very successful for zero-shot [29] / few-shot [19] / any-shot [30] image classification. Towards that goal, we use the f-GAN [29] model to generate synthetic features for the minority classes using their attributes and augment the available training data with these features to reduce the imbalance.
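As referenced in item 2 above, the following numpy sketch shows the class-weighted distance of Eq. (7); the retention rule in the comment is our paraphrase of [1], and all names are illustrative.

```python
import numpy as np

def modified_distance(xi, xj, n_c, N):
    """Eq. (7): Euclidean distance scaled by (n_c / N)^(1/m), m = feature dim.
    Smaller classes get proportionally smaller distances, so their samples
    win more nearest-neighbour votes."""
    m = xi.shape[0]
    return (n_c / N) ** (1.0 / m) * np.linalg.norm(xi - xj)

# Usage idea: for each majority-class sample, find its top-K neighbours under
# this distance and keep the sample only if all K neighbour labels agree.
```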

2) Cost-sensitive Learning of the Classifier: The goal of cost-sensitive learning methods is to learn a better classifier from the original imbalanced training data, using a more suitable loss function that accounts for the data imbalance. To observe the effect of different kinds of losses, we replace the distance-based CE loss in the baseline model with the following ones, keeping the rest of the network fixed.

1. Focal loss: This loss [14] was proposed to address the foreground-background class imbalance issue in the context of object detection. It is a simple yet effective modification of the standard cross-entropy loss in which easy or well-classified samples are given less weight than difficult samples (a code sketch follows this list).

2. Class-balanced Focal Loss: A variant of focal loss, recently proposed in [4], which incorporates the effective number of samples per class in the imbalanced dataset.

3. Diversity Regularizer: This recently proposed regularizer [11] ensures that both majority and minority classes are at equal distances from each other in the latent space; it reported significant performance improvements on imbalanced training data for image classification.

4. LDAM: [2] proposes a margin-based modification of the standard cross-entropy or hinge loss to ensure that the classes are well separated from each other.
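For reference, a minimal multi-class focal loss sketch is shown below; γ = 2 is a common default from [14], not a value stated in this paper.

```python
import torch.nn.functional as F

def focal_loss(logits, y, gamma=2.0):
    """FL = -(1 - p_t)^gamma * log(p_t): down-weights easy samples."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```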

The retrieval performance obtained with these imbalance-handling methods is reported in Table 2. We observe that all the techniques result in varying degrees of improvement over the base model. Among the data augmentation techniques, GAN-based augmentation outperforms the other approaches. In general, all the cost-sensitive learning techniques perform quite well, especially the recently proposed diversity regularizer and the LDAM cross-entropy loss. However, the proposed AMDReg outperforms both the data balancing and cost-sensitive learning approaches, giving the best performance across all types and degrees of imbalance.


Table 2. ZS-SBIR performance (MAP@200) of different imbalance-handling techniques applied to the Baseline Model [7] on the mini-Sketchy dataset. Results of the original Baseline Model are also reported for reference.

Imbalance        Methods                          Long-tailed          Step
Handler                                         p = 10  p = 100   p = 10  p = 100
-                Baseline Model [7]              0.234   0.185     0.241   0.156
Data balancing   Naive under-sampling            0.235   0.191     0.256   0.159
methods          Naive over-sampling             0.269   0.219     0.258   0.155
                 Selective decontamination [1]   0.268   0.221     0.251   0.164
                 SMOTE [3]                       0.269   0.217     0.269   0.183
                 GAN-based Augmentation [29]     0.305   0.229     0.274   0.188
Loss-            Focal loss [14]                 0.273   0.228     0.289   0.195
modification     Class-balanced Focal Loss [4]   0.299   0.236     0.296   0.210
techniques       Diversity-Regularizer [11]      0.296   0.222     0.285   0.207
                 LDAM-CE loss [2]                0.329   0.234     0.310   0.213
-                Proposed AMDReg                 0.332   0.240     0.315   0.218

5 Experimental Evaluation on ZS-SBIR

Here, we provide details of the extensive experiments performed to evaluate the effectiveness of the proposed AMDReg for handling data imbalance in ZS-SBIR.

Datasets Used and Experimental Protocol: We have used two large-scale standard benchmarks for evaluating ZS-SBIR approaches, namely Sketchy Ext. [25] and TU-Berlin Ext. [8].
Sketchy Ext. [25] originally contained approximately 75,000 sketches and 12,500 images from 125 object categories. Later, [15] collected and added 60,502 additional images to this dataset. Following the standard protocol [6][16], we randomly choose 25 classes as unseen classes (sketches as queries and images in the search set) and the remaining 100 classes for training.
TU-Berlin Ext. [8] originally contained 80 hand-drawn sketches per class from a total of 250 classes. To make it a better fit for large-scale experiments, [34] added 204,489 images. Following the literature [6][7], we randomly select 30 classes as unseen, while the remaining 220 classes are used for training.

The dataset statistics are shown in Fig. 2, which depicts the data imbalance in both datasets. This is especially evident in TU-Berlin Ext., which has a huge domain-wise imbalance as well as class-wise imbalance. These real-world datasets reinforce the importance of handling data imbalance for the ZS-SBIR task.

5.1 State-of-the-art ZS-SBIR approaches integrated with AMDReg

As already mentioned, the proposed AMDReg is generic and can be seamlessly integrated with most state-of-the-art ZS-SBIR approaches to handle training data imbalance. Here, we integrate AMDReg with three state-of-the-art approaches, namely (1) the semantically-tied paired cycle-consistency based network (SEM-PCYC) [6]; (2) semantic-aware knowledge preservation for ZS-SBIR (SAKE) [16]; and (3) the style-guided network for ZS-SBIR [7]. We now briefly describe the three approaches along with the integration of AMDReg.

Fig. 2. Dataset statistics of sketches and images of Sketchy-extended and TU-Berlin-extended, shown in the first two and last two plots, respectively, in that order.

SEM-PCYC [6] with AMDReg: SEM-PCYC is a generative model with two separate branches for image and sketch, performing visual-to-semantic mapping with a cyclic consistency loss. Further, to ensure that the semantic output of the generators is also class-discriminative, a classification loss is used. This classifier is pre-trained on the seen-class training data and kept frozen while the whole retrieval model is trained. We modify the training methodology by letting the classifier train along with the rest of the model and including the AMDReg with the CE loss. Here, the semantic information is enforced through an auto-encoder, which takes a hierarchical and a text-based model as input, and thus the weights are randomly initialized. Please refer to [6] for more details.

SAKE [16] with AMDReg: This ZS-SBIR method extends the concept of domain adaptation to fine-tune a model pre-trained on ImageNet [23] for the specific ZS-SBIR datasets. The network contains a shared branch to extract features from both sketches and images, which are later used for the categorical classification task using the soft-max CE loss. Simultaneously, the semantic structure with respect to the ImageNet [23] classes is maintained. Here also, we modify the CE loss using the proposed AMDReg to mitigate the adverse effect of training data imbalance. The rest of the branches and the proposed SAKE loss remain unchanged. Please refer to [16] for more details of the base algorithm.

Style-guide [7] with AMDReg: This is a two-step process, where the shared latent space is learnt first. Then, the latent-space content extracted from the sketch query is combined with the styles of the relevant images to obtain the final retrieval in the image space. While learning the latent space, a distance-based cross-entropy loss is used, which is modified as explained in detail earlier. Please refer to [7] for more details of the base algorithm.

Implementation Details: The proposed regularizer is implemented in PyTorch. We use a single Nvidia GeForce GTX TITAN X for all our experiments.


Table 3. Performance of several state-of-the-art approaches for ZS-SBIR and generalized ZS-SBIR.

                                          TU-Berlin extended    Sketchy-extended
              Algorithms                  MAP@all  Prec@100    MAP@all  Prec@100
SBIR          Softmax Baseline             0.089    0.143       0.114    0.172
              Siamese CNN [22]             0.109    0.141       0.132    0.175
              SaN [33]                     0.089    0.108       0.115    0.125
              GN Triplet [25]              0.175    0.253       0.204    0.296
              3D shape [28]                0.054    0.067       0.067    0.078
              DSH (binary) [15]            0.129    0.189       0.171    0.231
              GDH (binary) [35]            0.135    0.212       0.187    0.259
ZSL           CMT [27]                     0.062    0.078       0.087    0.102
              DeViSE [10]                  0.059    0.071       0.067    0.077
              SSE [36]                     0.089    0.121       0.116    0.161
              JLSE [37]                    0.109    0.155       0.131    0.185
              SAE [13]                     0.167    0.221       0.216    0.293
              FRWGAN [9]                   0.110    0.157       0.127    0.169
              ZSH [31]                     0.141    0.177       0.159    0.214
Zero-Shot     ZSIH (binary) [26]           0.223    0.294       0.258    0.342
SBIR          ZS-SBIR [32]                 0.005    0.001       0.196    0.284
              SEM-PCYC [6]                 0.297    0.426       0.349    0.463
              SEM-PCYC + AMDReg            0.330    0.473       0.397    0.494
              Style-guide [7]              0.254    0.355       0.375    0.484
              Style-guide + AMDReg         0.291    0.376       0.410    0.512
              SAKE [16]                    0.428*   0.534*      0.547    0.692
              SAKE + AMDReg                0.447    0.574       0.551    0.715
Generalized   Style-guide [7]              0.149    0.226       0.330    0.381
Zero-Shot     SEM-PCYC [6]                 0.192    0.298       0.307    0.364
SBIR          SEM-PCYC + AMDReg            0.245    0.303       0.320    0.398

For all the experiments, we set λ = 10^3 and K = 1. The Adam optimizer is used with β1 = 0.5, β2 = 0.999 and a learning rate of 10^{-3}. The implementation of the different baselines and the choice of their hyper-parameters follow the corresponding papers.
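Putting the stated hyper-parameters together, a hypothetical training step could look as follows, reusing the ce_loss and amdreg sketches from Sec. 4; the encoder and batch here are placeholders, not the actual backbones used in the experiments.

```python
import torch
import torch.nn as nn

d, C = 300, 50
encoder = nn.Linear(512, d)                      # stand-in for E_m(f^(m))
prototypes = nn.Parameter(torch.randn(C, d))
n_eff = torch.randint(50, 1000, (C,)).float()    # per-class sketch+image counts

optimizer = torch.optim.Adam(list(encoder.parameters()) + [prototypes],
                             lr=1e-3, betas=(0.5, 0.999))
lam, K = 1e3, 1.0                                # lambda = 10^3, K = 1 as stated

features = torch.randn(32, 512)                  # a batch of backbone features
labels = torch.randint(0, C, (32,))
z = encoder(features)
logits = z @ prototypes.t()
loss = nn.functional.cross_entropy(logits, labels) + lam * amdreg(prototypes, n_eff, K)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```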

5.2 Evaluation for ZS-SBIR

Here, we report the results of the modifications to the state-of-the-art approaches for ZS-SBIR. We first train all three original models (as described before) to replicate the results reported in the respective papers. Using the code provided by the authors, we are able to replicate all the results for SEM-PCYC and Style-guide as reported. However, for SAKE, in two cases the results we obtain are slightly different from those reported in the paper. We therefore report the results as we obtained them, for fair evaluation of the proposed improvement (marked with a star to indicate that they differ from the numbers reported in the paper).

We incorporate the proposed modifications for AMDReg in all three approaches and retrain the models. The results are reported in Table 3; all results of the other approaches are taken directly from [6]. We observe significant improvement in the performance of all the state-of-the-art approaches when trained with the proposed regularizer. This experiment shows that by handling the data imbalance inherently present in the collected data, it is possible to gain significant improvement in the final performance. Since AMDReg is generic, it can potentially be incorporated into other approaches developed for the ZS-SBIR task to handle the training data imbalance problem.

Fig. 3. Performance comparison of the base model (SEM-PCYC) and the modified base model using the proposed AMDReg: (a) a few examples of top-5 retrieved images for given unseen sketch queries from the TU-Berlin dataset; (b) P-R curve on the Sketchy dataset; (c) P-R curve on the TU-Berlin dataset.

Fig. 3 shows the top-5 retrieved results for a few unseen queries (first column), using SEM-PCYC as the baseline model, without and with AMDReg, respectively. We observe significant improvement when AMDReg is used, justifying its effectiveness. We make similar observations from the P-R curves in Fig. 3.

5.3 Evaluation for Generalized ZS-SBIR

In real scenarios, the search set may consist of image samples from both seen and unseen classes, which makes the problem much more challenging. This is termed generalized ZS-SBIR. To evaluate the effectiveness of the proposed AMDReg in this scenario, we follow the experimental protocol of [6], with SEM-PCYC [6] as the base model. From the results in Table 3, we observe that AMDReg significantly improves the performance of the base model and yields state-of-the-art results in three out of the four cases. Only on Sketchy Ext. does it perform slightly worse than Style-guide, but it still improves upon its baseline performance.

5.4 Evaluation for SBIR

Though the main purpose of this work is to analyze the effect of training data imbalance on generalization to unseen classes, the approach should also benefit standard SBIR in the presence of imbalance.

Table 4. SBIR evaluation (MAP@200) of the Baseline Model [7] on mini-Sketchy.

Balanced   Step Imb.   GAN-based   CB Focal   Diversity          Proposed
  Data     (p = 100)   Aug. [29]   Loss [4]   Regularizer [11]   AMDReg
 0.839       0.571       0.580       0.613      0.636             0.647

We observe from Table 4 that the performance of SBIR indeed decreases drastically with training data imbalance. The proposed AMDReg mitigates this by a significant margin compared to the other state-of-the-art imbalance-handling techniques. We further analyze the performance of SEM-PCYC [6] on the Sketchy Ext. dataset under the standard SBIR protocol, with and without AMDReg. We observe significant improvement when the proposed AMDReg is used (MAP@all: 0.811; Prec@100: 0.897) compared to the baseline SEM-PCYC (MAP@all: 0.771; Prec@100: 0.871).

6 Conclusion

In this work, for the first time in the literature, we analyzed the effect of training data imbalance on generalization to unseen classes in the context of ZS-SBIR. We observe that most real-world SBIR datasets are in fact imbalanced, and that this imbalance adversely affects generalization. We systematically evaluate several state-of-the-art imbalance-mitigating approaches (for classification) on this problem. Additionally, we propose a novel adaptive margin diversity regularizer (AMDReg), which ensures that the shared latent-space embeddings of the images and sketches account for the data imbalance in the training set. The proposed regularizer is generic, and we show how it can be seamlessly incorporated into three existing state-of-the-art ZS-SBIR approaches with slight modifications. Finally, we show that the proposed AMDReg results in significant improvements under both the ZS-SBIR and generalized ZS-SBIR protocols, setting new state-of-the-art results.

Acknowledgement

This work is partly supported through a research grant from SERB, Department of Science and Technology, Government of India.


References

1. Barandela, R., Rangel, E., Sanchez, J.S., Ferri, F.J.: Restricted decontamination for the imbalanced training sample problem. Iberoamerican Congress on Pattern Recognition, Springer (2003)
2. Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. NeurIPS (2019)
3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321-357 (2002)
4. Cui, Y., Jia, M., Lin, T.Y., Song, Y.: Class-balanced loss based on effective number of samples. CVPR (2019)
5. Dey, S., Riba, P., Dutta, A., Llados, J., Song, Y.Z.: Doodle to search: practical zero-shot sketch-based image retrieval. CVPR (2019)
6. Dutta, A., Akata, Z.: Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. CVPR (2019)
7. Dutta, T., Biswas, S.: Style-guided zero-shot sketch-based image retrieval. BMVC (2019)
8. Eitz, M., Hays, J., Alexa, M.: How do humans sketch objects? ACM TOG 31(4), 44.1-44.10 (2012)
9. Felix, R., Kumar, V.B., Reid, I., Carneiro, G.: Multi-modal cycle-consistent generalized zero-shot learning. ECCV (2018)
10. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. NeurIPS (2013)
11. Hayat, M., Khan, S., Zamir, S.W., Shen, J., Shao, L.: Gaussian affinity for max-margin class imbalanced learning. ICCV (2019)
12. Hu, R., Collomosse, J.: A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. CVIU 117(7), 790-806 (2013)
13. Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. CVPR (2017)
14. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. arXiv:1708.02002 [cs.CV] (2018)
15. Liu, L., Shen, F., Shen, Y., Liu, X., Shao, L.: Deep sketch hashing: fast free-hand sketch-based image retrieval. CVPR (2017)
16. Liu, Q., Xie, L., Wang, H., Yuille, A.: Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. ICCV (2019)
17. Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., Yu, S.X.: Large-scale long-tailed recognition in an open world. CVPR (2019)
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. NeurIPS (2013)
19. Mishra, A., Reddy, S.K., Mittal, A., Murthy, H.A.: A generative model for zero-shot learning using conditional variational auto-encoders. CVPR-W (2018)
20. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. EMNLP (2014)
21. Qi, H., Brown, M., Lowe, D.G.: Low-shot learning with imprinted weights. CVPR (2018)
22. Qi, Y., Song, Y.Z., Zhang, H., Liu, J.: Sketch-based image retrieval via siamese convolutional neural network. ICIP (2016)
23. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Li, F.F.: ImageNet: large-scale visual recognition challenge. IJCV 115(3), 211-252 (2015)
24. Saavedra, J.M., Barrios, J.M.: Sketch-based image retrieval using learned keyshapes (LKS). BMVC (2015)
25. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The Sketchy database: learning to retrieve badly drawn bunnies. ACM TOG (2016)
26. Shen, Y., Liu, L., Shen, F., Shao, L.: Zero-shot sketch-image hashing. CVPR (2018)
27. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. NeurIPS (2013)
28. Wang, M., Wang, C., Wu, J.X., Zhang, J.: Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. VLDB (2015)
29. Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating networks for zero-shot learning. CVPR (2018)
30. Xian, Y., Sharma, S., Schiele, B., Akata, Z.: f-VAEGAN-D2: a feature generating framework for any-shot learning. CVPR (2019)
31. Yang, Z., Cohen, W.W., Salakhutdinov, R.: Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861 (2016)
32. Yelamarthi, S.K., Reddy, S.K., Mishra, A., Mittal, A.: A zero-shot framework for sketch-based image retrieval. ECCV (2018)
33. Yu, Q., Yang, Y., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M.: Sketch-a-net that beats humans. BMVC (2015)
34. Zhang, J., Liu, S., Zhang, C., Ren, W., Wang, R., Cao, X.: SketchNet: sketch classification with web images. CVPR (2016)
35. Zhang, J., Shen, F., Liu, L., Zhu, F., Yu, M., Shao, L., Shen, H.T., Van Gool, L.: Generative domain-migration hashing for sketch-to-image retrieval. ECCV (2018)
36. Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing 24(12), 4766-4779 (2015)
37. Zhang, Z., Saligrama, V.: Zero-shot learning via joint latent similarity embedding. CVPR (2016)