
Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

Qi Fan∗, Tencent
[email protected]

Wei Zhuo∗, Tencent
[email protected]

Yu-Wing Tai, Tencent
[email protected]

arXiv:1908.01998v1 [cs.CV], 6 Aug 2019

Abstract

Conventional methods for object detection usually require a substantial amount of training data, and preparing such high-quality training data is labor-intensive. In this paper, we propose few-shot object detection, which aims to detect objects of unseen classes with only a few training examples. Central to our method are the Attention-RPN and the multi-relation module, which fully exploit the similarity between the few-shot training examples and the test set to detect novel objects while suppressing false detections in the background. To train our network, we have prepared a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is also the first dataset specifically designed for few-shot object detection. Once our network is trained, we can apply object detection to unseen classes without further training or fine-tuning. This is also the major advantage of few-shot object detection. Our method is general and has a wide range of applications. We demonstrate the effectiveness of our method quantitatively and qualitatively on different datasets. The dataset link is: https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.

1. Introduction

Object detection has a wide range of applications. Existing object detection methods usually rely heavily on a huge amount of annotated data and a long training time. They are also hard to extend to unseen objects that were not annotated in the training data. In contrast, the human visual system is excellent at recognizing new objects with very little supervision. This inspires us to develop a few-shot object detection algorithm.

Few-shot learning is challenging due to the large variance of objects in illumination, shape, texture, etc., in the real world. In recent years, it has achieved great progress [1, 2, 3, 4, 5, 6, 7, 8].

∗equal contribution

Figure 1. Illustration of our approach. Given different objects as supports, our approach detects all objects of the same categories in the given query image.

These methods, however, all focus on image classification; few-shot object detection remains rarely explored. This is because transferring the experience from few-shot classification to few-shot object detection is a non-trivial task. Object detection from a few shots confronts one crucial problem: how to localize an unseen object in a cluttered background, namely the generalization problem of localizing objects from only a few training examples of novel categories. In practice, candidate bounding boxes may miss the unseen objects entirely or produce many false detections on the background. This can be traced to improperly low scores assigned to good bounding boxes by the region proposal network (RPN), which makes a novel object hard to detect. This problem makes few-shot object detection intrinsically different from few-shot classification.

In this work, we aim to solve the problem of few-shot object detection. Given a few support images of a target object, our goal is to detect all foreground objects in the test set that belong to the target object category, as shown in Fig. 1. To this aim, we make two major contributions. First, we propose a general few-shot detection model that can detect novel objects without re-training or fine-tuning. Our method fully exploits the matching relationship between object pairs in a siamese network at multiple network stages.


Experiments show that our model benefits from the attention module at an early stage, which enhances the proposal quality, and from the multi-relation module at a late stage, which suppresses and filters out false detections on confusing backgrounds. Second, to train the model, we build a large, well-annotated dataset with 1000 categories and a few examples for each category. This dataset promotes the general learning of object detection: our method achieves much better performance by training on this dataset than on a benchmark dataset of larger scale, e.g. COCO [9]. To the best of our knowledge, this is the first attempt to construct a few-shot object detection dataset with so many classes.

In summary, we propose a novel model for few-shot object detection with carefully designed attention modules on the RPN and the detector. Owing to the matching strategy [10], our model has the important attribute that its training stage is exactly the same as its testing stage. This essentially enables our model to be tested online on objects of novel categories, requiring no pre-training or further network adaptation.

2. Related Works

General Object Detection. Object detection is a classical problem in computer vision. In early years, object detection was usually formulated as a sliding-window classification problem using handcrafted features [11, 12, 13]. With the rise of deep learning [14], CNN-based methods have become the dominant object detection solution. Most of these methods can be divided into two general approaches: proposal-free detectors and proposal-based detectors. The first line of work follows a one-stage training strategy and does not generate proposal boxes explicitly [15, 16, 17, 18, 19]. The second line, pioneered by R-CNN [20], first extracts class-agnostic region proposals of the potential objects in an image; these boxes are then further refined and classified into different categories by a specific module [21, 22, 23, 24]. This strategy filters out the majority of negative locations through the RPN module and facilitates the subsequent detector. For this reason, RPN-based methods usually perform better than proposal-free ones and lead the state of the art in detection [24]. The methods mentioned above, however, work in an intensively supervised manner and are hard to extend to novel categories with only a few examples. This fact raises research interest in few-shot learning.

Few-shot Learning. Few-shot learning, as a classic problem [25], is challenging for traditional machine learning algorithms, which struggle to learn from just a few training examples. Earlier works attempt to learn a general prior [26, 27, 28, 29, 30], such as hand-designed strokes or parts, that can be shared across categories. Some other works [31, 32, 1, 33] focus on metric learning, aiming to manually design the distance formulation among different categories. A recent trend is to design a general agent/strategy that can guide supervised learning within each task; by accumulating knowledge, the network acquires the ability to capture the structure that varies across tasks. This research direction is generally named meta-learning [10, 5, 34, 2, 35]. In this area, [10] proposed a siamese network consisting of twin networks that share weights, where each network is fed a support image or a query individually. The distance between the query and its support is naturally learned by logistic regression. This matching strategy intrinsically captures the variance between support and query regardless of their categories. Within this matching framework, subsequent works [4, 36, 8, 37, 3, 6] focus on enhancing the feature embedding, where one direction is to build memory modules that capture global contexts among the supports. Some works [38, 39] exploit local descriptors to reap additional knowledge from the limited data. [40, 41] introduce Graph Neural Networks (GNNs) to model the relationship between different categories. [42] traverses the entire support set and identifies task-relevant features to make metric learning in high-dimensional space more effective. There are also other works, such as [2, 43], dedicated to learning a general agent that guides parameter optimization.

Until now, few-shot learning has achieved much progress. It, however, mostly focuses on the classification task and rarely on other computer vision tasks, such as semantic segmentation [44, 45, 46], human motion prediction [47] and object detection [48]. As mentioned before, few-shot object detection is intrinsically different from general few-shot learning on classification. [49] harnesses unlabeled data and optimizes multiple modules alternately on images without box annotations. It may be misled by incorrect detections under weak supervision and requires re-training for a new category, so it is out of our scope. LSTD [48] proposed a few-shot object detection framework that transfers knowledge from one large dataset to another small one by minimizing the gap between the classification posteriors of the source domain and the target domain. This method, however, strongly depends on the source domain and is hard to extend to scenarios very different from it.

Our work is mainly motivated by the research line pioneered by the matching network [10]. We propose a general few-shot object detection network that learns matching on image pairs based on the Faster R-CNN framework, equipped with our attention modules at multiple scales.

3. FSOD Dataset: A Highly-Diverse Few-Shot Object Detection Dataset

The key to few-shot learning lies in the generalization ability of the model on novel categories. A highly diverse dataset with a large number of object categories is therefore necessary to train a model that is general enough to detect unseen objects, and to provide a convincing evaluation.


Figure 2. Dataset label tree. The ImageNet categories (red circles) are merged with Open Image categories (green circles). The superclass is borrowed from Open Image.

                    Train         Test
No. Class           800           200
No. Image           52350         14152
No. Box             147489        35102
Avg No. Box / Img   2.82          2.48
Min No. Img / Cls   22            30
Max No. Img / Cls   208           199
Avg No. Img / Cls   75.65         74.31
Box Size            [6, 6828]     [13, 4605]
Box Area Ratio      [0.0009, 1]   [0.0009, 1]
Box W/H Ratio       [0.0216, 89]  [0.0199, 51.5]

Table 1. Dataset summary. We construct a challenging dataset with large variance in box sizes and aspect ratios.

Figure 3. Dataset histogram for image distribution per class.

However, existing datasets [9, 50, 51, 52, 53] contain very limited categories and do not follow the few-shot evaluation setting. To this aim, we build a new few-shot object detection dataset. In this section, we introduce our dataset and give a full analysis of it.

Dataset Construction. We build our dataset from existing large-scale supervised object detection datasets, i.e. [54, 52]. These datasets, however, cannot be used directly because: 1) the label systems of the datasets are inconsistent, with objects of the same semantics annotated with different words; 2) a large portion of the annotations is unsatisfactory due to inaccurate or missing labels, duplicate boxes, overly large objects, etc.; 3) their train/test splits contain the same categories, whereas for a few-shot dataset we want the train and test sets to contain different categories in order to evaluate generality on unseen objects. To build the dataset, we first summarize a label system from [54, 52]: we merge the leaf labels of their original label trees, group those of the same semantics (such as ice bear and polar bear) into one category, and remove semantics that do not belong to any leaf category. Then, we remove images with poor labeling quality and those with boxes of improper size; specifically, we remove boxes smaller than 0.05% of the image size, which usually have bad visual quality and are unsuitable to serve as support examples. Next, we follow the few-shot learning principle and split our data into a training set and a test set whose categories have no overlap. We construct the training set with categories from the MS COCO dataset [9] and the ImageNet dataset [50], in case researchers need a pretraining stage. We then form the test set, which contains 200 categories, by choosing those with the largest distance to the existing training categories, where the distance is the shortest path connecting the senses of the two phrases in the is-a taxonomy [55]; a toy sketch of this criterion is given below. The remaining categories are merged into the training set, which in total contains 800 categories. Altogether, we construct a dataset of 1000 categories with a very clear category split for training and testing, where 531 categories come from the ImageNet dataset and 469 from the Open Image dataset [52].
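As an illustration of this split criterion, the following is a toy sketch assuming NLTK's WordNet interface; the function name, the first-sense choice, and the fallback for unmatched phrases are our own assumptions, not the paper's pipeline:

```python
from nltk.corpus import wordnet as wn

def category_distance(name_a, name_b):
    # Shortest is-a path between the first noun senses of two category
    # names; returns infinity when a phrase or a connecting path is missing.
    sa = wn.synsets(name_a.replace(' ', '_'), pos=wn.NOUN)
    sb = wn.synsets(name_b.replace(' ', '_'), pos=wn.NOUN)
    if not sa or not sb:
        return float('inf')
    d = sa[0].shortest_path_distance(sb[0])
    return float('inf') if d is None else d

# Test categories would then be chosen to maximize, e.g.,
# min(category_distance(c, t) for t in train_categories).
```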

Dataset Analysis. Our dataset is specifically designed for few-shot learning and for evaluating the generality of a model on novel categories. It contains 1000 categories with an 800/200 split between the training and test sets, around 66,000 images and 182,000 bounding boxes in total. Detailed statistics are shown in Table 1 and Fig. 3. Our dataset has the following attributes.

• High diversity of categories: Our dataset contains 83 parent semantics, such as mammal, clothing, weapon, etc., which are further split into 1000 leaf categories. Our label tree is shown in Fig. 2. Due to our strict dataset split, the train and test sets contain very different semantic categories, which makes the evaluation of models particularly challenging.

• Challenging setting: Our dataset is challenging for evaluation. It contains objects with large variance in box size and aspect ratio, and 26.5% of test-set images contain no fewer than three objects. It is also worth noting that our test set contains a large number of boxes of categories not included in our label system, which creates a further challenge for a few-shot model.

Even though our dataset has a large number of categories, our training images and boxes are much fewer than those of a benchmark dataset such as MS COCO, which contains 123,287 images and around 886,000 bounding boxes. Thus, our dataset is compact yet efficient for few-shot learning.

4. Our Methodology

In this section, we introduce our novel few-shot object detection network. Before that, we first define the task of few-shot detection.

4.1. Problem Definition

We define the task of few-shot object detection as follows. Given a support image s_k with a close-up of one target object and a separate query image q_k that potentially contains objects of the support category k, the task is to find all objects of the support category in the query image and label them with tight bounding boxes. If the support set contains K categories with N examples for each category, we call the task K-way N-shot detection.

4.2. Deep Attentioned Few-Shot Detection

Few-shot recognition is challenging because it is hard to capture what is common to a category from just a few examples. To this end, we propose an intensive-attention network that learns a general matching relationship between the support set and queries, on both the RPN module and the detector. Fig. 4 shows the overall architecture of our network.

In particular, we build a siamese framework with two weight-sharing branches, one for the support set and the other for the query set. The query branch is a Faster R-CNN network, which contains the two stages of RPN and detector. We use this framework to train the matching relationship between support and query features, in order to force the network to learn general knowledge and the ability to capture what is common among objects of the same category. On top of this framework, we introduce a novel attention RPN and a detector with multi-relation modules to produce an accurate parsing between the support and the potential boxes in the query.

4.2.1 Attention-Based Region Proposal Network

In few-shot object detection, the RPN is a useful tool to filter out a large number of potential boxes and reduce the burden on the subsequent detector. The RPN must not only distinguish objects from non-objects, but also filter out objects that do not belong to the support category. However, without any support image information, the RPN will aimlessly activate every potential object with a high objectness score, even those not belonging to the support category. A large number of irrelevant objects will then burden the subsequent classification in the detector.

Type       Filter Shape   Stride/Padding
Avg Pool   3x3x4096       s1/p0
Conv       1x1x512        s1/p0
Conv       3x3x512        s1/p0
Conv       1x1x2048       s1/p0
Avg Pool   3x3x2048       s1/p0

Table 2. Architecture of patch-relation CNN.

On account of this, we propose the Attention RPN, which uses support information to filter out the majority of background boxes and those of non-matching categories. With it, we can generate a smaller set of candidate proposals with high potential to contain the target objects. The framework is shown in Fig. 5.

We introduce support information into the RPN through an attention mechanism that guides the RPN to produce relevant proposals and suppress proposals of other categories. We compute the similarity between the support feature map and the query feature map in a depth-wise manner, and the resulting similarity map is then used for proposal generation. In particular, denoting the support features as X ∈ ℝ^{K×K×C} and the query feature map as Y ∈ ℝ^{H×W×C}, we formulate the similarity as follows:

G_{k,l,m} = \sum_{i,j} X_{i,j,m} \cdot Y_{k+i-1,\, l+j-1,\, m}, \quad \text{s.t. } i, j \in \{1, \dots, K\}    (1)

where G is the resulting attention feature map. Here the support feature X is used as a kernel that slides over the query feature map, computing a depth-wise convolution between the two [56]. This procedure can be seen as the support calling attention to matching locations in the query. In our work, we adopt the features of the top layer of the RPN backbone, i.e. res4-6 in ResNet50. We find that a kernel size of K = 1 performs well in our case, which is consistent with the experience in [22] that a global feature can provide a good object prior for objectness classification. In this case, the kernel is computed by averaging over the support feature map. The attention map is followed by a 3 × 3 convolution and then the ROI pooling and objectness classification layers. A minimal sketch of this computation is given below.
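The following is a minimal PyTorch sketch of Eq. 1 with the K = 1 averaged kernel described above; the function name and the single-image tensor shapes are our own illustrative choices, not the paper's code:

```python
import torch.nn.functional as F

def attention_feature(support_feat, query_feat):
    """Depth-wise cross correlation (Eq. 1) between an average-pooled
    support feature and the query feature map.

    support_feat: (C, Ks, Ks) support feature map from the backbone
    query_feat:   (C, H, W)   query feature map from the shared backbone
    """
    C = support_feat.shape[0]
    # Average-pool the support feature to a 1x1xC kernel (K = 1).
    kernel = support_feat.mean(dim=(1, 2)).view(C, 1, 1, 1)
    # groups=C makes the convolution depth-wise, i.e. per-channel.
    G = F.conv2d(query_feat.unsqueeze(0), kernel, groups=C)
    return G  # (1, C, H, W) attention feature fed to the RPN
```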

4.2.2 Multi-Relation Detector

In an R-CNN framework, the RPN is followed by a detector, which plays the important role of re-scoring proposals and recognizing classes. Therefore, we want a detector with strong discriminative ability to distinguish different categories. To this aim, we propose a novel multi-relation detector to effectively measure the similarity between proposal boxes from the query and the support objects.


Figure 4. Our network architecture uses ResNet-50 as the backbone. The support image (in green) and query image (in blue) are fed into the weight-shared backbone. The RPN uses the attention feature generated by the depth-wise cross correlation between the compact 1 × 1 × C support feature and the H × W × C query feature. The class scores generated by the patch-relation head (top), the global-relation head (middle) and the local-correlation head (bottom) are added together as the final matching score, and the bounding box predictions are generated by the patch-relation head.

Figure 5. Attention RPN. The support feature is average-pooled to a 1 × 1 × C vector, and its depth-wise cross correlation with the query feature is computed; the output is used as the attention feature and fed into the RPN to generate proposals.

The detector includes three attention modules: the patch-relation head, which learns a deep non-linear metric for patch matching; the global-relation head, which learns a deep embedding for global matching; and the local-correlation head, which learns the pixel-wise and depth-wise cross correlation between support and query proposals. We experimentally show that the three matching modules complement each other and that performance increases as they are added one by one. We introduce the details of our multi-relation detector below.

• In the patch-relation head, we first concatenate the support and query proposal feature maps in depth. The combined feature map is then fed into the patch-relation module, whose structure is shown in Table 2. All convolution and pooling layers in this module use zero padding, reducing the feature map from 7 × 7 to 1 × 1, which serves as the input to the binary classification and regression heads. This module is compact and efficient. We explored variants of this structure and found that replacing the two average poolings with convolutions does not improve the model further.

• The global-relation head extends the patch relation to model the global-embedding relation between the support and query proposals. Given the concatenated feature of a support and its query proposal, we average-pool the feature to a vector of size 1 × 1 × 2C. We then use an MLP with two fully connected (fc) layers, each followed by ReLU, and a final fc layer to generate matching scores.

• The local-correlation head computes the pixel-wise and depth-wise similarity between the support ROI feature and the proposal feature, similar to Eq. 1. Different from Eq. 1, we take the dot product of the feature pair at the same depth. In particular, we first use a weight-shared 1 × 1 × C convolution to process the support and query features individually, then compute a depth-wise similarity feature of size 1 × 1 × C. Finally, a subsequent fc layer generates the matching scores.

We use only the patch-relation head to generate bounding box predictions, i.e. regression on box coordinates, and use the sum of the matching scores from the three heads as the final matching score. Intra-class variance and imperfect proposals make the relation between proposals and support objects complex. Our three relation heads have different properties and together handle this complexity well: the patch-relation head generates a flexible embedding able to match across intra-class variance, the global-relation head provides stable and general matching, and the local-correlation head enforces matching on parts. A sketch of the three heads is given below.
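The following is a minimal PyTorch sketch of the three relation heads as described above. The patch-relation stack follows Table 2 (assuming C = 2048 is the ROI feature depth), while the ReLU placements, MLP width and two-way score outputs are our own assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PatchRelationHead(nn.Module):
    """Pool/conv stack of Table 2; with no padding it shrinks 7x7 to 1x1."""
    def __init__(self, C=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.AvgPool2d(3, stride=1),            # 7x7x2C -> 5x5x2C
            nn.Conv2d(2 * C, 512, 1), nn.ReLU(),  # 5x5x512
            nn.Conv2d(512, 512, 3), nn.ReLU(),    # 3x3x512
            nn.Conv2d(512, C, 1), nn.ReLU(),      # 3x3xC
            nn.AvgPool2d(3, stride=1),            # 1x1xC
        )
        self.cls = nn.Linear(C, 2)  # binary matching score
        self.reg = nn.Linear(C, 4)  # box regression (this head only)

    def forward(self, s, q):  # s, q: (N, C, 7, 7) ROI features
        x = self.net(torch.cat([s, q], dim=1)).flatten(1)
        return self.cls(x), self.reg(x)

class GlobalRelationHead(nn.Module):
    """Average-pool the concatenated features to 1x1x2C, then an MLP."""
    def __init__(self, C=2048, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * C, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))

    def forward(self, s, q):
        x = torch.cat([s, q], dim=1).mean(dim=(2, 3))  # (N, 2C)
        return self.mlp(x)

class LocalCorrelationHead(nn.Module):
    """Weight-shared 1x1 conv, depth-wise dot product, then an fc layer."""
    def __init__(self, C=2048):
        super().__init__()
        self.conv = nn.Conv2d(C, C, 1)
        self.fc = nn.Linear(C, 2)

    def forward(self, s, q):
        s, q = self.conv(s), self.conv(q)
        sim = (s * q).sum(dim=(2, 3))  # (N, C) depth-wise similarity
        return self.fc(sim)

def multi_relation(heads, s, q):
    """Final score is the sum of the three heads' matching scores;
    box regression comes only from the patch-relation head."""
    cls, box = heads['patch'](s, q)
    score = cls + heads['global'](s, q) + heads['local'](s, q)
    return score, box
```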


4.3. Training Strategy

Few-shot object detection differs from classification because of the massive number of background proposals. The model not only needs to distinguish different categories, but also needs to separate foreground from background for a target category, and background proposals usually dominate the training. For this reason, besides the basic single-way training, we propose a novel multi-way training strategy with proposal ranking to address this problem.

Single-way Training. To learn the siamese-based detection model, we construct the training episode in the following steps. We randomly choose a query image q_k and a separate support image s_k containing an object of the same k-th category, forming an image pair (q_k, s_k). In this pair, only the k-th category objects in the query image are labeled as foreground, and all other objects are treated as background. During training, we learn to match every proposal generated by the Attention-RPN in the query image with the object in the support image.

Multi-way Training with Ranking Proposal. We propose multi-way training to distinguish different categories. In the single-way training strategy, the training episode consists of a (q_k, s_k) image pair, and the model learns to match objects of the same category, mainly distinguishing foreground from background for one category at different IoUs. We instead construct an image triplet (q_k, s_k, s_m), where m ≠ k. With this triplet, the model not only needs to distinguish foreground from background for the target category at different IoUs within (q_k, s_k), but also to distinguish objects of different categories between (q_k, s_m). However, (q_k, s_m) contains only background, which causes an imbalance between foreground and background proposals and harms the matching training procedure. We propose proposal ranking to balance the foreground and background losses. First, we pick all foreground proposals (N foreground proposals in total) and compute the foreground loss. Then we compute the matching scores of all background proposals and rank them by their scores. We select the top 2N background proposals for the (q_k, s_k) pair and the top N background proposals for the (q_k, s_m) pair, and compute the background loss on them. The final matching loss is the sum of the foreground loss and the background loss, as sketched below.
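The following is a minimal sketch of this loss balancing; the tensor names and the use of binary cross-entropy are our assumptions, as the paper only specifies the top-2N / top-N background selection:

```python
import torch
import torch.nn.functional as F

def ranked_matching_loss(pos_scores, pos_labels, neg_scores):
    """pos_scores: (P,) matching logits for proposals in the (q_k, s_k) pair
    pos_labels: (P,) 1 for foreground proposals, 0 for background
    neg_scores: (Q,) matching logits for the (q_k, s_m) pair, m != k,
                where every proposal is background."""
    fg = pos_labels == 1
    n = int(fg.sum())  # N foreground proposals
    fg_loss = F.binary_cross_entropy_with_logits(
        pos_scores[fg], torch.ones_like(pos_scores[fg]))

    # Rank backgrounds by score and keep the hardest ones:
    # top 2N from the positive pair, top N from the negative pair.
    bg_pos = pos_scores[~fg].topk(min(2 * n, int((~fg).sum()))).values
    bg_neg = neg_scores.topk(min(n, neg_scores.numel())).values
    bg = torch.cat([bg_pos, bg_neg])
    bg_loss = F.binary_cross_entropy_with_logits(bg, torch.zeros_like(bg))

    return fg_loss + bg_loss
```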

5. Experiments

We train our model on the FSOD training set (800 categories) and evaluate it on the test set (200 categories), using mean average precision (mAP) at IoU thresholds of 0.5 and 0.75. We also apply our approach to wild car detection on the KITTI dataset [50] to demonstrate its generalization ability.

We further apply our approach to wild penguin detection [57]; because this dataset has no bounding box annotations, we show qualitative 5-shot detection results in Fig. 10. All experiments are implemented in PyTorch.

5.1. Training Details

The model is trained end-to-end on 4 Tesla P40 GPUs using SGD with a weight decay of 0.0001 and momentum of 0.9. The learning rate is 0.002 for the first 56,000 iterations and 0.0002 for the following 4,000 iterations. We take advantage of a model whose backbone, ResNet50, is pretrained on [14, 9]; since our test set has no category overlap with these datasets, this is safe. During training, we find that more training iterations damage performance, presumably because too many iterations over-fit the model to the training set. We therefore fix the Res1-3 blocks and only train the higher-level layers, which utilizes low-level basic features while avoiding over-fitting. The query image is resized so that its shorter edge is 600 pixels, with the longer edge capped at 1000 pixels. The support image is cropped around the target object with 16 pixels of image context, then resized and zero-padded to a square image of 320 × 320 (a sketch is given below). For few-shot training and testing, we fuse features by averaging the object features before feeding them to the RPN attention module and the multi-relation detector.
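The following is a sketch of the support preprocessing described above; the exact crop and resize policy (rounding, interpolation mode, padding placement) is not specified in the paper, so those details here are assumptions:

```python
import torch
import torch.nn.functional as F

def crop_support(image, box, context=16, size=320):
    """Crop around the target box with `context` pixels of image context,
    resize the longer side to `size`, and zero-pad to a square.

    image: (C, H, W) float tensor; box: (x1, y1, x2, y2) integer coords.
    """
    _, H, W = image.shape
    x1, y1, x2, y2 = box
    crop = image[:, max(0, y1 - context):min(H, y2 + context),
                    max(0, x1 - context):min(W, x2 + context)]
    c, h, w = crop.shape
    scale = size / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    crop = F.interpolate(crop.unsqueeze(0), size=(new_h, new_w),
                         mode='bilinear', align_corners=False)[0]
    out = torch.zeros(c, size, size)  # zero padding
    out[:, :new_h, :new_w] = crop     # place the crop at the top-left
    return out
```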

5.2. Evaluation Protocol

K-way N-shot Evaluation. We propose a K-way N-shot evaluation protocol that matches our training setup and evaluates our approach with standard object detection evaluation tools. For each image in the test set, a test episode is constructed from the test image, N random support images containing its category k, and N support images for each of K − 1 other randomly selected categories. Testing in one episode thus performs a K-way N-shot evaluation; for example, in the 5-way 5-shot setting, each query is detected with 5 supports of its own category and 5 supports of each of the four non-matching categories. A sketch of the episode construction is given below.
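The following is an illustrative sketch of this episode construction; the function and variable names are our own:

```python
import random

def build_test_episode(test_image, category_k, all_categories,
                       supports_by_class, K=5, N=5):
    """Pair a test image with N supports of its own category k and
    N supports for each of K - 1 randomly chosen other categories."""
    others = random.sample(
        [c for c in all_categories if c != category_k], K - 1)
    supports = {c: random.sample(supports_by_class[c], N)
                for c in [category_k] + others}
    return test_image, supports
```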

Realistic Evaluation. We provide another evaluation protocol driven by a common real-world application scenario: massive numbers of images are collected from photo albums or drama series, and their label set is absent. We want to annotate one novel target object in these images, but we do not know which images contain it, nor the size and location of the target object within an image. To reduce the workload, one practicable solution is to find a few images containing the target object, annotate them, and then apply our method to automatically annotate the rest. Following this setting, we perform the evaluation as follows: we mix all test images of the FSOD dataset and, for each object category, pick 5 good images containing the target object to perform 1-way 5-shot testing of that category.


No.       ST PR LC GR AR MC MS   5-shot 1-way       5-shot 5-way
                                 AP50    AP75       AP50    AP75
I [22]                           0.414   0.267      0.273   0.179
II        X                      0.619   0.438      0.333   0.244
III [8]   X  X                   0.686   0.487      0.429   0.316
IV        X  X                   0.689   0.484      0.465   0.340
V         X  X                   0.681   0.490      0.457   0.339
VI        X  X  X                0.682   0.478      0.489   0.353
VII       X  X  X  X             0.693   0.480      0.493   0.351
VIII      X  X  X  X             0.690   0.488      0.504   0.367
IX        X  X  X  X  X          0.698   0.491      0.506   0.367
X         X  X  X  X  X  X       0.684   0.457      0.594   0.399
XI        X  X  X  X  X  X  X    0.686   0.452      0.616   0.408

Table 3. Our few-shot results on the FSOD dataset. ST: siamese training, PR: patch-relation head, LC: local-correlation head, GR: global-relation head, AR: attention RPN, MC: multi-way (multi-class) training, MS: multi-shot training.

Figure 6. Comparison with related work under the realistic evaluation protocol. AP50 is cumulatively averaged over the top-k categories, where good supports are used in testing.

Note that, for each object category, we evaluate all images in the test set regardless of their content, which is in effect an all-way 5-shot test for each image. This evaluation is very challenging: the target images account for only 1.4% of the test set on average.

5.3. Results on FSOD Dataset

We evaluate our model and the baselines on the FSOD dataset. In addition, we perform an ablation study by incrementally adding the different proposed modules to our model. The experimental results in Table 3 show that our model beats the baselines and that all the proposed modules contribute to the final performance.

5.3.1 Comparison with benchmarks

We compare with related works in Table 3, where setting I is an application of Faster R-CNN [22] that compares the detector fc features between query and support proposals (II is its siamese-training version), and III mimics Relation Network [8] with implementation details adjusted for detection. Our method surpasses these methods by a large margin, especially in the 5-way evaluation.

LSTD [58] differs from ours in framework and application, and needs to be trained on new categories by transferring knowledge from a source domain to the target domain. Our method, in contrast, can be applied to new categories without any further re-training or finetuning; this is a fundamental difference between our approach and LSTD. For an empirical comparison, we adapt LSTD to build on Faster R-CNN and re-train it on 5 fixed supports for each test category separately, in a fair configuration. Results are shown in Fig. 6. We achieve 0.480 AP50 on 50 categories, which again demonstrates the effectiveness of our approach. Our method beats LSTD by 2.93% (27.15% vs. 24.22%) and its backbone Faster R-CNN by 4.11% (27.15% vs. 23.04%) over all 200 test categories. More specifically, without pre-training on our dataset, Faster R-CNN's AP50 decreases dramatically, which demonstrates the effectiveness of our dataset.

5.3.2 Ablation study

Multi-Relation Detector. Our proposed relation heads all perform well in the 5-shot evaluation. Viewing each head individually, the patch-relation head performs better in the 1-way evaluation, while the local-correlation head performs better in the 5-way evaluation. This indicates that the patch-relation head is good at distinguishing an object from the background, while the local-correlation head has the advantage of distinguishing non-matching objects from the target. By adding the global-relation head, the combination of all three relation heads achieves the best results. This demonstrates that the three heads are complementary and together can better differentiate targets from the background and from non-matching categories.


Figure 7. Average recall over different IoUs for the top 100 proposals generated by the RPN.

Qualitative 5-shot object detection results on our test set areshown in Fig. 9.

Attention RPN. Comparing the models with the attention RPN against those with the regular RPN (Model VI vs. VII, Model VIII vs. IX) in Table 3, the attention RPN achieves better performance, especially in the 1-way evaluation. To evaluate proposal quality, we calculate the average recall over different IoUs for the top 100 proposals. From Fig. 7 we can see that the attention RPN generates more high-quality proposals (higher IoUs) and benefits from few-shot supports. We also evaluate the average best overlap ratio (ABO) across ground truth boxes for the two RPNs (a sketch of this metric is given below): the ABO of the attention RPN is 0.7184/0.7242 (1-shot/5-shot), while the regular RPN achieves only 0.7076. The best performance is obtained by the model combining the multi-relation detector with the attention RPN. All these results show that our attention RPN generates better proposals and benefits the final detection predictions.
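For reference, the following is a small sketch of the ABO metric as we understand it (mean over ground-truth boxes of the best IoU achieved by any proposal), using torchvision's box_iou:

```python
from torchvision.ops import box_iou

def average_best_overlap(gt_boxes, proposals):
    """gt_boxes: (G, 4), proposals: (P, 4), both in (x1, y1, x2, y2)."""
    ious = box_iou(gt_boxes, proposals)   # (G, P) pairwise IoUs
    return ious.max(dim=1).values.mean()  # best IoU per GT box, averaged
```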

Multi-way Training Strategy. We train our model with the multi-way strategy with proposal ranking and obtain an 8.8% AP50 improvement in the 5-way 5-shot evaluation compared with the single-way strategy. This shows the importance of learning to distinguish different categories during training. With 5-shot training, we achieve further improvements, consistent with the observation in [1] that few-shot training is beneficial for few-shot testing.

5.3.3 Dataset with more categories or more samples?

Our proposed dataset has massive object categories with few images per category, which is a beneficial property for few-shot object detection. To confirm this view, we train our model on the MS COCO dataset, which has more than 120,000 images but only 80 categories, and on different training splits of the FSOD dataset ranging from 200 to 600 categories. When splitting the FSOD dataset, we ensure that all splits have a similar number of images, so that the results are affected only by the number of categories. We present the experimental results in Fig. 8.

Figure 8. The effect of the number of categories on model performance. The circle area is proportional to the number of training images.

From this analysis, we find that although MS COCO has the most training images, the resulting model performs worse, while models trained on the FSOD dataset perform better as the number of categories increases. This indicates that few categories with many images can impede few-shot object detection, while a large number of categories consistently brings benefits. In all, category diversity is essential for few-shot object detection, and performance keeps improving as the number of categories grows.

5.4. Results on KITTI

Our approach is general and can easily be extended to other applications. We evaluate our approach on wild car detection on KITTI [50], an urban scene dataset for driving scenarios whose images are captured by a car-mounted video camera. We evaluate on the KITTI training set with 7481 images. Because the car category is in the COCO dataset, we discard the COCO-pretrained model and only use the ImageNet-pretrained model. [59] uses massive annotated data from the Cityscapes [60] dataset to train Faster R-CNN and evaluates on KITTI. Without any further re-training or finetuning, our model obtains comparable performance (49.7% vs. 53.5%) on wild car detection.

6. Conclusion

In this paper, we have introduced a novel few-shot object detection network with an Attention RPN and a multi-relation module, which fully exploits the similarity between the few-shot support set and the test set for novel object detection. To train our model, we have also prepared a new dataset that contains 1000 categories of various objects with high-quality annotations. Owing to the matching strategy, our model can detect objects of novel categories online, requiring no pre-training or further network adaptation. Our model is validated with quantitative and qualitative results on different datasets. In all, we introduce a general model and establish a few-shot object detection dataset, which we hope will greatly benefit research and applications in this field.


Figure 9. Qualitative 5-shot detection results of our approach on the FSOD test set. We recommend viewing the figure zoomed in.

Figure 10. Our application results on the penguin dataset [57]. Given 5 penguin images as supports, our approach can detect all penguins in the wild in the given query image.

References

[1] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.

[2] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[3] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[4] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

[6] Qi Cai, Yingwei Pan, Ting Yao, Chenggang Yan, and Tao Mei. Memory matching networks for one-shot image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4080–4088, 2018.

[7] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.

[8] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018.

[9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[10] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

[11] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), volume 1, pages 886–893. IEEE Computer Society, 2005.

[12] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[13] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE CVPR 2001, pages 905–910, 2001.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[15] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[16] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.

[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[18] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[19] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018.

[20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[21] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[23] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[24] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: Efficient multi-scale training. In Advances in Neural Information Processing Systems, pages 9333–9343, 2018.

[25] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, pages 640–646, 1996.

[26] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

[27] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.

[28] Brenden M. Lake, Ruslan R. Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pages 2526–2534, 2013.

[29] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[30] Alex Wong and Alan L. Yuille. One shot learning via compositions of meaningful patches. In Proceedings of the IEEE International Conference on Computer Vision, pages 1197–1205, 2015.

[31] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 719–729, 2018.

[32] Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2255–2265, 2017.

[33] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.

[34] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2554–2563. JMLR.org, 2017.

[35] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.

[36] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. arXiv preprint arXiv:1812.01866, 2018.

[37] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.

[38] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Gao Yang, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. In CVPR, 2019.

[39] Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, and Andrei Bursuc. Dense classification and implanting for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9258–9267, 2019.


[40] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D. Yoo. Edge-labeling graph neural network for few-shot learning. In Computer Vision and Pattern Recognition (CVPR), 2019.

[41] Spyros Gidaris and Nikos Komodakis. Generating classification weights with GNN denoising autoencoders for few-shot learning. In CVPR, 2019.

[42] Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang. Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–10, 2019.

[43] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.

[44] Nanqing Dong and Eric P. Xing. Few-shot semantic segmentation with prototype learning. In BMVC, volume 3, page 4, 2018.

[45] Claudio Michaelis, Matthias Bethge, and Alexander S. Ecker. One-shot segmentation in clutter. In ICML, 2018.

[46] Tao Hu, Pengwan, Chiliang Zhang, Gang Yu, Yadong Mu, and Cees G. M. Snoek. Attention-based multi-context guiding for few-shot semantic segmentation. In AAAI, 2019.

[47] Liang-Yan Gui, Yu-Xiong Wang, Deva Ramanan, and Jose M. F. Moura. Few-shot human motion prediction via meta-learning. In ECCV, 2018.

[48] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In AAAI, 2018.

[49] Xuanyi Dong, Liang Zheng, Fan Ma, Yi Yang, and Deyu Meng. Few-example object detection with model communication. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[50] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[51] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[52] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.

[53] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[54] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[55] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[56] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. arXiv preprint arXiv:1812.11703, 2018.

[57] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In European Conference on Computer Vision, 2016.

[58] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. arXiv preprint arXiv:1803.01529, 2018.

[59] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive Faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3339–3348, 2018.

[60] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Appendix A: More Implementation Details

Here we give more experimental details. In our experiments, we use the top 100 proposals generated by the RPN to predict the final object detection results. In Section 5.3.3 of the main paper, we present the effect of the number of categories on model performance; in that experiment, we adjust the number of training iterations according to the number of training images. Specifically, the total number of training iterations on the FSOD dataset and the MS COCO dataset is 40,000 and 120,000, respectively. In Section 5.3, we fine-tune our model on the FSOD training set with a few shots and obtain better performance; here we fix all layers and only fine-tune the RPN and our proposed multi-relation detector.

Appendix B: Qualitative Visualization Results

Here we show qualitative results for the ablation study in Fig. 11. From the visual results, we can see the advantage of our model over the baselines and the advantage of our dataset over a benchmark dataset, e.g. MS COCO. We present results on three categories, i.e. blackboard, segway and turban, which are very distinct in appearance from the training samples.


Figure 11. Qualitative results for the ablation study. We visualize the boxes with scores larger than 0.8, or the box with the highest score in (3)(c).

From Fig. 11, we can see that both the novel attention RPN and the multi-relation detector play an important role in accurate detection. For example, in (1-2)(b) the baseline 2 (Model III) fails to detect some targets, while our model with only the multi-relation detector detects more targets and suppresses backgrounds. In (2-3)(d), our full model further corrects the detections and pulls up the scores of true targets by adding the attention RPN, especially in the instance of (3)(d). Comparing the results of our model trained on our dataset with those trained on MS COCO, shown in Fig. 11 (1-3)(a), our model can hardly learn the matching between the support and the query target when trained on MS COCO, but it captures the similarity between the image pair when trained on our dataset. More detections of our full model in the 5-shot setting are shown in Fig. 12 and Fig. 13. Our model performs well even when there are objects of non-support categories in the query image.

Appendix C: FSOD Dataset Class Split

Here we describe the training/testing class split in our proposed FSOD dataset. This split was used in the experiments described in Section 5.3.

L_train =

lipstick, sandal, crocodile, football helmet, umbrella, houseplant, antelope, woodpecker, palm tree, box, swan, miniskirt, monkey, cookie, scissors, snowboard, hedgehog, penguin, barrel, wall clock, strawberry, window blind, butterfly, television, cake, punching bag, picture frame, face powder, jaguar, tomato, isopod, balloon, vase, shirt, waffle, carrot, candle, flute, bagel, orange, wheelchair, golf ball, unicycle, surfboard, cattle, parachute, candy, turkey,

pillow, jacket, dumbbell, dagger, wine glass, guitar, shrimp, worm, hamburger, cucumber, radish, alpaca, bicycle wheel, shelf, pancake, helicopter, perfume, sword, ipod, goose, pretzel, coin, broccoli, mule, cabbage, sheep, apple, flag, horse, duck, salad, lemon, handgun, backpack, printer, mug, snowmobile, boot, bowl, book, tin can, football, human leg, countertop, elephant, ladybug, curtain, wine, van, envelope, pen, doll, bus, flying disc, microwave oven, stethoscope, burrito, mushroom, teddy bear, nail, bottle, raccoon, rifle, peach, laptop, centipede, tiger, watch, cat, ladder, sparrow, coffee table, plastic bag, brown bear, frog, jeans, harp, accordion, pig, porcupine, dolphin, owl, flowerpot, motorcycle, calculator, tap, kangaroo, lavender, tennis ball, jellyfish, bust, dice, wok, roller skates, mango, bread, computer monitor, sombrero, desk, cheetah, ice cream, tart, doughnut, grapefruit, paddle, pear, kite, eagle, towel, coffee, deer, whale, cello, lion, taxi, shark, human arm, trumpet, french fries, syringe, lobster, rose, human hand, lamp, bat, ostrich, trombone, swim cap, human beard, hot dog, chicken, leopard, alarm clock, drum, taco, digital clock, starfish, train, belt, refrigerator, dog bed, bell pepper, loveseat, infant bed, training bench, milk, mixing bowl, knife, cutting board, ring binder, studio couch, filing cabinet, bee, caterpillar, sofa bed, violin, traffic light, airplane, closet, canary, toilet paper, canoe, spoon, fox, tennis racket, red panda, cannon, stool, zucchini, rugby ball, polar bear, bench, pizza, fork, barge, bow and arrow, kettle, goldfish, mirror, snail, poster, drill, tie, gondola, scale, falcon, bull, remote control, horn, hamster, volleyball, stationary bicycle, dishwasher, limousine, shorts, toothbrush, bookcase, baseball glove, computer mouse, otter, computer keyboard, shower, teapot, human foot,


Figure 12. Qualitative 5-shot object detection results on our test set. We visualize the bounding boxes with scores larger than 0.8.


Figure 13. Qualitative results of our 5-shot object detection on the test set. We visualize the bounding boxes with scores larger than 0.8.


parking meter, ski, beaker, castle, mobile phone, suitcase, sock, cupboard, crab, common fig, missile, swimwear, saucer, popcorn, coat, plate, stairs, pineapple, parrot, fountain, binoculars, tent, pencil case, mouse, sewing machine, magpie, handbag, saxophone, panda, flashlight, baseball bat, golf cart, banana, billiard table, tower, washing machine, lizard, brassiere, ant, crown, oven, sea lion, pitcher, chest of drawers, crutch, hippopotamus, artichoke, seat belt, microphone, lynx, camel, rabbit, rocket, toilet, spider, camera, pomegranate, bathtub, jug, goat, cowboy hat, wrench, stretcher, balance beam, necklace, scoreboard, horizontal bar, stop sign, sushi, gas stove, tank, armadillo, snake, tripod, cocktail, zebra, toaster, frying pan, pasta, truck, blue jay, sink, lighthouse, skateboard, cricket ball, dragonfly, snowplow, screwdriver, organ, giraffe, submarine, scorpion, honeycomb, cream, cart, koala, guacamole, raven, drawer, diaper, fire hydrant, potato, porch, banjo, hammer, paper towel, wardrobe, soap dispenser, asparagus, skunk, chainsaw, spatula, ambulance, submarine sandwich, axe, ruler, measuring cup, scarf, squirrel, tea, whisk, food processor, tick, stapler, oboe, hartebeest, modem, shower cap, mask, handkerchief, falafel, clipper, croquette, house finch, butterfly fish, lesser scaup, barbell, hair slide, arabian camel, pill bottle, springbok, camper, basketball player, bumper car, wisent, hip, wicket, medicine ball, sweet orange, snowshoe, column, king charles spaniel, crane, scoter, slide rule, steel drum, sports car, go kart, gearing, tostada, french loaf, granny smith, sorrel, ibex, rain barrel, quail, rhodesian ridgeback, mongoose, red backed sandpiper, penlight, samoyed, pay phone, barber chair, wool, ballplayer, malamute, reel, mountain goat, tusker, longwool, shopping cart, marble, shuttlecock, red breasted merganser, shutter, stamp, letter opener, canopic jar, warthog, oil filter, petri dish, bubble, african crocodile, bikini, brambling, siamang, bison, snorkel, loafer, kite balloon, wallet, laundry cart, sausage dog, king penguin, diver, rake, drake, bald eagle, retriever, slot, switchblade, orangutan, chacma, guenon, car wheel, dandie dinmont, guanaco, corn, hen, african hunting dog, pajama, hay, dingo, meat loaf, kid, whistle, tank car, dungeness crab, pop bottle, oar, yellow lady’s slipper, mountain sheep, zebu, crossword puzzle, daisy, kimono, basenji, solar dish, bell, gazelle, agaric, meatball, patas, swing, dutch oven, military uniform, vestment, cavy, mustang, standard poodle, chesapeake bay retriever, coffee mug, gorilla, bearskin, safety pin, sulphur crested cockatoo, flamingo, eider, picket fence, dhole, spaghetti squash, african elephant, coral fungus, pelican, anchovy pear, oystercatcher, gyromitra, african grey, knee pad, hatchet, elk, squash racket, mallet, greyhound, ram, racer, morel, drumstick, bovine, bullet train, bernese mountain dog, motor scooter, vervet, quince, blenheim spaniel, snipe, marmoset, dodo, cowboy boot, buckeye, prairie chicken, siberian husky, ballpoint,

mountain tent, jockey, border collie, ice skate, button, stuffed tomato, lovebird, jinrikisha, pony, killer whale, indian elephant, acorn squash, macaw, bolete, fiddler crab, mobile home, dressing table, chimpanzee, jack o’ lantern, toast, nipple, entlebucher, groom, sarong, cauliflower, apiary, english foxhound, deck chair, car door, labrador retriever, wallaby, acorn, short pants, standard schnauzer, lampshade, hog, male horse, martin, loudspeaker, plum, bale, partridge, water jug, shoji, shield, american lobster, nailfile, poodle, jackfruit, heifer, whippet, mitten, eggnog, weimaraner, twin bed, english springer, dowitcher, rhesus, norwich terrier, sail, custard apple, wassail, bib, bullet, bartlett, brace, pick, carthorse, ruminant, clog, screw, burro, mountain bike, sunscreen, packet, madagascar cat, radio telescope, wild sheep, stuffed peppers, okapi, bighorn, grizzly, jar, rambutan, mortarboard, raspberry, gar, andiron, paintbrush, running shoe, turnstile, leonberg, red wine, open face sandwich, metal screw, west highland white terrier, boxer, lorikeet, interceptor, ruddy turnstone, colobus, pan, white stork, stinkhorn, american coot, trailer truck, bride, afghan hound, motorboat, bassoon, quesadilla, goblet, llama, folding chair, spoonbill, workhorse, pimento, anemone fish, ewe, megalith, pool ball, macaque, kit fox, oryx, sleeve, plug, battery, black stork, saluki, bath towel, bee eater, baboon, dairy cattle, sleeping bag, panpipe, gemsbok, albatross, comb, snow goose, cetacean, bucket, packhorse, palm, vending machine, butternut squash, loupe, ox, celandine, appenzeller, vulture, crampon, backboard, european gallinule, parsnip, jersey, slide, guava, cardoon, scuba diver, broom, giant schnauzer, gordon setter, staffordshire bullterrier, conch, cherry, jam, salmon, matchstick, black swan, sailboat, assault rifle, thatch, hook, wild boar, ski pole, armchair, lab coat, goldfinch, guinea pig, pinwheel, water buffalo, chain, ocarina, impala, swallow, mailbox, langur, cock, hyena, marimba, hound, knot, saw, eskimo dog, pembroke, sealyham terrier, italian greyhound, shih tzu, scotch terrier, yawl, lighter, dung beetle, dugong, academic gown, blanket, timber wolf, minibus, joystick, speedboat, flagpole, honey, chessman, club sandwich, gown, crate, peg, aquarium, whooping crane, headboard, okra, trench coat, avocado, cayuse, large yellow lady’s slipper, ski mask, dough, bassarisk, bridal gown, terrapin, yacht, saddle, redbone, shower curtain, jennet, school bus, otterhound, irish terrier, carton, abaya, window shade, wooden spoon, yurt, flat coated retriever, bull mastiff, cardigan, river boat, irish wolfhound, oxygen mask, propeller, earthstar, black footed ferret, rocking chair, beach wagon, litchi, pigeon.

L_test =

beer, musical keyboard, maple, christmas tree, hiking equipment, bicycle helmet, goggles, tortoise, whiteboard,


lantern, convenience store, lifejacket, squid, watermelon, sunflower, muffin, mixer, bronze sculpture, skyscraper, drinking straw, segway, sun hat, harbor seal, cat furniture, fedora, kitchen knife, hand dryer, tree house, earrings, power plugs and sockets, waste container, blender, briefcase, street light, shotgun, sports uniform, wood burning stove, billboard, vehicle registration plate, ceiling fan, cassette deck, table tennis racket, bidet, pumpkin, tablet computer, rhinoceros, cheese, jacuzzi, door handle, swimming pool, rays and skates, chopsticks, oyster, office building, ratchet, salt and pepper shakers, juice, bowling equipment, skull, nightstand, light bulb, high heels, picnic basket, platter, cantaloupe, croissant, dinosaur, adhesive tape, mechanical fan, winter melon, egg, beehive, lily, cake stand, treadmill, kitchen & dining room table, headphones, wine rack, harpsichord, corded phone, snowman, jet ski, fireplace, spice rack, coconut, coffeemaker, seahorse, tiara, light switch, serving tray, bathroom cabinet, slow cooker, jalapeno, cartwheel, laelia, cattleya, bran muffin, caribou, buskin, turban, chalk, cider vinegar, bannock, persimmon, wing tip, shin guard, baby shoe, euphonium, popover, pulley, walking shoe, fancy dress, clam, mozzarella, peccary, spinning rod, khimar, soap dish, hot air balloon, windmill, manometer, gnu, earphone, double hung window, conserve, claymore, scone, bouquet, ski boot, welsh poppy, puffball, sambuca, truffle, calla lily, hard hat, elephant seal, peanut, hind, jelly fungus, pirogi, recycling bin, in line skate, bialy, shelf bracket, bowling shoe, ferris wheel, stanhopea, cowrie, adjustable wrench, date bread, o ring, caryatid, leaf spring, french bread, sergeant major, daiquiri, sweet roll, polypore, face veil, support hose, chinese lantern, triangle, mulberry, quick bread, optical disk, egg yolk, shallot, strawflower, cue, blue columbine, silo, mascara, cherry tomato, box wrench, flipper, bathrobe, gill fungus, blackboard, thumbtack, longhorn, pacific walrus, streptocarpus, addax, fly orchid, blackberry, kob, car tire, sassaby, fishing rod, baguet, trowel, cornbread, disa, tuning fork, virginia spring beauty, samosa, chigetai, blue poppy, scimitar, shirt button.
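To make the intended use of this split concrete, the following Python sketch shows how disjoint L_train/L_test class sets could drive support-set sampling: training episodes draw only from L_train, while evaluation episodes draw from the unseen L_test classes. The annotation index and function names here are hypothetical, not taken from the paper.

import random
from collections import defaultdict

# Illustrative subsets; in practice these hold the full class lists above.
L_TRAIN = {"lipstick", "sandal", "crocodile"}
L_TEST = {"beer", "musical keyboard", "maple"}
assert L_TRAIN.isdisjoint(L_TEST)  # novel test classes are never trained on

# Hypothetical annotation index: class name -> ids of images containing it.
annotations = defaultdict(list)

def sample_support_set(classes, k_shot=5):
    """Draw one target class and up to k_shot support images for an episode."""
    category = random.choice(sorted(classes))
    images = annotations[category]
    supports = random.sample(images, min(k_shot, len(images)))
    return category, supports

# Training: sample_support_set(L_TRAIN); evaluation: sample_support_set(L_TEST).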