Semantic Label Sharing for Learning with Many Categories

Anonymous ECCV submission
Paper ID 747
Abstract. In an object recognition scenario with tens of thousands of categories, even a small number of labels per category leads to a very large number of total labels required. We propose a simple method of label sharing between semantically similar categories. We leverage the WordNet hierarchy to define semantic distance between any two categories and use this semantic distance to share labels. Our approach can be used with any classifier. Experimental results on a range of datasets, up to 80 million images and 75,000 categories in size, show that despite the simplicity of the approach, it leads to significant improvements in performance.
1 Introduction
Large image collections on the Internet and elsewhere contain a multitude of scenes and objects. Recent work in computer vision has explored the problems of visual search and recognition in this challenging environment. However, all approaches require some amount of hand-labeled training data in order to build effective models. Working with large numbers of images creates two challenges: first, labeling a representative set of images and, second, developing efficient algorithms that scale to very large databases.

Labeling Internet imagery is challenging in two respects. First, the sheer number of images means that the labels will only ever cover a small fraction of them. Recent collaborative labeling efforts such as Peekaboom, LabelMe, and ImageNet [2–4] have gathered millions of labels at the image and object level. However, this is but a tiny fraction of the estimated 10 billion images on Facebook, let alone the hundreds of petabytes of video on YouTube. Second, the diversity of the data means that many thousands of classes will be needed to give an accurate description of the visual content. Current recognition datasets use tens to hundreds of classes, which give a hopelessly coarse quantization of images into discrete categories. The richness of our visual world is reflected by the enormous number of nouns present in our language: English has around 70,000 that correspond to actual objects [5, 6]. This figure loosely agrees with the 30,000 visual concepts estimated by psychologists [7]. Furthermore, having a huge number of classes dilutes the available labels, meaning that, on average, there will be relatively few annotated examples per class (and many classes might not have any annotated data).
[Figure 1 panels: “Turboprop” search-engine result and re-ranked images; “Pony” search-engine result and re-ranked images.]
Fig. 1. Two examples of images from the Tiny Images database [1] being re-ranked by our approach, according to the probability of belonging to the categories “pony” and “turboprop” respectively. No training labels were available for either class. However, 64,185 images from the total of 80 million were labeled, spread over 386 classes, some of which are semantically close to the two categories. Using these labels in our semantic label sharing scheme, we can dramatically improve search quality.
To illustrate the challenge of obtaining high quality labels in the scenario of many categories, consider the CIFAR-10 dataset constructed by Alex Krizhevsky and Geoff Hinton [8]. This dataset provides human labels for a subset of the Tiny Images [1] dataset, which was obtained by querying Internet search engines with over 70,000 search terms. To construct the labels, Krizhevsky and Hinton chose 10 classes: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship”, “truck”, and for each class they used the WordNet hierarchy to construct a set of hyponyms. The labelers were asked to examine all the images which were found with a search term that is a hyponym of the class. As an example, some of the hyponyms of ship are “cargo ship”, “ocean liner”, and “frigate”. The labelers were instructed to reject images which did not belong to their assigned class. Using this procedure, labels on a total of 386 categories (hyponyms of the 10 classes listed above) were collected, at a cost of thousands of dollars.

Despite the high cost of obtaining these labels, the 386 categories are of course a tiny subset of the possible labels in the English language. Consider for example the words “pony” and “turboprop” (Fig. 1). Neither of these is considered a hyponym of the 10 classes mentioned above. Yet there is obvious information in the labeled data for “horse” and “airplane” that we would like to use to improve the search engine results for “pony” and “turboprop”.

In this paper, we provide a very simple method for sharing labels between categories. Our approach is based on a basic assumption: we expect the classifier output for a single category to degrade gracefully with semantic distance. In other words, although horses are not exactly ponies, we expect a classifier
for “pony” to give higher values for “horses” than for “airplanes”. Our scheme, which we call “Semantic Label Sharing”, gives the performance shown in Fig. 1. Even though we have no labels for “pony” and “turboprop” specifically, we can significantly improve the performance of search engines by using label sharing.
1.1 Related Work
Various recognition approaches have been applied to Internet data, with the aim of re-ranking or refining the output of image search engines. These include Li et al. [9], Fergus et al. [10], Berg et al. [11], and Vijayanarasimhan et al. [12], amongst others. Our approach differs in two respects: (i) these approaches treat each class independently; (ii) they are not designed to scale to the billions of images on the web.
Sharing information across classes is a widely explored concept in vision and learning, and takes many different forms. Some of the first approaches applied to object recognition are based on neural networks, in which sharing is achieved via the hidden layers that are common across all tasks [13, 14]. Error-correcting output codes [15] are another way of combining multi-class classifiers to obtain better performance. Another set of approaches tries to transfer information from one class to another by regularizing the parameters of the classifiers across classes. Torralba et al. [16] and Opelt et al. [17] demonstrated the power of sharing useful features between classes within a boosting framework. Other approaches transfer information across object categories by sharing a common set of parts [18, 19], by sharing transformations across different instances [20–22], or by sharing a set of prototypes [23]. Common to all these approaches is that the experiments are always performed with relatively few classes. Furthermore, it is not clear how these techniques would scale to very large databases with thousands of classes.
Our sharing takes a different form from these approaches, in that we impose sharing on the class labels themselves, rather than on the features or parameters of the model. As such, our approach has the advantage that it is independent of the choice of classifier.
2 Semantic Label Sharing
In order to compute the semantic distance between two classes we use a tree defined by Wordnet (see Fig. 3(c)).¹ The semantic similarity S_ij between classes i and j (which are nodes in the tree) is defined as the number of nodes shared by their two parent branches, divided by the length of the longer of the two branches. For instance, the semantic similarity between “felis domesticus” and “tabby cat” is 0.93, while that between “felis domesticus” and “tractor trailer” is 0.21. We construct a sparse semantic affinity matrix A = exp(−κ(1 − S)), with κ = 10 for all the experiments in this paper. For the class “airbus”, the nearest semantic classes are: “airliner” (0.49), “monoplane” (0.24), “dive bomber” (0.24), “twinjet” (0.24), “jumbo jet” (0.24), and “boat” (0.03). A visualization of A and a closeup are shown in Fig. 3(a) and (b).

¹ Wordnet is graph-structured; we convert it into a tree by taking the most common sense of each word.
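As a concrete illustration, here is a minimal sketch of the construction, assuming each class comes with its root-to-node path in the WordNet tree. The toy hierarchy and helper names are ours for illustration, not the authors' code:

```python
import numpy as np

def semantic_similarity(path_i, path_j):
    """S_ij: number of tree nodes shared by the two root-to-node
    paths, divided by the length of the longer path."""
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    return shared / max(len(path_i), len(path_j))

def affinity_matrix(paths, kappa=10.0):
    """Semantic affinity A = exp(-kappa * (1 - S))."""
    C = len(paths)
    S = np.empty((C, C))
    for i in range(C):
        for j in range(C):
            S[i, j] = semantic_similarity(paths[i], paths[j])
    return np.exp(-kappa * (1.0 - S))

# Toy root-to-leaf paths for three classes.
paths = [
    ["entity", "animal", "feline", "felis_domesticus"],
    ["entity", "animal", "feline", "tabby_cat"],
    ["entity", "artifact", "vehicle", "tractor_trailer"],
]
A = affinity_matrix(paths)
```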
Let us assume we have a total of C classes, so A is a C × C symmetric matrix. We are given L labeled examples in total, distributed over these C classes. The labels for class c are represented by a binary vector y_c of length N (the total number of images, introduced in Section 3) which has value 1 for positive hand-labeled examples and 0 otherwise. Hence positive examples for class c are regarded as negative labels for all other classes. Y = {y_1, . . . , y_C} is an N × C matrix holding the label vectors from all classes.
We share labels between classes by replacing Y with Y A. This simple operation has a number of effects (see the code sketch after this list):

– Positive examples are copied between classes, weighted according to their semantic affinity. For example, the label vector for “felis domesticus” previously had zero values for the images of “tabby cat”, but now these elements are replaced by the value 0.93.
– However, labels from unrelated classes will only deviate slightly from their original state of 0 (dependent on the value of κ).
– Negative labeled examples from classes outside the set of C are unaffected by A (since they are 0 across all rows of Y).
– Even if each class has only a few labeled examples, the multiplication by A will effectively pool examples across semantically similar classes, dramatically increasing the number that can be used for training, provided semantically similar classes are present amongst the set of C.
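Continuing the toy example above, the operation itself is a single matrix product (label placements invented for illustration):

```python
import numpy as np

# 6 images, 3 classes (0: felis domesticus, 1: tabby cat,
# 2: tractor trailer); Y[i, c] = 1 iff image i is a positive
# hand-label for class c.
Y = np.zeros((6, 3))
Y[0, 0] = 1  # image 0: "felis domesticus"
Y[3, 1] = 1  # image 3: "tabby cat"
Y[5, 2] = 1  # image 5: "tractor trailer"

Y_shared = Y @ A  # semantic label sharing
# Image 3 now carries a soft positive label for "felis domesticus",
# weighted by the two classes' affinity, while its entry for
# "tractor trailer" stays near zero.
```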
The effect of this operation is illustrated in two examples on toy data, shown in Fig. 2. These examples show that good classifiers can be trained by sharing labels between classes, given knowledge of the inter-class affinities, even when no labels are given for the target class. In Fig. 2, there are 9 classes but label data is only given for 7 of them. In addition to the labels, the system also has access to the affinities among the 9 classes. This information is enough to build classification functions for the classes with no labels (Fig. 2(d) and (f)).
From another perspective, our sharing mechanism turns the original classification problem into a regression problem: the formerly binary labels in Y become real values in Y A. As such, we can adapt many types of classifiers to minimize regression error rather than classification error.
3 Sharing in Semi-Supervised Learning
Semi-supervised learning is an attractive option in settings where very few training examples exist, since the density of the data can be used to regularize the solution. This can help prevent over-fitting the few training examples and yield superior solutions. A popular class of semi-supervised algorithms is based on the graph Laplacian, and we use an approach of this type.
[Figure 2 panels: (a) data; (b) labeled examples; (c) and (e) weighted training points; (d) and (f) f functions; color scale 0–1.]
Fig. 2. Toy data illustrating our sharing mechanism between 9 different classes (a) in discrete clusters. For 7 of the 9 classes, a few examples are labeled (b). No labels exist for classes 3 and 5. (c): Labels re-weighted by affinity to class 3 (red = high affinity, blue = low affinity). (d): The semi-supervised learning solution f_class=3 using the weighted labels from (c). The value of the function f_class=3 on each sample from (a) is color coded; dark red corresponds to the samples most likely to belong to class 3. (e): Labels re-weighted by affinity to class 5. (f): The semi-supervised learning solution f_class=5 using the weighted labels from (e).
[Figure 3: (c) Wordnet sub-tree rooted at “entity”, splitting into an animate branch (“being”, “fauna”, …, down to leaves such as “tabby cat”, “Equus caballus”, “Emu”) and an artefact branch (“vehicle”, “aircraft”, …, down to leaves such as “twinjet”, “tractor trailer”, “fire truck”); (a) affinity matrix visualization with color scale 0–1; (b) closeup over 10 classes: “true cat”, “gelding”, “motorcar”, “camion”, “cargo ship”, “lippizaner”, “patrol car”, “stealth bomber”, “jetliner”, “van”.]
Fig. 3. Wordnet sub-tree for a subset of the 386 classes used in our experiments. The associated semantic affinity matrix A is shown in (a), along with a closeup of 10 randomly chosen rows and columns in (b).
We briefly describe semi-supervised learning in a graph setting. In addition to the L labeled examples (X_l, Y_l) = {(x_1, y_1), . . . , (x_L, y_L)} introduced above, we have an additional U unlabeled images X_u = {x_{L+1}, . . . , x_N}, for a total of N images. We form a graph where the vertices are the images X and the edges are represented by an N × N matrix W. The edge weighting is given by W_ij = exp(−‖x_i − x_j‖²/2ε²), the visual affinity between images i and j. Defining D = diag(Σ_j W_ij), we define the normalized graph Laplacian to be L = I − D^{−1/2} W D^{−1/2}. We use L to measure the smoothness of solutions over the data points, desiring solutions that agree with the labels but are also smooth with respect to the graph. In the single-class case we want to minimize:

J(f) = f^T L f + Σ_{i=1}^{L} λ(f_i − y_i)² = f^T L f + (f − y)^T Λ(f − y)    (1)

where Λ is a diagonal matrix whose diagonal elements are Λ_ii = λ if i is a labeled point and Λ_ii = 0 for unlabeled points. The solution is given by solving the N × N linear system (L + Λ)f = Λy.
This system is impractical to solve for large N, so it is common [24–26] to reduce the dimension of the problem by using the smallest k eigenvectors of L (which will be the smoothest) as a basis U with coefficients α: f = Uα. Substituting into Eqn. 1, we find the optimal coefficients α to be the solution of the following k × k system:

(Σ + U^T Λ U)α = U^T Λ y    (2)

where Σ is a diagonal matrix of the smallest k eigenvalues of L. While this system is easy to solve, the difficulty lies in computing the eigenvectors, an O(N²) operation.
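For modest N, Eqn. 2 can be formed and solved directly. The sketch below is illustrative only (not the authors' large-scale implementation): it builds the dense normalized Laplacian, takes its k smallest eigenpairs, and solves the reduced system.

```python
import numpy as np

def ssl_solve(X, y, labeled, k=20, lam=1000.0, eps=1.0):
    """Graph-Laplacian semi-supervised regression (Eqns. 1-2).
    X: (N, d) image descriptors; y: (N,) real-valued labels
    (e.g. a column of YA); labeled: (N,) boolean mask."""
    N = X.shape[0]
    # Visual affinity W_ij = exp(-||x_i - x_j||^2 / (2 eps^2)).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * eps ** 2))
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    Lap = np.eye(N) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # k smallest eigenpairs (np.linalg.eigh sorts ascending).
    evals, evecs = np.linalg.eigh(Lap)
    Sigma, U = np.diag(evals[:k]), evecs[:, :k]
    Lam = np.diag(np.where(labeled, lam, 0.0))
    # Reduced k x k system: (Sigma + U^T Lam U) alpha = U^T Lam y.
    alpha = np.linalg.solve(Sigma + U.T @ Lam @ U, U.T @ (Lam @ y))
    return U @ alpha  # f = U alpha, one score per image
```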
Fergus et al. [27] introduced an efficient scheme for computing approximate eigenvectors in O(N) time. This approach proceeds by first computing numerical approximations to the eigenfunctions (the limit of the eigenvectors as N → ∞). Then approximations to the eigenvectors are computed via a series of 1D interpolations into the numerical eigenfunctions. The resulting approximate eigenvectors (and associated eigenvalues) can be used in place of U and Σ in Eqn. 2.
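The interpolation step can be pictured with the fragment below. It assumes the 1D eigenfunctions of [27] have already been computed on per-dimension grids; grids and gfuncs are placeholders for those precomputed quantities, not an API of [27]:

```python
import numpy as np

def approx_eigenvectors(X, grids, gfuncs):
    """Approximate eigenvectors via 1D interpolation into
    precomputed numerical eigenfunctions, one set per input
    dimension. X: (N, d) data; grids[j]: increasing 1D grid for
    dimension j; gfuncs[j]: (len(grids[j]), m) eigenfunction
    values sampled on that grid."""
    columns = []
    for j in range(X.shape[1]):
        for m in range(gfuncs[j].shape[1]):
            # Evaluate eigenfunction m of dimension j at each point.
            columns.append(np.interp(X[:, j], grids[j], gfuncs[j][:, m]))
    return np.stack(columns, axis=1)  # used in place of U in Eqn. 2
```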
Extending the above formulation to the multi-class scenario is straightforward. In a multi-class problem, the labels are held in an N × C binary matrix Y, replacing y in Eqn. 2. We then solve for the N × C matrix F using the approach of Fergus et al. Utilizing the semantic sharing from Section 2 is simple: Y is replaced with Y A.
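Putting the pieces together, the multi-class case costs only one extra matrix product before the solve. A sketch reusing the hypothetical helpers above (X, Y, A, and labeled as defined in the earlier fragments):

```python
import numpy as np

Y_shared = Y @ A  # semantic label sharing (Section 2)
# Solve Eqn. 2 once per class; F is N x C.
F = np.stack([ssl_solve(X, Y_shared[:, c], labeled)
              for c in range(Y_shared.shape[1])], axis=1)
# Re-rank images for one query class (hypothetical index).
c_query = 0
ranking = np.argsort(-F[:, c_query])
```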
4 Experiments
We evaluate our sharing framework on two tasks: (a) improving the performance of images returned by Internet search engines; (b) object classification. Note that the first problem consists of a set of 2-class problems (e.g. sort the pony images from the non-pony images), while the second problem is a multi-class classification with many classes.

These tasks are performed on three datasets linked to the Tiny Images database [1], a diverse and highly variable image collection downloaded from the Internet:
– CIFAR: This consists of 63,000 images from 126 classes selected² from the CIFAR-10 dataset [8], which is a hand-labeled subset of the Tiny Images. These keywords and their semantic relationships to one another are shown in Fig. 3. For each keyword, we randomly choose a fixed test set of 75 positive and 150 negative examples, reflecting the typical signal-to-noise ratio found in images from Internet search engines. From the remaining images for each class, we randomly draw a validation set of 25/50 positive/negative examples. The training examples consist of positive/negative pairs drawn from the remaining pool of 100 positive/negative images for each keyword.
² The selected classes were those that had at least 200 positive labels and 300 negative labels, to enable accurate evaluation.
– Tiny: The whole Tiny Images dataset, consisting of 79,302,017 images distributed over 74,569 classes (the keywords used to download the images from the Internet). No human-provided labels are available for this dataset, so instead we use the noisy labels from the image search engines: for each class we assume the first 5 images to be true positive examples. Over the whole dataset, we thus have a total of 372,845 (noisy) positive training examples, and the same number of negative examples (drawn at random). For evaluation, we can use labeled examples from either the CIFAR or High-res datasets.
– High-res: This is a subset of 10,957,654 images from the Tiny Images for which the high-resolution original image exists. These images span 53,564 different classes, distributed evenly over all classes within the Tiny Images dataset. As with the Tiny dataset, we use no hand-labeled examples for training, instead using the first 5 examples of each class as positive examples (and 5 negatives drawn randomly). For evaluation, we use 5,357 human-labeled images, split into 2,569 positive and 2,788 negative examples.
Pre-processing: For all datasets, each image is represented by a single Gist descriptor. For the Tiny and CIFAR datasets, a 384-D descriptor is used, which is then mapped down to 32 and 64 dimensions respectively using PCA. For the High-res dataset, a 512-D Gist descriptor is mapped down to 48-D using PCA.
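The compression step is ordinary PCA on the Gist descriptors; a minimal sketch (only the dimensionalities come from the paper):

```python
import numpy as np

def pca_project(G, out_dim):
    """Project Gist descriptors G (N, D) to out_dim dimensions."""
    Gc = G - G.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Gc, full_matrices=False)
    return Gc @ Vt[:out_dim].T

# 384-D Tiny/CIFAR descriptors -> 32-D / 64-D respectively;
# 512-D High-res descriptors -> 48-D, e.g.:
# X = pca_project(gist, 32)
```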
4.1 Re-ranking experiments
On the re-ranking task, we first use the CIFAR dataset to quantify the effects of semantic sharing. For each class separately we train a classifier on the training set (possibly using sharing) and use it to re-rank the 250 test images, measuring the precision at 15% recall. Unless otherwise stated, the classifier used is the semi-supervised approach of Fergus et al. [27].
In Fig. 4 (left) we explore the effects of semantic sharing, averaging performance over all 126 classes. The validation set is used to automatically select the optimal values of κ and λ. The application of the Wordnet semantic affinity matrix can be seen to help performance. If the semantic matrix is randomly permuted (but with the diagonal fixed to be 1), the result is somewhat worse than not using sharing at all. And if the sharing is inverted (by replacing A with 1 − A and setting the diagonal to 1), it clearly hinders performance. The same pattern of results can be seen in Fig. 4 (right) for a nearest-neighbor classifier. Hence the
[Figure 4: two plots of mean precision at 15% recall, averaged over 126 classes (y-axis, 0.3–0.75), against log₂ number of training pairs (x-axis, −Inf to 7); left: semi-supervised learning; right: nearest neighbors; curves: Wordnet, None, Random, Inverse.]
Fig. 4. Left: Performance of different sharing strategies with the semi-supervised learning approach of [27] as the number of training examples is increased, using 126 classes in the CIFAR dataset. Right: As for (left) but with a nearest-neighbor classifier. The black dashed line indicates chance-level performance. When the Wordnet matrix is used for sharing, it gives a clear performance improvement (red) to both methods over no sharing [27] (green). However, if the semantic matrix does not reflect the similarity between classes, then it hinders performance (e.g. the random (blue) and inverse (magenta) curves).
semantic matrix must reflect the relationship between classes if it is to be effective. In Fig. 5 we show examples of the re-ranking, using the semi-supervised learning scheme in conjunction with the Wordnet affinity matrix.
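The two control conditions above (random and inverse sharing) are simple transformations of A; below is a sketch of one plausible construction, where the exact permutation scheme is our reading of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
C = A.shape[0]

# "Random": jointly permute rows and columns of A, diagonal fixed at 1.
perm = rng.permutation(C)
A_random = A[np.ix_(perm, perm)]
np.fill_diagonal(A_random, 1.0)

# "Inverse": replace A with 1 - A and set the diagonal to 1.
A_inverse = 1.0 - A
np.fill_diagonal(A_inverse, 1.0)
```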
In Fig. 6 (left & middle), we perform a more systematic exploration of the effects of Wordnet sharing. For these experiments we use fixed values of κ = 5 and λ = 1000. Both the number of classes and the number of images are varied, and the performance is recorded with and without the semantic affinity matrix. The sharing gives a significant performance boost, particularly when few training examples are available.
The sharing behavior can be used to effectively learn classes for which we have zero training examples. In Fig. 7, we explore what happens when we allocate 0 training images to one particular class (the left-out class) from the set of 126, while using 100 training pairs for each of the remaining 125 classes. When the sharing matrix is not used, the performance of the left-out class drops significantly, relative to its performance when training data is available (i.e. the point for each left-out class falls below the diagonal). But when sharing is used, the drop in performance is relatively small, with points spread around the diagonal.
Motivated by Fig. 7, in Fig. 1 we show the approach applied to the Tiny dataset, using the human-provided labels from the CIFAR dataset. No CIFAR labels exist for the two classes selected (“pony” and “turboprop”); instead, we used the Wordnet matrix to share labels from semantically similar classes for which labels do exist. The qualitatively good results demonstrated in Fig. 1 can only be obtained relatively close to the 126 keywords for which we have labels.
[Figure 5: image grids for the keywords Airbus, Japanese Spaniel, Ostrich, Deer, Fire truck, Appaloosa, and Honey Eater; top row: initial order; bottom row: output order.]
Fig. 5. Test images from 7 keywords drawn from the 126-class CIFAR dataset. The border of each image indicates its label (used for evaluation purposes only) with respect to the keyword: green = positive, red = negative. The top row shows the initial ranking of the data, while the bottom row shows the re-ranking of our approach trained on 126 classes with 100 training pairs per class.
[Figure 6: left & middle, heat maps of precision (color scale 0.35–0.75) over log₂ number of classes (x-axis, 2–7) and number of training pairs (y-axis, 1–100); right, precision–recall curves comparing No Sharing and Semantic Sharing.]
Fig. 6. Left & Middle: The variation in precision for the semi-supervised approach as the number of training examples is increased, using 126 classes with (left) and without (middle) Wordnet sharing. Note the improvement in performance for small numbers of training examples when the Wordnet sharing matrix is used. Right: Evaluation of our sharing scheme for the re-ranking task on the 10-million-image High-res dataset, using 5,357 test examples. Our classifier was trained using 0 hand-labeled examples and 5 noisy labels per class. Using a Wordnet semantic affinity matrix over the 53,564 classes gives a clear boost to performance.
[Figure 7: two scatter plots of performance using 0 training pairs for the left-out class (y-axis) against performance using 100 training pairs (x-axis), where all other classes have 100 training pairs; left: with semantic sharing; right: without.]
Fig. 7. An exploration of the performance with 0 training examples for a single class, when all the other classes have 100 training pairs. Left: Using the sharing matrix A, we obtain good performance by transferring labels from semantically similar classes. Right: Without it, the performance drops significantly.
This performance gain obtained by Wordnet sharing is quantified in a large-scale setting in Fig. 6 (right) using the High-res dataset. Chance-level performance corresponds to 2569/(2569+2788) = 48%. Without any sharing, the semi-supervised scheme (blue) gives modest performance. But when Wordnet sharing is added, there is a significant performance boost.
Our final re-ranking experiment applies the semantic sharing scheme to the whole of the Tiny dataset (with no CIFAR labels used). With 74,569 classes, many will be very similar visually, and our sharing scheme can be expected to greatly assist performance. In Fig. 11 we show qualitative results for 4 classes. The semi-supervised algorithm takes around 0.1 seconds to perform each re-ranking (since the eigenfunctions are precomputed), compared to over 1 minute for the nearest-neighbor classifier. These figures show qualitatively that the semi-supervised learning scheme with semantic sharing clearly improves search performance over the original ranking, and that without the sharing matrix the performance drops significantly.
4.2 Classification experiments
Classification with many classes is extremely challenging: picking the correct class out of 75,000 is something that even humans typically cannot do. Hence, instead of using standard metrics, we measure how far the predicted class is from the true class, using the semantic distance 1 − S derived from the similarity matrix of Section 2. Under this measure the true class has distance 0, while 1 indicates total dissimilarity. Fig. 8 illustrates this metric with two example images and a set of samples varying in distance from them.
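Scoring under this metric is a single lookup per test image; a minimal sketch (S as built in Section 2; predicted and true are hypothetical integer class indices):

```python
import numpy as np

def mean_semantic_distance(S, predicted, true):
    """Mean semantic distance (1 - similarity) between predicted
    and true class indices, both (N,) integer arrays."""
    return float(np.mean(1.0 - S[predicted, true]))

# e.g.: score = mean_semantic_distance(S, classifier_preds, y_test)
```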
We compare our semantic sharing approach in the semi-supervised learning framework of [27] to two other approaches: (i) a linear one-vs-all SVM; (ii) the hierarchical SVM approach of Marszalek and Schmid [28]. The latter method uses the semantic relationships between classes to construct a hierarchy of SVMs.
[Figure 8: example images (Miniskirt, Raincoat, Tricorn, Umbrella, Locomotive, Jet, Chocolate ice cream, Tolkien, Aphrodite; Rosa damascena, Jasmine, Fig-bird, Black raspberry, Napoleon, Croissant, Swiss pine, Common bean, Fire Alarm) labeled with semantic distances between 0 and 0.91 to “Long pants” and “China rose” respectively.]
Fig. 8. Our semantic distance performance metric for two examples, “Long pants” and “China rose”. The other images are labeled with their semantic distance to the two examples. Distances under 0.2 correspond to visually similar objects.
In implementing this approach, we use the same Wordnet tree structure from which the semantic distance matrix S is derived. At each edge in the tree, we train a linear SVM in the manner described in [28]. Note that both our semantic sharing method and that of Marszalek and Schmid are provided with the same semantic information. Hence, by comparing the two approaches we can see which makes more efficient use of the semantic information.
These three approaches are evaluated on the CIFAR and High-res datasets in Figures 9 and 10 respectively. The latter also includes the semi-supervised scheme without sharing. The two figures show consistent results that clearly demonstrate: (i) the addition of semantic information helps, as both the H-SVM and SSL with sharing beat the methods without it; (ii) our sharing framework is superior to that of Marszalek and Schmid [28].
5 Summary and future work
We have introduced a very simple mechanism for sharing training labels between classes. Our experiments on a variety of datasets demonstrate that it gives significant benefits in situations where there are many classes, a common occurrence in large image collections. We have shown how semantic sharing can be combined with simple classifiers to operate on large datasets, up to 75,000 classes and 79 million images in size. Furthermore, our experiments clearly demonstrate that our sharing approach outperforms other methods that use semantic information when constructing the classifier. While the semantic sharing matrix derived from Wordnet has proven effective, a goal of future work is to learn it directly from the data.
[Figure 9: left, mean semantic distance from true class (y-axis, 0.25–0.45) against log₂ number of training pairs per class (x-axis, 0–7); right, probability density against semantic distance from true class; curves: SVM, H-SVM, Sharing SSL, Chance.]
Fig. 9. Comparison of approaches for classification on the CIFAR dataset. Red: one-vs-all linear SVM; Green: hierarchical SVM approach of Marszalek and Schmid [28]; Blue: our semantic sharing scheme in the semi-supervised approach of [27]; Black: chance. Left: Mean semantic distance of test examples to the true class as the number of labeled training examples increases (smaller is better). Right: For 100 training examples per class, the distribution of distances for the positive test examples. Our sharing approach has a significantly lower mean semantic distance, with a large mass at distances < 0.2, corresponding to superior classification performance. See Fig. 8 for an illustration of semantic distance.
[Figure 10: probability density against semantic distance from true class (0–1), and a bar chart of average semantic distance from true class (0.5–0.8) for Random Chance, SVM, SSL, H-SVM, and Sharing SSL.]
Fig. 10. Comparison of approaches for classification on the High-res dataset. Red: one-vs-all linear SVM; Green: hierarchical SVM approach of Marszalek and Schmid [28]; Magenta: the semi-supervised scheme of [27]; Blue: [27] with our semantic sharing scheme; Black: random chance. Left: Bar chart showing mean semantic distance from the true label on the test set. Right: The distribution of distances for each method on the test set. Our approach has more mass at distances < 0.2, indicating superior performance.
[Figure 11: image grids for the query classes Pony, Rabbiteye blueberry, Napoleon, and Pond lily; columns: Raw Images; SSL, no sharing; SSL, Wordnet sharing; NN, Wordnet sharing.]
Fig. 11. Sample results of our semantic label sharing scheme on the Tiny dataset (79 million images). 0 hand-labeled training examples were used; instead, the first 5 images of each of the 74,569 classes were taken as positive examples. Using these labels, classifiers were trained for 4 different query classes: “pony”, “rabbiteye blueberry”, “Napoleon” and “pond lily”. Column 1: the raw image ranking from the Internet search engine. Column 2: re-ranking using the semi-supervised scheme without semantic sharing. Column 3: re-ranking with the semi-supervised scheme and semantic sharing. Column 4: re-ranking with a nearest-neighbor classifier and semantic sharing. Without semantic sharing, the classifier has only 5 positive training examples and thus performs poorly. But with semantic sharing it can leverage the semantically close examples from the pool of 5 × 74,569 = 372,845 positive examples.
References
1. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large database for non-parametric object and scene recognition. IEEE PAMI 30 (2008) 1958–1970
2. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: a database and web-based tool for image annotation. IJCV 77 (2008) 157–173
3. von Ahn, L.: The ESP Game (2006)
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR. (2009)
5. Oxford English Dictionary (2009)
6. Fellbaum, C.: WordNet: An Electronic Lexical Database. Bradford Books (1998)
7. Biederman, I.: Recognition-by-components: A theory of human image understanding. Psychological Review 94 (1987) 115–147
8. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
9. Li, L.J., Wang, G., Fei-Fei, L.: OPTIMOL: automatic online picture collection via incremental model learning. In: CVPR. (2007)
10. Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV. Volume 2. (2005) 1816–1823
11. Berg, T., Forsyth, D.: Animals on the web. In: CVPR. (2006) 1463–1470
12. Vijayanarasimhan, S., Grauman, K.: Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In: CVPR. (2008)
13. Caruana, R.: Multitask learning. Machine Learning 28 (1997) 41–75
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86 (1998) 2278–2324
15. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. JAIR 2 (1995) 263–286
16. Torralba, A., Murphy, K., Freeman, W.: Sharing features: efficient boosting procedures for multiclass object detection. In: CVPR. (2004)
17. Opelt, A., Pinz, A., Zisserman, A.: Incremental learning of object detectors using a visual shape alphabet. In: CVPR (1). (2006) 3–10
18. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE PAMI 28 (2006) 594–611
19. Sudderth, E., Torralba, A., Freeman, W., Willsky, A.: Learning hierarchical models of scenes, objects, and parts. In: ICCV. (2005)
20. Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Computation 12 (2000) 1247–1283
21. Bart, E., Ullman, S.: Cross-generalization: learning novel classes from a single example by feature replacement. In: CVPR. (2005)
22. Miller, E., Matsakis, N., Viola, P.: Learning from one example through shared densities on transforms. In: CVPR. Volume 1. (2000) 464–471
23. Quattoni, A., Collins, M., Darrell, T.: Transfer learning for image classification with sparse prototype representations. In: CVPR. (2008)
24. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press (2006)
25. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press (2002)
26. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML. (2003) 912–919
27. Fergus, R., Weiss, Y., Torralba, A.: Semi-supervised learning in gigantic image collections. In: NIPS. (2009)
28. Marszalek, M., Schmid, C.: Semantic hierarchies for visual object recognition. In: CVPR. (2007)