Zero-Shot Learning - The Good, the Bad and the Ugly
Yongqin Xian¹, Bernt Schiele¹, Zeynep Akata¹,²
¹ Max-Planck Institute for Informatics  ² University of Amsterdam
Zero-Shot Evaluation Protocol, Evaluated Methods and Evaluation Metrics
[Figure: training uses labeled images of seen classes Ytr, each described by per-class attributes (e.g. black, white, brown, stripes, water, eats fish) for classes such as polar bear, zebra, otter and tiger. At test time, zero-shot learning classifies only into unseen classes Yts, while generalized zero-shot learning classifies into Yts ∪ Ytr.]
• The Good: zero-shot learning attracts lots of attention
• The Bad: no agreed evaluation protocol
• The Ugly: test classes overlap with ImageNet 1K
Dataset    Size  |Y|  |Ytr|     |Yts|
SUN        14K   717  580 + 65  72
CUB        11K   200  100 + 50  50
AWA1       30K   50   27 + 13   10
AWA2 [11]  37K   50   27 + 13   10
aPY        15K   32   15 + 5    12
ImageNet Split              |Yts|
ImageNet 21K − Ytr          20345
Within 2/3 hops from Ytr    1509 / 7678
Most populated classes      500 / 1K / 5K
Least populated classes     500 / 1K / 5K
Group                     Method      Main Idea
Linear compatibility      ALE [1]     Learn linear embedding by weighted ranking loss
                          DEVISE [4]  Learn linear embedding by ranking loss
                          SJE [2]     Learn linear embedding by multi-class SVM loss
                          ESZSL [8]   Apply square loss and regularize the embedding
                          SAE [5]     Learn linear embedding with autoencoder
Non-linear compatibility  LATEM [10]  Learn piece-wise linear embedding
                          CMT [9]     Learn non-linear embedding by neural network
Two-stage inference       DAP [6]     Predict attributes → unseen class
                          IAP [6]     Predict seen class → attributes → unseen class
                          CONSE [7]   Predict seen class → embedding → unseen class
Hybrid                    SSE [12]    Learn embedding by seen class proportions
                          SYNC [3]    Learn base classifiers → unseen class classifiers
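The linear compatibility methods (ALE, DEVISE, SJE, ESZSL, SAE) all score an image against class embeddings with a bilinear form F(x, y) = θ(x)ᵀ W φ(y) and predict the class with the highest compatibility; they differ in the loss used to fit W. A minimal NumPy sketch of the inference step (the dimensions, the random W, and `predict` are illustrative placeholders, not any one method's trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_attr, n_unseen = 2048, 85, 10   # hypothetical: CNN feature dim, attribute dim, unseen classes

W = rng.standard_normal((d_img, d_attr)) * 0.01  # in practice learned by a ranking/SVM/square loss
unseen_attr = rng.random((n_unseen, d_attr))     # attribute vectors phi(y) of the unseen classes

def predict(theta_x):
    """Assign the unseen class maximizing F(x, y) = theta(x)^T W phi(y)."""
    scores = theta_x @ W @ unseen_attr.T          # one compatibility score per unseen class
    return int(np.argmax(scores))

label = predict(rng.standard_normal(d_img))
```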
Per-class Top-1 Accuracy:

acc_Y = \frac{1}{|Y|} \sum_{c=1}^{|Y|} \frac{\#\text{correct in } c}{\#\text{in } c}

Harmonic Mean:

H = \frac{2 \cdot acc_{Y^{tr}} \cdot acc_{Y^{ts}}}{acc_{Y^{tr}} + acc_{Y^{ts}}}
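Both metrics are straightforward to implement; a minimal sketch (function names are mine, not from the paper's code):

```python
from collections import defaultdict

def per_class_top1_acc(y_true, y_pred):
    """Average of per-class accuracies, so sparsely populated
    classes weigh as much as heavily populated ones."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def harmonic_mean(acc_tr, acc_ts):
    """Harmonic mean of seen (acc_tr) and unseen (acc_ts) class accuracy."""
    if acc_tr + acc_ts == 0:
        return 0.0
    return 2 * acc_tr * acc_ts / (acc_tr + acc_ts)
```

Note that averaging per class (rather than per sample) is what makes the metric robust to class imbalance in the test set.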
• Propose a unified evaluation protocol and data splits
• Compare and analyze a significant number of state-of-the-art methods in depth
• Zero-shot setting & more realistic generalized zero-shot setting
Zero-Shot Learning Setting
[Bar chart: Top-1 accuracy (%) on AWA1 under the Standard Split (SS) vs the Proposed Split (PS) for DAP, IAP, CONSE, CMT, SSE, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]
• Proposed Split (PS) vs Standard Split (SS) on AWA1
• 6 out of 10 test classes of SS appear in ImageNet 1K
• This violates the zero-shot setting: feature learning is part of training, so those "unseen" classes were in fact seen
• PS test sets are not part of ImageNet 1K
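The PS fix can be enforced mechanically: reject any split whose test classes appear among the ImageNet 1K classes used to pre-train the feature extractor. A toy sketch with hypothetical class-name sets (in practice one would load the real ImageNet 1K synset list):

```python
# Hypothetical class-name sets for illustration only
imagenet1k = {"zebra", "tiger", "otter"}   # classes used during feature learning
ss_test    = {"zebra", "otter", "bat"}     # standard-split test classes
ps_test    = {"bat", "dolphin"}            # proposed-split test classes

def overlap_with_pretraining(test_classes, pretrain_classes):
    """Non-empty result means the zero-shot assumption is violated."""
    return sorted(test_classes & pretrain_classes)

ss_overlap = overlap_with_pretraining(ss_test, imagenet1k)  # violation
ps_overlap = overlap_with_pretraining(ps_test, imagenet1k)  # valid zero-shot split
```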
[Bar charts: Top-1 accuracy (%) of the 12 models (DAP, IAP, CONSE, CMT, SSE, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC, SAE) on SUN, CUB, AWA1, AWA2 and aPY.]
[Rank matrix (ranks 1-12): element (i, j) counts how often model i ranks jth across datasets.]
Mean ranks: ALE [2.4], DEVISE [3.3], SJE [3.9], SSE [4.7], ESZSL [4.7], LATEM [5.2], SYNC [6.1], DAP [8.4], SAE [9.6], CMT [9.7], CONSE [9.8], IAP [10.3]
• Top-1 accuracy of PS on SUN, CUB, AWA1 and AWA2 (above)
• Method ranking of PS on all 5 datasets (left); element (i, j) indicates the number of times model i ranks jth
• Compatibility learning methods perform better
[Bar chart: Top-1 accuracy (%) on ImageNet test splits (2H, 3H, M500, M1K, M5K, L500, L1K, L5K, All) for CONSE, CMT, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]
• Top-1 accuracy on ImageNet with different test class splits
• Hybrid models perform best for highly populated classes
• All methods perform similarly for sparsely populated classes
• Results on the least populated classes are lower than on the most populated classes
Generalized Zero-Shot Learning Setting
[Bar charts: generalized zero-shot Top-1 accuracy (%) on CUB and AWA1 for the 12 models, reporting unseen class accuracy (u), seen class accuracy (s) and their harmonic mean (H).]
Ranking of methods w.r.t. u, s and H:
[Rank matrix (ranks 1-13) w.r.t. unseen class accuracy u.]
Mean ranks: ALE [2.3], DEVISE [2.3], SJE [4.8], LATEM [5.4], ESZSL [5.5], CMT* [5.8], SYNC [5.9], SSE [8.1], SAE [9.0], CMT [9.7], IAP [10.0], DAP [10.8], CONSE [11.5]
[Rank matrix (ranks 1-13) w.r.t. seen class accuracy s.]
Mean ranks: CONSE [1.7], IAP [5.2], SYNC [6.0], DAP [6.1], CMT* [6.1], CMT [6.2], ALE [6.2], SSE [8.1], ESZSL [8.4], DEVISE [8.8], LATEM [9.0], SAE [9.0], SJE [10.1]
[Rank matrix (ranks 1-13) w.r.t. harmonic mean H.]
Mean ranks: ALE [2.1], DEVISE [2.6], SJE [4.7], SYNC [5.2], LATEM [5.4], ESZSL [5.5], CMT* [5.7], SSE [8.1], SAE [9.3], CMT [9.9], IAP [10.0], DAP [10.8], CONSE [11.7]
• Methods with very high seen class accuracy have low unseen class accuracy
• The harmonic mean is a good measure because it balances seen and unseen class accuracy
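A quick numeric check of why H is preferred over a plain average: a model with high seen but near-zero unseen accuracy gets a near-zero harmonic mean, while its arithmetic mean would still look respectable (the numbers below are made up for illustration):

```python
def harmonic_mean(s, u):
    """Harmonic mean of seen (s) and unseen (u) accuracy."""
    return 0.0 if s + u == 0 else 2 * s * u / (s + u)

biased   = (80.0, 2.0)    # high seen, near-zero unseen accuracy (in %)
balanced = (45.0, 40.0)   # moderate on both

h_biased   = harmonic_mean(*biased)     # ~3.9, despite an arithmetic mean of 41.0
h_balanced = harmonic_mean(*balanced)   # ~42.4, close to its arithmetic mean of 42.5
```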
Top-K accuracy on ImageNet
[Bar charts: Top-1, Top-5 and Top-10 accuracy (%) on ImageNet test splits (2H, 3H, M500, M1K, M5K, L500, L1K, L5K, All) for CONSE, CMT, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]
[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. TPAMI, 2016.
[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
[3] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.
[4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, 2013.
[5] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
[6] C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 2013.
[7] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[8] B. Romera-Paredes and P. H. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
[9] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[10] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
[11] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.
[12] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.