

Zero-Shot Learning - The Good, the Bad and the Ugly
Yongqin Xian 1, Bernt Schiele 1, Zeynep Akata 1,2

1 Max Planck Institute for Informatics   2 University of Amsterdam

Zero-Shot Evaluation Protocol, Evaluated Methods and Evaluation Metrics

Training time: both settings train on the seen classes Y^tr.
Test time: zero-shot learning classifies into Y^ts only, while generalized zero-shot learning classifies into Y^ts ∪ Y^tr.

[Figure: example classes (polar bear, zebra, otter, tiger) annotated with per-class attributes such as black, white, brown, stripes, water, eats fish.]

• The Good: zero-shot learning attracts a lot of attention.
• The Bad: there is no agreed evaluation protocol.
• The Ugly: test classes overlap with ImageNet 1K.

Dataset     Size   |Y|   |Y^tr|     |Y^ts|
SUN         14K    717   580 + 65   72
CUB         11K    200   100 + 50   50
AWA1        30K    50    27 + 13    10
AWA2 [11]   37K    50    27 + 13    10
aPY         1.5K   32    15 + 5     12

ImageNet split               |Y^ts|
ImageNet 21K minus Y^tr      20,345
Within 2/3 hops from Y^tr    1,509 / 7,678
Most populated classes       500 / 1K / 5K
Least populated classes      500 / 1K / 5K

Group                      Method       Main Idea
Linear compatibility       ALE [1]      Learn linear embedding by weighted ranking loss
                           DEVISE [4]   Learn linear embedding by ranking loss
                           SJE [2]      Learn linear embedding by multi-class SVM loss
                           ESZSL [8]    Apply square loss and regularize the embedding
                           SAE [5]      Learn linear embedding with autoencoder
Non-linear compatibility   LATEM [10]   Learn piece-wise linear embedding
                           CMT [9]      Learn non-linear embedding by neural network
Two-stage inference        DAP [6]      Predict attributes → unseen class
                           IAP [6]      Predict seen class → attributes → unseen class
                           CONSE [7]    Predict seen class → embedding → unseen class
Hybrid                     SSE [12]     Learn embedding by seen class proportions
                           SYNC [3]     Learn base classifiers → unseen class classifiers
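The two-stage idea (e.g. DAP [6]) can be sketched as follows. This is a simplified illustration, not the authors' implementation: predict per-attribute probabilities from the image, then score each unseen class by how well its binary attribute signature matches. (DAP additionally normalizes by attribute priors, which is omitted here.)

```python
import numpy as np

def dap_predict(attr_probs, class_attrs):
    """Simplified DAP-style inference.

    attr_probs: (A,) array of p(attribute | image) for A attributes.
    class_attrs: (C, A) binary matrix of per-class attribute annotations.
    Score each class by the product of p(a|x) where the class has
    attribute a, and 1 - p(a|x) where it does not (independence
    assumption, as in DAP), and return the best-scoring class index.
    """
    probs = np.where(class_attrs == 1, attr_probs, 1.0 - attr_probs)
    scores = probs.prod(axis=1)
    return int(np.argmax(scores))
```

For instance, with attributes (stripes, water), attr_probs = [0.9, 0.1] and classes zebra = [1, 0], otter = [0, 1], the zebra row scores 0.9 * 0.9 = 0.81 and wins.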

Per-class Top-1 Accuracy:

  acc_Y = (1 / |Y|) * Σ_{c=1}^{|Y|} (# correct in c) / (# in c)

Harmonic Mean:

  H = (2 * acc_{Y^tr} * acc_{Y^ts}) / (acc_{Y^tr} + acc_{Y^ts})
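The two metrics above can be sketched directly; a minimal NumPy version:

```python
import numpy as np

def per_class_top1(y_true, y_pred, classes):
    """Mean of per-class accuracies: each class contributes equally,
    regardless of how many test images it has."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """H = 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```

Averaging per class matters on imbalanced test sets: a classifier that only gets the large classes right scores high under plain (per-image) accuracy but not under per-class accuracy; likewise, H collapses toward zero whenever either seen or unseen accuracy is near zero.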

• Propose a unified evaluation protocol and data splits.
• Compare and analyze a significant number of state-of-the-art methods in depth.
• Evaluate both the zero-shot setting and the more realistic generalized zero-shot setting.

Zero-Shot Learning Setting

[Figure: Top-1 accuracy (in %) of DAP, IAP, CONSE, CMT, SSE, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE on AWA1, comparing the Standard Split (SS) and the Proposed Split (PS).]

• Proposed Split (PS) vs Standard Split (SS) on AWA1.
• 6 out of 10 test classes of SS appear in ImageNet 1K.
• This violates the zero-shot setting: feature learning becomes part of training.
• PS test sets are not part of ImageNet 1K.
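The overlap problem above can be checked mechanically: intersect the zero-shot test classes with the label set the feature extractor was pre-trained on. A small sketch (the class names here are illustrative, not the actual split):

```python
def zero_shot_overlap(test_classes, pretrain_classes):
    """Return the test classes that also appear in the feature
    extractor's pre-training label set. Any non-empty result
    violates the zero-shot assumption."""
    return sorted(set(test_classes) & set(pretrain_classes))

# Hypothetical labels for illustration only:
print(zero_shot_overlap({"chimpanzee", "rat"}, {"rat", "zebra"}))  # ['rat']
```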

[Figure: Top-1 accuracy (in %) of the 12 models under the Proposed Split on SUN, CUB, AWA1, AWA2 and aPY.]

[Figure: rank matrix under the Proposed Split across all 5 datasets; mean ranks: ALE [2.4], DEVISE [3.3], SJE [3.9], SSE [4.7], ESZSL [4.7], LATEM [5.2], SYNC [6.1], DAP [8.4], SAE [9.6], CMT [9.7], CONSE [9.8], IAP [10.3].]

• Top-1 accuracy under PS on SUN, CUB, AWA1 and AWA2 (above).
• Method rankings under PS on all 5 datasets (left); element (i, j) gives the number of times model i ranks j-th.
• Compatibility learning methods perform better.
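The best-ranked methods above learn a compatibility function between image features θ(x) and class embeddings φ(y), in the linear case F(x, y) = θ(x)ᵀ W φ(y). A minimal NumPy sketch of inference with such a model (W is assumed already trained; this is an illustration of the shared prediction rule, not any one paper's code):

```python
import numpy as np

def compatibility_predict(img_feat, W, class_embeds):
    """Score every class by the bilinear compatibility theta(x)^T W phi(y)
    and return the index of the highest-scoring class.

    img_feat: (D,) image feature, W: (D, E) learned matrix,
    class_embeds: (C, E) class embeddings (attributes or word vectors).
    """
    scores = img_feat @ W @ class_embeds.T  # (C,) compatibility scores
    return int(np.argmax(scores))
```

The methods differ mainly in how W is trained: a weighted ranking loss in ALE [1], a ranking loss in DEVISE [4], a multi-class SVM loss in SJE [2], a regularized squared loss in ESZSL [8]; inference is this same argmax in each case.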

[Figure: Top-1 accuracy (in %) on the ImageNet splits 2H, 3H, M500, M1K, M5K, L500, L1K, L5K and All for CONSE, CMT, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]

• Top-1 accuracy on ImageNet with different test class splits.
• Hybrid models perform best for highly populated classes.
• All methods perform similarly for sparsely populated classes.
• Results on the least populated classes are lower than on the most populated classes.

Generalized Zero-Shot Learning Setting

[Figure: generalized zero-shot Top-1 accuracy (in %) of the 12 models on CUB and AWA1, reporting unseen class accuracy u, seen class accuracy s and harmonic mean H.]

Ranking of the methods w.r.t. u (left), s (right) and H (bottom):

[Figure: rank matrices across all 5 datasets; mean ranks:
 u: ALE [2.3], DEVISE [2.3], SJE [4.8], LATEM [5.4], ESZSL [5.5], CMT* [5.8], SYNC [5.9], SSE [8.1], SAE [9.0], CMT [9.7], IAP [10.0], DAP [10.8], CONSE [11.5]
 s: CONSE [1.7], IAP [5.2], SYNC [6.0], DAP [6.1], CMT* [6.1], CMT [6.2], ALE [6.2], SSE [8.1], ESZSL [8.4], DEVISE [8.8], LATEM [9.0], SAE [9.0], SJE [10.1]
 H: ALE [2.1], DEVISE [2.6], SJE [4.7], SYNC [5.2], LATEM [5.4], ESZSL [5.5], CMT* [5.7], SSE [8.1], SAE [9.3], CMT [9.9], IAP [10.0], DAP [10.8], CONSE [11.7]]

• Methods with very high seen class accuracy have low unseen class accuracy.
• The harmonic mean is a good measure because it balances seen and unseen class accuracy.

Top-K accuracy on ImageNet

[Figure: Top-1, Top-5 and Top-10 accuracy (in %) on the ImageNet splits 2H, 3H, M500, M1K, M5K, L500, L1K, L5K and All for CONSE, CMT, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. TPAMI, 2016.

[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.

[3] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.

[4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, 2013.

[5] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.

[6] C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 2013.

[7] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.

[8] B. Romera-Paredes and P. H. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.

[9] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.

[10] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.

[11] Y. Xian, H. C. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.

[12] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.