

Zero-Shot Learning - The Good, the Bad and the Ugly
Yongqin Xian 1, Bernt Schiele 1, Zeynep Akata 1,2

1 Max Planck Institute for Informatics   2 University of Amsterdam

Zero-Shot Evaluation Protocol, Evaluated Methods and Evaluation Metrics

Training time: both settings train on the seen classes Y^tr.
Test time: zero-shot learning classifies into Y^ts only, while generalized zero-shot learning classifies into Y^ts ∪ Y^tr.

[Figure: example classes (polar bear, zebra, otter, tiger) annotated with per-class attributes such as black, white, brown, stripes, water, eats fish.]

• The Good: zero-shot learning attracts a lot of attention.
• The Bad: there is no agreed evaluation protocol.
• The Ugly: test classes overlap with ImageNet 1K.

Dataset     Size   |Y|   |Y^tr|     |Y^ts|
SUN         14K    717   580 + 65   72
CUB         11K    200   100 + 50   50
AWA1        30K    50    27 + 13    10
AWA2 [11]   37K    50    27 + 13    10
aPY         1.5K   32    15 + 5     12

ImageNet split               |Y^ts|
ImageNet 21K minus Y^tr      20,345
Within 2/3 hops from Y^tr    1,509 / 7,678
Most populated classes       500 / 1K / 5K
Least populated classes      500 / 1K / 5K

Group                      Method       Main Idea
Linear compatibility       ALE [1]      Learn linear embedding by weighted ranking loss
                           DEVISE [4]   Learn linear embedding by ranking loss
                           SJE [2]      Learn linear embedding by multi-class SVM loss
                           ESZSL [8]    Apply square loss and regularize the embedding
                           SAE [5]      Learn linear embedding with autoencoder
Non-linear compatibility   LATEM [10]   Learn piece-wise linear embedding
                           CMT [9]      Learn non-linear embedding by neural network
Two-stage inference        DAP [6]      Predict attributes → unseen class
                           IAP [6]      Predict seen class → attributes → unseen class
                           CONSE [7]    Predict seen class → embedding → unseen class
Hybrid                     SSE [12]     Learn embedding by seen class proportions
                           SYNC [3]     Learn base classifiers → unseen class classifiers
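The two-stage idea (e.g. DAP [6]) can be sketched as follows. This is a simplified illustration, not the authors' implementation: predict per-attribute probabilities from the image, then score each unseen class by how well its binary attribute signature matches. (DAP additionally normalizes by attribute priors, which is omitted here.)

```python
import numpy as np

def dap_predict(attr_probs, class_attrs):
    """Simplified DAP-style inference.

    attr_probs: (A,) array of p(attribute | image) for A attributes.
    class_attrs: (C, A) binary matrix of per-class attribute annotations.
    Score each class by the product of p(a|x) where the class has
    attribute a, and 1 - p(a|x) where it does not (independence
    assumption, as in DAP), and return the best-scoring class index.
    """
    probs = np.where(class_attrs == 1, attr_probs, 1.0 - attr_probs)
    scores = probs.prod(axis=1)
    return int(np.argmax(scores))
```

For instance, with attributes (stripes, water), attr_probs = [0.9, 0.1] and classes zebra = [1, 0], otter = [0, 1], the zebra row scores 0.9 * 0.9 = 0.81 and wins.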

Per-class Top-1 Accuracy:

  acc_Y = (1 / |Y|) * Σ_{c=1}^{|Y|} (# correct in c) / (# in c)

Harmonic Mean:

  H = (2 * acc_{Y^tr} * acc_{Y^ts}) / (acc_{Y^tr} + acc_{Y^ts})
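The two metrics above can be sketched directly; a minimal NumPy version:

```python
import numpy as np

def per_class_top1(y_true, y_pred, classes):
    """Mean of per-class accuracies: each class contributes equally,
    regardless of how many test images it has."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """H = 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```

Averaging per class matters on imbalanced test sets: a classifier that only gets the large classes right scores high under plain (per-image) accuracy but not under per-class accuracy; likewise, H collapses toward zero whenever either seen or unseen accuracy is near zero.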

• Propose a unified evaluation protocol and data splits.
• Compare and analyze a significant number of state-of-the-art methods in depth.
• Evaluate both the zero-shot setting and the more realistic generalized zero-shot setting.

Zero-Shot Learning Setting

[Figure: Top-1 accuracy (in %) of DAP, IAP, CONSE, CMT, SSE, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE on AWA1, comparing the Standard Split (SS) and the Proposed Split (PS).]

• Proposed Split (PS) vs Standard Split (SS) on AWA1.
• 6 out of 10 test classes of SS appear in ImageNet 1K.
• This violates the zero-shot setting: feature learning becomes part of training.
• PS test sets are not part of ImageNet 1K.
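The overlap problem above can be checked mechanically: intersect the zero-shot test classes with the label set the feature extractor was pre-trained on. A small sketch (the class names here are illustrative, not the actual split):

```python
def zero_shot_overlap(test_classes, pretrain_classes):
    """Return the test classes that also appear in the feature
    extractor's pre-training label set. Any non-empty result
    violates the zero-shot assumption."""
    return sorted(set(test_classes) & set(pretrain_classes))

# Hypothetical labels for illustration only:
print(zero_shot_overlap({"chimpanzee", "rat"}, {"rat", "zebra"}))  # ['rat']
```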

[Figure: Top-1 accuracy (in %) of the 12 models under the Proposed Split on SUN, CUB, AWA1, AWA2 and aPY.]

[Figure: rank matrix under the Proposed Split across all 5 datasets; mean ranks: ALE [2.4], DEVISE [3.3], SJE [3.9], SSE [4.7], ESZSL [4.7], LATEM [5.2], SYNC [6.1], DAP [8.4], SAE [9.6], CMT [9.7], CONSE [9.8], IAP [10.3].]

• Top-1 accuracy under PS on SUN, CUB, AWA1 and AWA2 (above).
• Method rankings under PS on all 5 datasets (left); element (i, j) gives the number of times model i ranks j-th.
• Compatibility learning methods perform better.
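The best-ranked methods above learn a compatibility function between image features θ(x) and class embeddings φ(y), in the linear case F(x, y) = θ(x)ᵀ W φ(y). A minimal NumPy sketch of inference with such a model (W is assumed already trained; this is an illustration of the shared prediction rule, not any one paper's code):

```python
import numpy as np

def compatibility_predict(img_feat, W, class_embeds):
    """Score every class by the bilinear compatibility theta(x)^T W phi(y)
    and return the index of the highest-scoring class.

    img_feat: (D,) image feature, W: (D, E) learned matrix,
    class_embeds: (C, E) class embeddings (attributes or word vectors).
    """
    scores = img_feat @ W @ class_embeds.T  # (C,) compatibility scores
    return int(np.argmax(scores))
```

The methods differ mainly in how W is trained: a weighted ranking loss in ALE [1], a ranking loss in DEVISE [4], a multi-class SVM loss in SJE [2], a regularized squared loss in ESZSL [8]; inference is this same argmax in each case.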

[Figure: Top-1 accuracy (in %) on the ImageNet splits 2H, 3H, M500, M1K, M5K, L500, L1K, L5K and All for CONSE, CMT, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]

• Top-1 accuracy on ImageNet with different test class splits.
• Hybrid models perform best for highly populated classes.
• All methods perform similarly for sparsely populated classes.
• Results on the least populated classes are lower than on the most populated classes.

Generalized Zero-Shot Learning Setting

[Figure: generalized zero-shot Top-1 accuracy (in %) of the 12 models on CUB and AWA1, reporting unseen class accuracy u, seen class accuracy s and harmonic mean H.]

Ranking of the methods w.r.t. u (left), s (right) and H (bottom):

[Figure: rank matrices across all 5 datasets; mean ranks:
 u: ALE [2.3], DEVISE [2.3], SJE [4.8], LATEM [5.4], ESZSL [5.5], CMT* [5.8], SYNC [5.9], SSE [8.1], SAE [9.0], CMT [9.7], IAP [10.0], DAP [10.8], CONSE [11.5]
 s: CONSE [1.7], IAP [5.2], SYNC [6.0], DAP [6.1], CMT* [6.1], CMT [6.2], ALE [6.2], SSE [8.1], ESZSL [8.4], DEVISE [8.8], LATEM [9.0], SAE [9.0], SJE [10.1]
 H: ALE [2.1], DEVISE [2.6], SJE [4.7], SYNC [5.2], LATEM [5.4], ESZSL [5.5], CMT* [5.7], SSE [8.1], SAE [9.3], CMT [9.9], IAP [10.0], DAP [10.8], CONSE [11.7]]

• Methods with very high seen class accuracy have low unseen class accuracy.
• The harmonic mean is a good measure because it balances seen and unseen class accuracy.

Top-K accuracy on ImageNet

[Figure: Top-1, Top-5 and Top-10 accuracy (in %) on the ImageNet splits 2H, 3H, M500, M1K, M5K, L500, L1K, L5K and All for CONSE, CMT, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC and SAE.]

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. TPAMI, 2016.

[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.

[3] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.

[4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, 2013.

[5] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.

[6] C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 2013.

[7] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.

[8] B. Romera-Paredes and P. H. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.

[9] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.

[10] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.

[11] Y. Xian, H. C. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.

[12] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.