
Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks
Kuan-Chuan Peng and Tsuhan Chen

[email protected], [email protected]

Motivation

1. Relatively few CNN-related works focus on abstract tasks.

2. Different abstract tasks are treated in a relatively independent fashion in the prior literature.

Summary of Findings

1. Superior performance of CNN-based approaches in abstract tasks.

2. Concatenating CNN features learned from different tasks can enhance the performance in each task.

3. Concatenating CNN features learned from all the tasks does NOT perform the best.

4. Suggestions for choosing CNN features to use in abstract tasks.

Using the Features Learned from Other Tasks

task ID                                  | EMO    | AST    | ART    | AVA    | FAS    | ARC    | MEM   | INT   | CAL
evaluation metric                        | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | ρ | ρ | accuracy (%)
self feature                             | 36.228 | 55.148 | 67.555 | 69.423 | 76.957 | 54.440 | 0.398 | 0.573 | 88.217
best concatenating setting (table below) | 39.082 | 57.509 | 71.048 | 69.980 | 77.609 | 55.382 | 0.507 | 0.630 | 88.394
concatenate all features                 | 36.971 | 54.596 | 69.210 | 69.458 | 74.348 | 53.489 | 0.504 | 0.629 | 85.969

CAL: Caltech-101 object classification task (30 training images / class); underlined: the performance better than that of using the self feature

Composition of the best concatenating setting for each task; each "v" marks a task (column) whose best concatenating setting includes the feature in that row:

task ID | ART AST CAL ARC EMO AVA FAS MEM INT
F_ART   | v v v v v v
F_AST   | v v v v
F_CAL   | v v v v
F_ARC   | v v v v
F_EMO   | v v v v v v
F_AVA   | v v v v v v
F_FAS   | v v v v v v
F_MEM   | v v v v v
F_INT   | v v

F_T: the 4096-d feature vector output by the CNN trained on task T; we call F_T the "self feature."
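For readers who want to reproduce the feature pipeline, below is a minimal sketch of extracting such a 4096-d F_T, assuming PyTorch/torchvision's AlexNet and the common choice of the fc7 activation as the feature; the weight file, preprocessing, and function names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Standard AlexNet preprocessing (224x224 crop, ImageNet statistics);
# assumed here, since the poster does not spell out the preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def feature_extractor(weights_path=None):
    """Return an AlexNet truncated after fc7, so it outputs a 4096-d vector.

    `weights_path` would hold the weights of the CNN trained on task T
    (hypothetical file; the poster's trained models are not bundled here).
    """
    model = models.alexnet()
    if weights_path is not None:
        model.load_state_dict(torch.load(weights_path))
    # Drop the final fc8 layer; the remaining classifier ends at fc7 (4096-d).
    model.classifier = nn.Sequential(*list(model.classifier.children())[:-1])
    model.eval()
    return model

@torch.no_grad()
def extract_F_T(model, image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return model(x).squeeze(0)  # shape: (4096,)
```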

task ID              | ART        | AST        | CAL        | ARC        | EMO        | AVA        | FAS        | MEM       | INT
evaluation metric    | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | ρ | ρ
F_ART                | 1 (67.555) | 2 (48.569) | 4 (81.080) | 4 (48.845) | 7 (31.138) | 5 (63.181) | 4 (68.478) | 3 (0.454) | 5 (0.520)
F_AST                | 1 (67.555) | 1 (55.148) | 5 (81.066) | 6 (48.320) | 6 (31.139) | 7 (62.915) | 6 (66.739) | 6 (0.442) | 6 (0.508)
F_CAL                | 3 (61.121) | 5 (45.907) | 1 (88.217) | 3 (50.411) | 5 (31.886) | 3 (63.297) | 7 (66.739) | 1 (0.464) | 8 (0.493)
F_ARC                | 4 (60.938) | 4 (46.760) | 2 (83.667) | 1 (54.440) | 2 (33.868) | 4 (63.277) | 2 (71.739) | 7 (0.434) | 9 (0.487)
F_EMO                | 4 (60.938) | 3 (47.564) | 8 (70.771) | 2 (50.629) | 1 (36.228) | 2 (63.392) | 5 (68.043) | 2 (0.459) | 7 (0.497)
F_AVA                | 7 (56.985) | 7 (41.487) | 6 (80.475) | 7 (46.824) | 4 (33.372) | 1 (69.423) | 3 (69.565) | 5 (0.445) | 1 (0.575)
F_FAS                | 6 (57.813) | 6 (44.751) | 3 (81.514) | 5 (48.697) | 3 (33.748) | 6 (63.006) | 1 (76.957) | 4 (0.450) | 4 (0.542)
F_MEM                | 9 (51.838) | 9 (37.167) | 9 (65.643) | 9 (40.986) | 9 (27.170) | 9 (61.004) | 9 (60.870) | 8 (0.398) | 3 (0.560)
F_INT                | 8 (55.790) | 8 (40.532) | 7 (74.316) | 8 (43.925) | 8 (30.029) | 8 (61.666) | 8 (66.087) | 9 (0.346) | 2 (0.573)
self feature + F_ART | n/a        | 3 (56.404) | 4 (88.231) | 2 (55.124) | 1 (37.095) | 6 (69.498) | 2 (76.957) | 7 (0.459) | 6 (0.598)
self feature + F_AST | 1 (70.129) | n/a        | 5 (87.823) | 6 (54.574) | 3 (36.560) | 3 (69.528) | 7 (74.348) | 3 (0.466) | 4 (0.599)
self feature + F_CAL | 2 (68.658) | 2 (56.605) | n/a        | 5 (54.747) | 5 (36.352) | 8 (69.323) | 4 (76.522) | 2 (0.468) | 5 (0.598)
self feature + F_ARC | 2 (68.658) | 7 (55.098) | 3 (88.244) | n/a        | 7 (35.855) | 5 (69.503) | 5 (75.652) | 5 (0.463) | 7 (0.598)
self feature + F_EMO | 4 (68.382) | 1 (57.308) | 6 (87.728) | 1 (55.149) | n/a        | 7 (69.473) | 6 (75.435) | 1 (0.470) | 3 (0.599)
self feature + F_AVA | 6 (67.739) | 6 (55.349) | 1 (88.319) | 4 (54.990) | 2 (36.601) | n/a        | 1 (77.391) | 4 (0.465) | 1 (0.610)
self feature + F_FAS | 7 (67.647) | 4 (56.103) | 2 (88.299) | 3 (55.064) | 6 (36.105) | 1 (69.729) | n/a        | 6 (0.461) | 2 (0.603)
self feature + F_MEM | 5 (67.831) | 5 (55.550) | 7 (86.975) | 7 (53.271) | 4 (36.479) | 3 (69.528) | 2 (76.957) | n/a       | 8 (0.585)
self feature + F_INT | 8 (67.096) | 8 (53.541) | 8 (85.834) | 8 (52.359) | 8 (35.361) | 2 (69.594) | 7 (74.348) | 8 (0.417) | n/a

format: rank (performance); underlined: the performance better than that of using the self feature

References

[1] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In ICCV, 2013.

[2] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In CVPR, 2011.

[3] F. S. Khan, S. Beigpour, J. van de Weijer, and M. Felsberg. Painting-91: a large scale database for computational painting categorization. In Machine Vision and Applications, 2014.

[4] A. Khosla, J. Xiao, A. Torralba, and A. Oliva. Memorability of image regions. In NIPS, 2012.

[5] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Discovering elements of fashion styles. In ECCV, 2014.

[6] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang. RAPID: rating pictorial aesthetics using deep learning. In ACMMM, 2014.

[7] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In International Conference on Multimedia, 2010.

[8] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, 2012.

[9] X. Wang, J. Jia, J. Yin, and L. Cai. Interpretable aesthetic features for affective image classification. In ICIP, 2013.

[10] Z. Xu, D. Tao, Y. Zhang, J. Wu, and A. C. Tsoi. Architectural style classification using multinomial latent logistic regression. In ECCV, 2014.

Abstract Tasks in Our Experiment

dataset                    | Artphoto | Painting-91 | Painting-91 | AVA | HipsterWars | arcDataset | Memorability | Memorability
task                       | emotion classification | artist classification | artistic style classification | aesthetic classification | fashion style classification | architectural style classification | memorability prediction | interestingness prediction
reference                  | [7] | [3] | [3] | [8] | [5] | [10] | [2] | dataset: [2]; task: [1]
task ID                    | EMO | AST | ART | AVA | FAS | ARC | MEM | INT
# classes                  | 8 | 91 | 13 | 2 | 5 | 10 / 25 | regression task | regression task
# images                   | 806 | 4266 | 2338 | >250k | 1893 | 2043 / 4786 | 2222 | 2222
class labels               | fear, sad, etc. | Rubens, Picasso, etc. | Baroque, Cubism, etc. | high / low aesthetic quality | Bohemian, Goth, etc. | Georgian, Gothic, etc. | memorability | interestingness
# training images          | ∼645 | 2275 | 1250 | ∼233k | 853 | 300 / 750 | 1111 | 1982
# testing images           | ∼160 | 1991 | 1088 | 19930 | 92 | 1743 / 4036 | 1111 | 240
data split                 | random | specified [3] | specified [3] | specified [8] | random | random | specified [2] | random
# fold(s)                  | 5 | 1 | 1 | 1 | 100 | 10 | 25 | 10
evaluation metric          | 1-vs-all accuracy | accuracy | accuracy | accuracy | accuracy | accuracy | ρ | ρ
reference of above setting | [9] | [3] | [3] | [8] | [5] | [10] | [2] | [1]
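MEM and INT are evaluated with Spearman's rank correlation ρ between predicted and ground-truth scores, following [2] and [1]. A minimal sketch of this metric using SciPy (the exact evaluation code is an assumption, not the authors'):

```python
from scipy.stats import spearmanr

def evaluate_rho(predicted, ground_truth):
    """Spearman's rank correlation between predicted and human scores."""
    rho, _p_value = spearmanr(predicted, ground_truth)
    return rho

# A prediction that preserves the ground-truth ranking yields rho = 1.0:
print(evaluate_rho([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 4]))  # 1.0
```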

Performance Using AlexNet

task ID              | EMO        | AST        | ART        | AVA        | FAS        | ARC                  | MEM       | INT
evaluation metric    | 1-vs-all accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | accuracy (%) | ρ | ρ
previous work        | 63.163 [9] | 53.100 [3] | 62.200 [3] | 73.250 [6] | 70.971 [5] | 69.170 / 46.210 [10] | 0.500 [4] | 0.600 [1]
pt ImageNet + ft     | 60.127     | 56.102     | 68.290     | n/a        | 71.294     | 71.159 / 52.953      | 0.520     | 0.643
pt ImageNet + ft-fc8 | 64.724     | 53.541     | 65.165     | n/a        | 66.228     | 67.246 / 51.469      | -0.140    | 0.339
pt AVA + ft          | 59.836     | 25.615     | 40.625     | n/a        | 57.337     | 35.841 / 20.401      | 0.368     | 0.511
pt AVA + ft-fc8      | 60.644     | 4.671      | 18.015     | n/a        | 27.554     | 18.233 / 8.290       | 0.080     | -0.113
train from scratch   | 61.572     | 21.698     | 38.327     | 74.436     | 54.304     | 21.532 / 12.386      | 0.372     | 0.382

pt: pre-training; ft: fine-tuning; bold: the best performance; underlined: the performance better than that of training from scratch
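The rows "pt X + ft" and "pt X + ft-fc8" differ in how much of the pre-trained network is updated: ft fine-tunes all layers, while ft-fc8 freezes everything except the final classification layer (fc8). A minimal PyTorch sketch of the distinction, using torchvision's AlexNet as an illustrative stand-in for the original training setup:

```python
import torch.nn as nn
from torchvision import models

def make_finetune_model(num_classes, ft_fc8_only=False):
    """AlexNet prepared for fine-tuning on a target abstract task.

    ft_fc8_only=True mimics the 'ft-fc8' setting: only the final
    classification layer (fc8) is trained; all other weights stay fixed.
    """
    model = models.alexnet(weights="IMAGENET1K_V1")  # 'pt ImageNet'
    if ft_fc8_only:
        for param in model.parameters():
            param.requires_grad = False
    # Replace fc8 with a fresh layer sized for the new task; the new
    # parameters have requires_grad=True by default, so they are trained.
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model

# 'pt ImageNet + ft'     : make_finetune_model(13)                     # e.g. ART
# 'pt ImageNet + ft-fc8' : make_finetune_model(13, ft_fc8_only=True)
```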

Concatenating CNN Features

We use AlexNet as the CNN architecture for each CNN feature extractor.
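Concretely, for a target task the self feature is concatenated with features learned from other tasks, and a classifier is trained on the combined vector. A minimal sketch, assuming the 4096-d features have already been extracted as above and using a linear SVM as a stand-in for the downstream classifier (which the poster does not specify):

```python
import numpy as np
from sklearn.svm import LinearSVC

def concatenate_features(*feature_sets):
    """Stack per-image features from several tasks along the feature axis.

    Each element is an (n_images, 4096) array, e.g. F_EMO and F_ART
    extracted from the corresponding task-specific CNNs.
    """
    return np.concatenate(feature_sets, axis=1)

# Hypothetical pre-extracted features for the same 806 Artphoto images.
F_self = np.random.randn(806, 4096)   # F_EMO, the self feature
F_other = np.random.randn(806, 4096)  # F_ART, learned from another task
labels = np.random.randint(0, 8, size=806)  # 8 emotion classes

X = concatenate_features(F_self, F_other)   # shape: (806, 8192)
classifier = LinearSVC().fit(X, labels)
```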