(DL Hacks reading group) How transferable are features in deep neural networks?


How transferable are features in deep neural networks?
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson

Presenter: Masahiro Suzuki (DL Hacks)

About this paper
•  A NIPS 2014 paper.
•  Cited around 15 times, which is a lot for such a recently published paper.
•  Transfer learning with deep learning. It does not propose a new model; rather, it experimentally investigates and analyzes something many people have wondered about. Nothing technically difficult is involved.
•  The paper calls this "transfer learning," but it barely discusses how the work is positioned within existing transfer-learning research; it is not a paper that treats transfer learning in depth.
•  This presentation basically follows the paper.

General vs. specific features in DNNs
•  When a deep convolutional network is trained on images, interesting features appear in the shallow layers (layer 1): Gabor-filter-like patterns and color blobs.
•  These first-layer features appear regardless of the dataset, the objective, or the training setup, so they are called general.
•  Deep layers (near the output): the features depend strongly on the chosen dataset and task, so they are called specific.

(Feature visualizations from http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf; shallow layers = general, deep layers = specific.)

Questions
These observations about general and specific features raise the following questions. If we can answer them, we can identify the general part of a DNN and exploit it for transfer learning.

Question 1: Can we quantify the degree to which a particular layer is general or specific?

Question 2: Does the transition from general to specific occur suddenly at a single layer, or is it spread over several layers?

Question 3: Around which layers does this transition occur?

Transfer learning
•  (Note) In the terminology of [Pan et al. 2010], the transfer learning considered here corresponds to inductive transfer learning (transfer between different tasks).
•  Base task: the task the network is originally trained on.
•  Target task: the task we actually want to solve.
•  When data for the target task is scarce, transferring from the base task can improve performance.

Figure 1: Overview of the experimental treatments and controls. Top two rows: The base networks are trained using standard supervised backprop on only half of the ImageNet dataset (first row: A half, second row: B half). The labeled rectangles (e.g. WA1) represent the weight vector learned for that layer, with the color indicating which dataset the layer was originally trained on. The vertical, ellipsoidal bars between weight vectors represent the activations of the network at each layer. Third row: In the selffer network control, the first n weight layers of the network (in this example, n = 3) are copied from a base network (e.g. one trained on dataset B), the upper 8 − n layers are randomly initialized, and then the entire network is trained on that same dataset (in this example, dataset B). The first n layers are either locked during training ("frozen" selffer treatment B3B) or allowed to learn ("fine-tuned" selffer treatment B3B+). This treatment reveals the occurrence of fragile co-adaptation, when neurons on neighboring layers co-adapt during training in such a way that cannot be rediscovered when one layer is frozen. Fourth row: The transfer network experimental treatment is the same as the selffer treatment, except that the first n layers are copied from a network trained on one dataset (e.g. A) and then the entire network is trained on the other dataset (e.g. B). This treatment tests the extent to which the features on layer n are general or specific.

3 Experimental Setup

Since Krizhevsky et al. (2012) won the ImageNet 2012 competition, there has been much interest and work toward tweaking hyperparameters of large convolutional models. However, in this study we aim not to maximize absolute performance, but rather to study transfer results on a well-known architecture. We use the reference implementation provided by Caffe (Jia et al., 2014) so that our results will be comparable, extensible, and useful to a large number of researchers. Further details of the training setup (learning rates, etc.) are given in the supplementary material, and code and parameter files to reproduce these experiments are available at http://yosinski.com/transfer.

4 Results and Discussion

We performed three sets of experiments. The main experiment has random A/B splits and is discussed in Section 4.1. Section 4.2 presents an experiment with the man-made/natural split. Section 4.3 describes an experiment with random weights.



Transfer learning procedure in this paper
•  Copy the weights of the first layers of the base-task network into the target-task network (= transfer).
•  The remaining layers are randomly initialized, and the network is trained on the target-task data.
•  If the transferred layers are general, this should work well.

Fine-tune vs. frozen
•  After transferring from the base task to the target task, there are two options for the copied layers:
   1. Backpropagate through the copied layers on the target task (fine-tune).
   2. Keep the copied layers' weights fixed (frozen).
•  When target-task data is scarce, frozen tends to be better (fine-tuning would overfit); when data is plentiful, fine-tuning tends to be better.
•  The experiments examine both options.

Model setup
•  Call the base task A and the target task B.
•  Both A and B use an 8-layer convolutional network; the trained base networks are called baseA and baseB (the datasets are described later).
•  With 8 layers, the number of copied layers n ranges over {1, 2, ..., 7}.
•  "Copying n layers" means copying every layer from the first (shallowest) weight layer up to the n-th.


Model setup (example with n = 3)
•  B3B: copy the first 3 layers from baseB and freeze them, randomly initialize the remaining layers, and train on dataset B (i.e., no transfer; a "selffer" control).
•  A3B: copy the first 3 layers from baseA and freeze them, randomly initialize the remaining layers, and train on dataset B (i.e., with transfer).
•  When the copied layers are fine-tuned instead of frozen, the networks are written B3B+ and A3B+ (see the sketch below).
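As a concrete illustration of these treatments, here is a minimal sketch, not the authors' code: the paper uses the Caffe reference implementation, and torchvision's AlexNet merely stands in for the 8-layer architecture; `make_treatment` and `base_state` are hypothetical names.

```python
# Illustrative PyTorch sketch of the XnY treatments (B3B, A3B, B3B+, A3B+).
# Assumption: torchvision's AlexNet (5 conv + 3 fully connected = 8 weight
# layers) stands in for the paper's Caffe architecture.
import torch
import torchvision.models as models

def weight_layers(net):
    # Conv/Linear modules in input-to-output order (8 for AlexNet).
    return [m for m in net.modules()
            if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]

def make_treatment(base_state, n, fine_tune, num_classes=500):
    """Copy the first n weight layers from a trained base network, leave the
    upper layers randomly initialized, and freeze the copied layers unless
    fine_tune is True."""
    net = models.alexnet(num_classes=num_classes)    # upper layers: random init
    base = models.alexnet(num_classes=num_classes)
    base.load_state_dict(base_state)                 # trained base network
    for copied, source in zip(weight_layers(net)[:n], weight_layers(base)[:n]):
        copied.load_state_dict(source.state_dict())  # copy layers 1..n
        if not fine_tune:                            # "frozen" treatment
            for p in copied.parameters():
                p.requires_grad = False
    return net

# Example: A3B (frozen) vs. A3B+ (fine-tuned), then train on dataset B, e.g.
#   a3b = make_treatment(baseA_state, n=3, fine_tune=False)
#   opt = torch.optim.SGD((p for p in a3b.parameters() if p.requires_grad), lr=0.01)
```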


In Figure 1, layers marked with a lock are frozen; unlocked layers are free to fine-tune.

These treatments are run for every n and for transfer in both directions (AnB and BnA).

Model setup: what the results can tell us
•  If A3B performs about as well as baseB, the features up to layer 3 are general with respect to task B.
•  If A3B performs worse than baseB, the features up to layer 3 are specific to task A.


Datasets
•  The dataset is ImageNet [Deng et al., 2009]: 1000 classes, 1,281,167 training images, 50,000 validation images.
•  To train A and B, the data is split into two halves of 500 classes each (about 645,000 images each).
•  Similar split: ImageNet has a hierarchical structure in which similar classes form clusters, so the classes are split randomly but roughly half-and-half within each cluster (e.g., the 13 feline classes are split randomly into 6 and 7); a simplified sketch follows below.
•  Dissimilar split: using the hierarchy, the classes are split into man-made (551 classes) and natural (449 classes).
•  Transfer learning is expected to work better the more similar the datasets are (examined in a later experiment).
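For the random A/B split itself, a rough sketch might look like this (a hypothetical helper; the paper additionally balances the split within each cluster of similar classes, which this simplified version omits by just shuffling all 1000 class ids):

```python
# Simplified sketch of a random A/B split of the 1000 ImageNet classes.
# The paper balances the split within clusters of similar classes (using the
# ImageNet hierarchy); that step is omitted here.
import random

def random_ab_split(num_classes=1000, seed=0):
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)
    half = num_classes // 2
    return sorted(classes[:half]), sorted(classes[half:])

classes_a, classes_b = random_ab_split()
print(len(classes_a), len(classes_b))  # -> 500 500
```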

Implementation
•  The implementation uses Caffe [Jia et al., 2014].
•  Since this study does not aim for the best possible performance, the widely used Caffe reference implementation is adopted so that the results are comparable.
•  Detailed settings and parameter files are published at http://yosinski.com/transfer.
•  The figures in the paper can be reproduced with the IPython notebooks provided there.

Experiment 1: similar datasets (random A/B splits)

[Figure 2, top panel: top-1 accuracy (higher is better) vs. the layer n at which the network is chopped and retrained; markers: baseB, selffer BnB, selffer BnB+, transfer AnB, transfer AnB+]

Figure 2: The results from this paper's main experiment. Top: Each marker in the figure represents the average accuracy over the validation set for a trained network. The white circles above n = 0 represent the accuracy of baseB. There are eight points, because we tested on four separate random A/B splits. Each dark blue dot represents a BnB network. Light blue points represent BnB+ networks, or fine-tuned versions of BnB. Dark red diamonds are AnB networks, and light red diamonds are the fine-tuned AnB+ versions. Points are shifted slightly left or right for visual clarity. Bottom: Lines connecting the means of each treatment. Numbered descriptions above each line refer to which interpretation from Section 4.1 applies.

4.1 Similar Datasets: Random A/B splits

The results of all A/B transfer learning experiments on randomly split (i.e. similar) datasets are shown³ in Figure 2. The results yield many different conclusions. In each of the following interpretations, we compare the performance to the base case (white circles and dotted line in Figure 2).

³AnA networks and BnB networks are statistically equivalent, because in both cases a network is trained on 500 random classes. To simplify notation we label these BnB networks. Similarly, we have aggregated the statistically identical BnA and AnB networks and just call them AnB.


•  Experimental results: see Figure 2 (top-1 accuracy vs. the layer n at which the network is chopped and retrained).


Notes on the evaluation
•  Each marker is the average top-1 accuracy on the validation set (sketch below).
•  A and B come from four separate random splits (hence the eight base points).
•  AnA and BnB networks are statistically equivalent (both are trained on 500 random classes), so both are written BnB; likewise BnA and AnB are aggregated and written AnB.
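The metric plotted in Figure 2 is the average top-1 accuracy on the validation set; a hedged sketch of how it could be computed is below (`model` and `val_loader` are assumed to exist and are not part of the paper's released code).

```python
# Sketch: top-1 accuracy of a classifier over a validation loader.
import torch

@torch.no_grad()
def top1_accuracy(model, val_loader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    for images, labels in val_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)        # predicted class per image
        correct += (preds == labels).sum().item()  # count correct predictions
        total += labels.numel()
    return correct / total
```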

Interpretation of Experiment 1

Interpretation 1
•  The top-1 error of baseB is 37.5%, lower than the 42.5% error of the 1000-class network.
•  This is because choosing among 500 classes is easier than choosing among 1000; the task is roughly half the size.


Interpretation of Experiment 1

Interpretation 2
•  Strangely, BnB accuracy drops around layers 4 and 5 (since the network is retrained on the same dataset B, there should in principle be no drop).
•  This is evidence that the original network contains fragilely co-adapted features around those layers: features in neighboring layers interact in complex, fragile ways that cannot be rediscovered by backprop through the upper layers alone once the lower layers are frozen; the layers would need to be trained together.
•  Accuracy gradually recovers at layers 6 and 7, presumably because toward the top of the network there is less co-adaptation left to relearn and gradient descent can still find a good solution.


Interpretation of Experiment 1

Interpretation 3
•  Looking at BnB+, every layer reaches roughly the same accuracy as baseB.
•  → Fine-tuning recovers the performance drop observed with BnB.


Interpretation of Experiment 1

Interpretation 4
•  Looking at AnB, accuracy barely drops for layers 1 and 2, so transfer works well there (evidence that not only layer 1 but also layer 2 features are general).
•  At layer 3 accuracy drops slightly, and at layers 4-7 it drops considerably. From Interpretation 2, the drop at layers 3-4 can be attributed to lost co-adaptation, while at layers 5-7 it is not lost co-adaptation but the features having become specific.


Interpretation of Experiment 1

Interpretation 5
•  Looking at AnB+, accuracy is better even than baseB and BnB+.
•  Until now, transfer learning was thought to help mainly by preventing overfitting when target-task data is scarce.
•  These results show that even with plenty of data, initializing with transferred features improves generalization. That the boost is not simply due to longer total training time is clear from the comparison with BnB+.
•  Even after fine-tuning, an effect of the base task remains.
•  The generalization boost does not strongly depend on which layers are kept; keeping more layers helps slightly (table below).

Table 1: Performance boost of AnB+ over controls, averaged over different ranges of layers.

layers aggregated | mean boost over baseB | mean boost over selffer BnB+
1-7               | 1.6%                  | 1.4%
3-7               | 1.8%                  | 1.4%
5-7               | 2.1%                  | 1.7%

4.2 Dissimilar Datasets: Splitting Man-made and Natural Classes Into Separate Datasets

As mentioned previously, the effectiveness of feature transfer is expected to decline as the base and target tasks become less similar. We test this hypothesis by comparing transfer performance on similar datasets (the random A/B splits discussed above) to that on dissimilar datasets, created by assigning man-made object classes to A and natural object classes to B. This man-made/natural split creates datasets as dissimilar as possible within the ImageNet dataset.

The upper-left subplot of Figure 3 shows the accuracy of a baseA and baseB network (white circles) and BnA and AnB networks (orange hexagons). Lines join common target tasks. The upper of the two lines contains those networks trained toward the target task containing natural categories (baseB and AnB). These networks perform better than those trained toward the man-made categories, which may be due to having only 449 classes instead of 551, or simply being an easier task, or both.

4.3 Random Weights

We also compare to random, untrained weights because Jarrett et al. (2009) showed — quite strikingly — that the combination of random convolutional filters, rectification, pooling, and local normalization can work almost as well as learned features. They reported this result on relatively small networks of two or three learned layers and on the smaller Caltech-101 dataset (Fei-Fei et al., 2004). It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset.

The upper-right subplot of Figure 3 shows the accuracy obtained when using random filters for the first n layers for various choices of n. Performance falls off quickly in layers 1 and 2, and then drops to near-chance levels for layers 3+, which suggests that getting random weights to work in convolutional neural networks may not be as straightforward as it was for the smaller network size and smaller dataset used by Jarrett et al. (2009). However, the comparison is not straightforward. Whereas our networks have max pooling and local normalization on layers 1 and 2, just as Jarrett et al. (2009) did, we use a different nonlinearity (relu(x) instead of abs(tanh(x))), different layer sizes and number of layers, as well as other differences. Additionally, their experiment only considered two layers of random weights. The hyperparameter and architectural choices of our network collectively provide one new datapoint, but it may well be possible to tweak layer sizes and random initialization details to enable much better performance for random weights.⁵

The bottom subplot of Figure 3 shows the results of the experiments of the previous two sections after subtracting the performance of their individual base cases. These normalized performances are plotted across the number of layers n that are either random or were trained on a different, base dataset. This comparison makes two things apparent. First, the transferability gap when using frozen features grows more quickly as n increases for dissimilar tasks (hexagons) than similar tasks (diamonds), with a drop by the final layer for similar tasks of only 8% vs. 25% for dissimilar tasks. Second, transferring even from a distant task is better than using random filters. One possible reason this latter result may differ from Jarrett et al. (2009) is because their fully-trained (non-random) networks were overfitting more on the smaller Caltech-101 dataset than ours on the larger ImageNet dataset, making their random filters perform better by comparison. In the supplementary material, we provide an extra experiment indicating the extent to which our networks are overfit.

…informative, however, because the performance at each layer is based on different random draws of the upper layer initialization weights. Thus, the fact that layers 5, 6, and 7 result in almost identical performance across random draws suggests that multiple runs at a given layer would result in similar performance.

⁵For example, the training loss of the network with three random layers failed to converge, producing only chance-level validation performance. Much better convergence may be possible with different hyperparameters.


Summary of the Experiment 1 interpretations

[Figure 2, bottom panel: top-1 accuracy (higher is better) vs. the layer n at which the network is chopped and retrained, with annotations:
2: Performance drops due to fragile co-adaptation
3: Fine-tuning recovers co-adapted interactions
4: Performance drops due to representation specificity
5: Transfer + fine-tuning improves generalization]


Experiment 2: dissimilar datasets
•  Experimental results:

[Figure 3, top left: man-made/natural split; top-1 accuracy vs. layer n at which the network is chopped and retrained]

Figure 3: Performance degradation vs. layer. Top left: Degradation when transferring between dissimilar tasks (from man-made classes of ImageNet to natural classes or vice versa). The upper line connects networks trained to the "natural" target task, and the lower line connects those trained toward the "man-made" target task. Top right: Performance when the first n layers consist of random, untrained weights. Bottom: The top two plots compared to the random A/B split from Section 4.1 (red diamonds), all normalized by subtracting their base level performance.


•  baseA is trained on the man-made classes, baseB on the natural classes.
•  White circles: accuracy of baseA (lower line) and baseB (upper line).
•  Orange hexagons: accuracy of BnA (lower line) and AnB (upper line).

•  Interpretation: the networks with the natural target task reach higher accuracy than those with the man-made target task.
•  natural (449 classes) has fewer classes than man-made (551 classes),
•  or natural may simply be an easier task; either explanation (or both) is possible.

Experiment 3: random weights
•  According to [Jarrett et al. 2009], random convolutional filters combined with rectification, pooling, and local normalization can perform almost as well as learned features.
•  However, that result was on small networks (two or three learned layers) and on the small Caltech-101 dataset.

→ Does the same hold for a deeper network trained on a much larger dataset? (A sketch of this control follows below.)
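A hedged sketch of how such a random-weights control could be set up (again with torchvision's AlexNet as a stand-in, not the authors' Caffe setup): the first n weight layers are simply left at their random initialization and frozen, and only the upper layers are trained on the target dataset.

```python
# Sketch of the random-weights control: freeze the first n randomly
# initialized weight layers; only the remaining layers will be trained.
import torch
import torchvision.models as models

def random_frozen_net(n, num_classes=500):
    net = models.alexnet(num_classes=num_classes)   # every layer starts random
    layers = [m for m in net.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    for layer in layers[:n]:                        # freeze layers 1..n
        for p in layer.parameters():
            p.requires_grad = False
    return net
```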

Results of Experiment 3
•  Interpretation: accuracy already falls at layers 1 and 2 and drops sharply from layer 3 onward.
•  Random weights are not as easy to make work as in the small-network / small-dataset setting.
•  Note, however, that the setup differs from [Jarrett et al. 2009] in many ways, so a direct comparison is not possible.

[Figure 3, top right: random, untrained filters; top-1 accuracy vs. layer n]


Comparison of the results of Experiments 1, 2, and 3

[Figure 3, bottom: relative top-1 accuracy (higher is better) vs. layer n at which the network is chopped and retrained; lines: reference, mean AnB (random splits), mean AnB (m/n split), random features]
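For this comparison, each curve is normalized by subtracting its own base-case performance. A minimal sketch of that normalization, with hypothetical data structures and made-up numbers purely for illustration:

```python
# Sketch: express each treatment's per-layer accuracy relative to its own
# base-case accuracy, as in the bottom panel of Figure 3.
def relative_accuracy(per_layer_acc, base_acc):
    """per_layer_acc: {n: top-1 accuracy}; base_acc: accuracy of the base net."""
    return {n: acc - base_acc for n, acc in per_layer_acc.items()}

# Example with made-up numbers (not results from the paper):
anb = {1: 0.62, 2: 0.61, 5: 0.55}
print(relative_accuracy(anb, base_acc=0.625))
```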


5 Conclusions

We have demonstrated a method for quantifying the transferability of features from each layer of a neural network, which reveals their generality or specificity. We showed how transferability is negatively affected by two distinct issues: optimization difficulties related to splitting networks in the middle of fragilely co-adapted layers and the specialization of higher layer features to the original task at the expense of performance on the target task. We observed that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also quantified how the transferability gap grows as the distance between tasks increases, particularly when transferring higher layers, but found that even features transferred from distant tasks are better than random weights. Finally, we found that initializing with transferred features can improve generalization performance even after substantial fine-tuning on a new task, which could be a generally useful technique for improving deep neural network performance.


Interpretation
•  The deeper the layer, the larger the gap between the similar and dissimilar datasets.
•  Even for the dissimilar datasets, transferring features gives better results than random weights.
•  The difference from [Jarrett et al. 2009] is presumably that their fully trained networks overfit the small Caltech-101 dataset, which made random filters look comparatively good there.

Summary
This work proposes a method for quantifying which layers of a NN can be transferred, i.e., at which layers features are general or specific. The experiments showed the following:
•  When transferring with frozen layers, performance drops for two distinct reasons: splitting fragilely co-adapted layers, and transferring features that are too specific.
•  As the distance between tasks grows, the performance after transfer decreases.
•  However, compared with random weights, transferring even from a dissimilar task gives better results.
•  Transferring and then fine-tuning improved generalization at every layer.

Impressions
•  The experiments and analysis look like a huge amount of work.
•  The fine details toward the end (supplementary material) are interesting; it mentions something like 9.5 days of GPU time.

•  In DL research, "we tried this experiment" style studies like this one are valuable; as networks and datasets get larger, more and more about their behavior is not understood, so papers like this will probably keep appearing.
•  Releasing the source code matters: DL setups keep getting more complicated, so for large-scale experiments like these, having the code publicly available is very welcome.
•  Hyperparameters and other details could never all fit in the paper anyway.

References etc.

Transfer learning:
•  R. Caruana, "Multitask learning," Mach. Learn., vol. 28(1), pp. 41-75, 1997.
•  S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345-1359, 2010.

Transfer learning and deep learning:
•  Y. Bengio, "Deep Learning of Representations for Unsupervised and Transfer Learning," JMLR Work. Conf. Proc., vol. 7, pp. 1-20, 2011.

Other:
•  Implementation for the experiments: http://yosinski.com/transfer
•  CVPR 2012 Tutorial: Deep Learning Methods for Vision (source of the feature images): http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf