(DL輪読) Matching Networks for One Shot Learning

Matching Networks for One Shot Learning (Masahiro Suzuki)

Upload: masahiro-suzuki

Post on 13-Apr-2017


TRANSCRIPT

Page 1: (DL輪読)Matching Networks for One Shot Learning

Matching Networks for One Shot Learning

Masahiro Suzuki

Page 2: (DL輪読)Matching Networks for One Shot Learning

About this paper ¤ A paper from DeepMind

¤ Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra

¤ 2016/06/13 arXiv

¤ A method for one-shot learning ¤ Proposes a model called Matching Nets; achieves state-of-the-art results ¤ A very simple idea, though some details are not clearly spelled out in the paper

¤ Along the way, this talk also reviews the history and background of one-shot learning

Page 3: (DL輪読)Matching Networks for One Shot Learning

What is one-shot learning?

Page 4: (DL輪読)Matching Networks for One Shot Learning

One-shot learning ¤ What is one-shot learning?

¤ A learning problem in which only one labeled example is available for the task we want to solve ¤ Trivial for humans, but difficult as a machine learning problem

¤ A way of thinking that sits at the opposite extreme from the standard deep learning approach ¤ Deep learning presupposes a large number of training examples per class ¤ An important learning problem for realizing human-like AI

¤ Notable researchers in one-shot learning ¤ Li Fei-Fei (proposed the problem setting), Brenden Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum, among others

One shot learning of simple visual concepts

Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum

Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Abstract

People can learn visual concepts from just one encounter, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, but there have been few attempts to test this hypothesis on a large scale. This paper works in the domain of handwritten characters, which contain a rich component structure of strokes. We introduce a generative model of how characters are composed from strokes, and how knowledge from previous characters helps to infer the latent strokes in novel characters. After comparing several models and humans on one shot character learning, we find that our stroke model outperforms a state-of-the-art character model by a large margin, and it provides a closer fit to human perceptual data.

Keywords: category learning; transfer learning; Bayesian modeling; neural networks

A hallmark of human cognition is learning from just a few examples. For instance, a person only needs to see one Segway to acquire the concept and be able to discriminate future Segways from other vehicles like scooters and unicycles (Fig. 1 left). Similarly, children can acquire a new word from one encounter (Carey & Bartlett, 1978). How is one shot learning possible?

New concepts are almost never learned in a vacuum. Experience with other, more familiar concepts in a domain can support more rapid learning of novel concepts by showing the learner what aspects of objects matter for generalization. Many authors have suggested this as a route to one shot learning: transfer of abstract knowledge from old to new concepts, often called transfer learning, representation learning, or learning to learn. But what is the nature of the learned abstract knowledge, the learned representational capacities, that lets humans learn new object concepts so quickly?

The most straightforward proposals invoke attentional learning (Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002) or overhypotheses (Kemp, Perfors, & Tenenbaum, 2007; Dewar & Xu, in press), like the shape bias in word learning. Given several dimensions along which objects may be similar, prior experience with concepts that are clearly organized along one dimension (e.g., shape, as opposed to color or material) draws a learner's attention to that same dimension (Smith et al., 2002), or increases the prior probability of new concepts concentrating on that same dimension, in a hierarchical Bayesian model of overhypothesis learning (Kemp et al., 2007). But this approach is limited since it requires that relevant dimensions of similarity be defined in advance.

Where are the others?

Figure 1: Test yourself on one shot learning. From the example boxed in red, can you find the others in the grid below? On the left is a Segway and on the right is the first character of the Bengali alphabet.

Answer for the Bengali character: Row 2, Column 2; Row 3, Column 4.

Figure 2: Examples from a new 1600 character database.

In contrast, for many interesting, real-world concepts, the relevant dimensions of similarity may be constructed in the course of learning to learn. For instance, when we first see a Segway, we may parse it into a structure of familiar parts arranged in a novel configuration: it has two wheels, connected by a platform, supporting a motor and a central post at the top of which are two handlebars. Such parts and their relations comprise a useful representational basis for many different vehicle and artifact concepts more generally – a representation that is likely learned in the course of learning the many different object concepts that they support. Several papers from the recent machine learning and computer vision literature argue for such an approach: joint learning of many concepts and a high-level part vocabulary that underlies those concepts (e.g., Torralba, Murphy, & Freeman, 2007; Fei-Fei, Fergus, & Perona, 2006). Another recently popular machine learning approach is based on deep learning (LeCun, Bottou, Bengio, & Haffner, 1998; Salakhutdinov & Hinton, 2009): unsupervised learning

[Lake+ 2011] One shot learning of simple visual concepts

Page 5: (DL輪読)Matching Networks for One Shot Learning

Difference from zero-shot learning ¤ Zero-shot learning is the setting where no data at all is available for the classes we want to classify ¤ Since classification is impossible as-is, some kind of side information must be used (e.g., a mapping to semantic attributes)

¤ One-shot learning uses no side information; instead it exploits a large amount of data from classes other than the target classes (details later)

[Socher+ 2013] Zero-Shot Learning Through Cross-Modal Transfer

Page 6: (DL輪読)Matching Networks for One Shot Learning

Difference from transfer learning ¤ Transfer learning is a framework for improving learning efficiency by reusing knowledge from a learning problem different from the one currently being learned

¤ One-shot learning can be viewed as one kind of inductive transfer learning ¤ Inductive transfer learning: the target task differs from the source task ¤ In particular, it is the case where labeled examples of the target task are extremely scarce

[Pan+ 2010] A Survey on Transfer Learning

Page 7: (DL輪読)Matching Networks for One Shot Learning

Reference: a taxonomy of transfer learning ¤ Domains vs. tasks

¤ Taxonomy compiled from [Pan+ 2010] and a Japanese survey from 2010 ¤ Inductive transfer learning (same domain, different tasks)

¤ Multi-task learning (labels available in both tasks) ¤ Self-taught learning (no labels in the source task) ¤ Zero-shot learning (no labels in the target task)

¤ Transductive transfer learning (same task, different domains) ¤ Domain adaptation (feature distributions differ) ¤ Heterogeneous transfer (feature representations differ) ¤ Covariate shift (input distributions differ)

¤ Unsupervised transfer learning (both the task and the domain differ)

[Diagram: a domain is characterized by the data distribution or representation (source domain vs. target domain); a task by the classification being learned (source task vs. target task).]

Page 8: (DL輪読)Matching Networks for One Shot Learning

Prior work on one-shot learning

Page 9: (DL輪読)Matching Networks for One Shot Learning

Approaches to one-shot learning (other than deep learning)

¤ First proposed by Fei-Fei et al. [Fei-Fei+ 2006] ¤ A Bayesian approach (details omitted here) ¤ Work on zero-shot learning came later than this [Larochelle+ 2008]

¤ Hierarchical Bayesian Program Learning (HBPL) [Lake+ 2011; 2012; 2013; 2015] ¤ Can classify, generate, and so on from a single example ¤ Showed that such human-like abilities are achievable by carefully building a hierarchical Bayesian generative model (though the model is quite hand-crafted and ad hoc)

for each subpart. Last, parts are roughly positioned to begin either independently, at the beginning, at the end, or along previous parts, as defined by relation Ri (Fig. 3A, iv).

Character tokens θ(m) are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories S(m). Second, the trajectory's precise start location L(m) is sampled from the schematic provided by its relation Ri to previous strokes. Third, global transformations are sampled, including an affine warp A(m) and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image I(m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities.

Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I(m). Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization and local search, forming a discrete approximation to the posterior distribution P(ψ, θ(m)|I(m)) (section S3). Figure 4A shows the set of discovered programs for a training image I(1) and how they are refit to different test images I(2) to compute a classification score log P(I(2)|I(1)) (the log posterior predictive probability), where higher scores indicate that they are more likely to belong to the same class. A high score is achieved when at least one set of parts and relations can successfully explain both the training and the test images, without violating the soft constraints of the learned within-class variability model. Figure 4B compares the model's best-scoring parses with the ground-truth human parses for several characters.

Results

People, BPL, and alternative models were compared side by side on five concept learning tasks that examine different forms of generalization from just one or a few examples (example task, Fig. 5). All behavioral experiments were run through Amazon's Mechanical Turk, and the experimental procedures are detailed in section S5. The main results are summarized by Fig. 6, and additional lesion analyses and controls are reported in section S6.

One-shot classification was evaluated through a series of within-alphabet classification tasks for 10 different alphabets. As illustrated in Fig. 1B, i, a single image of a new character was presented, and participants selected another example of that same character from a set of 20 distinct characters produced by a typical drawer of that alphabet. Performance is shown in Fig. 6A, where chance is 95% errors. As a baseline, the modified Hausdorff distance (32) was computed between centered images, producing 38.8% errors. People were skilled one-shot learners, achieving an average error rate of 4.5% (N = 40). BPL showed a similar error rate of 3.3%, achieving better performance than a deep convolutional network (convnet; 13.5% errors) and the HD model (34.8%), each adapted from deep learning methods that have performed well on a range of computer vision tasks. A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors (33), still about twice as high as humans or our model. BPL's advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here. BPL's other key ingredients also make positive contributions, as shown by higher error rates for BPL lesions without learning to learn (token-level only) or compositionality (11.0% errors and 14.0%, respectively). Learning to learn was studied separately at the type and token level by disrupting the learned hyperparameters of the generative model. Compositionality was evaluated by comparing BPL to a matched model that allowed just one spline-based stroke, resembling earlier analysis-by-synthesis models for handwritten characters that were similarly limited (34, 35).

The human capacity for one-shot learning is more than just classification. It can include a suite of abilities, such as generating new examples of a concept. We compared the creative outputs produced by humans and machines through "visual Turing tests," where naive human judges tried to identify the machine, given paired examples of human and machine behavior. In our most basic task, judges compared the drawings from nine humans asked to produce a new instance of a concept given one example with nine new examples drawn by BPL (Fig. 5). We evaluated each model based on the accuracy of the judges, which we call their identification (ID) level: Ideal model performance is 50% ID level, indicating that they cannot distinguish the model's behavior from humans; worst-case performance is 100%. Each judge (N = 147) completed 49 trials with blocked feedback, and judges were analyzed individually and in aggregate. The results are shown in Fig. 6B (new exemplars). Judges had only a 52% ID level on average for discriminating human versus BPL behavior. As a group, this performance was barely better than chance [t(47) = 2.03, P = 0.048], and only 3 of 48 judges had an ID level reliably above chance. Three lesioned models were evaluated by different groups of judges in separate

SCIENCE sciencemag.org 11 DECEMBER 2015 • VOL 350 ISSUE 6266 1335


Human or Machine?

Fig. 5. Generating new exemplars. Humans and machines were given an image of a novel character (top) and asked to produce new exemplars. The nine-character grids in each pair that were generated by a machine are (by row) 1, 2; 2, 1; 1, 1.

RESEARCH | RESEARCH ARTICLES

the pen (Fig. 3A, ii). To construct a new character type, first the model samples the number of parts k and the number of subparts ni, for each part i = 1, ..., k, from their empirical distributions as measured from the background set. Second, a template for a part Si is constructed by sampling subparts from a set of discrete primitive actions learned from the background set (Fig. 3A, i), such that the probability of the next action depends on the previous. Third, parts are then grounded as parameterized curves (splines) by sampling the control points and scale parameters


Fig. 3. A generative model of handwritten characters. (A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types ψ and new token images I(m) for m = 1, ..., M. The function f(·, ·) transforms a subpart sequence and start location into a trajectory.

[Figure 4 panels: nine human drawings with their human parses and machine parses; a training item with the model's five best parses and their log-probability scores; test items; stroke order numbered 1 through 5.]

Fig. 4. Inferring motor programs from images. Parts are distinguished by color, with a colored dot indicating the beginning of a stroke and an arrowhead indicating the end. (A) The top row shows the five best programs discovered for an image along with their log-probability scores (Eq. 1). Subpart breaks are shown as black dots. For classification, each program was refit to three new test images (left in image triplets), and the best-fitting parse (top right) is shown with its image reconstruction (bottom right) and classification score (log posterior predictive probability). The correctly matching test item receives a much higher classification score and is also more cleanly reconstructed by the best programs induced from the training item. (B) Nine human drawings of three characters (left) are shown with their ground truth parses (middle) and best model parses (right).


Page 10: (DL輪読)Matching Networks for One Shot Learning

One-shot learning with deep learning ¤ As deep-learning-based approaches, mainly the following have been proposed: 1. Metric learning 2. Memory networks 3. A combination of memory networks and metric learning (this paper)

Page 11: (DL輪読)Matching Networks for One Shot Learning

One-shot learning via metric learning

¤ Metric learning ¤ Methods that learn a distance such as a Mahalanobis distance ¤ Training data: pairs of examples together with the distance (relation) between them

¤ Applying it to one-shot learning ¤ Learn the distance (same or different) from the training examples outside the one-shot classes ¤ Compute the distances between a test example and all examples, including the one-shot examples, and assign the class of the nearest example

Siamese Neural Networks for One-shot Image Recognition
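The Mahalanobis-style distance mentioned above can be sketched in a few lines; the matrix L here is a hypothetical learned parameter, not something from the deck:

```python
import numpy as np

def mahalanobis(x, y, L):
    # Mahalanobis distance parameterized as d(x, y) = ||L(x - y)||_2,
    # so that M = L^T L is positive semi-definite by construction.
    diff = L @ (x - y)
    return float(np.sqrt(diff @ diff))

# With L = I this reduces to the ordinary Euclidean distance.
x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])
assert np.isclose(mahalanobis(x, y, np.eye(2)), 1.0)

# A learned L can stretch or shrink directions that matter for "same class or not".
L = np.diag([2.0, 0.5])
print(mahalanobis(x, y, L))  # the first axis is weighted twice as heavily
```

Metric learning then fits L so that same-class pairs are close and different-class pairs are far under this distance.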

Figure 2. Our general strategy. 1) Train a model to discriminatebetween a collection of same/different pairs. 2) Generalize toevaluate new categories based on learned feature mappings forverification.

ing framework, which uses many layers of non-linearities to capture invariances to transformation in the input space, usually by leveraging a model with many parameters and then using a large amount of data to prevent overfitting (Bengio, 2009; Hinton et al., 2006). These features are very powerful because we are able to learn them without imposing strong priors, although the cost of the learning algorithm itself may be considerable.

1. Approach

In general, we learn image representations via a supervised metric-based approach with siamese neural networks, then reuse that network's features for one-shot learning without any retraining.

In our experiments, we restrict our attention to character recognition, although the basic approach can be replicated for almost any modality (Figure 2). For this domain, we employ large siamese convolutional neural networks which a) are capable of learning generic image features useful for making predictions about unknown class distributions even when very few examples from these new distributions are available; b) are easily trained using standard optimization techniques on pairs sampled from the source data; and c) provide a competitive approach that does not rely upon domain-specific knowledge by instead exploiting deep learning techniques.

To develop a model for one-shot image classification, we aim to first learn a neural network that can discriminate between the class-identity of image pairs, which is the standard verification task for image recognition. We hypothesize that networks which do well at verification

should generalize to one-shot classification. The verification model learns to identify input pairs according to the probability that they belong to the same class or different classes. This model can then be used to evaluate new images, exactly one per novel class, in a pairwise manner against the test image. The pairing with the highest score according to the verification network is then awarded the highest probability for the one-shot task. If the features learned by the verification model are sufficient to confirm or deny the identity of characters from one set of alphabets, then they ought to be sufficient for other alphabets, provided that the model has been exposed to a variety of alphabets to encourage variance amongst the learned features.
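The pairwise evaluation described above is easy to sketch. This is a minimal illustration with a stand-in similarity function; in the actual method, `verify` would be the trained siamese verification network:

```python
import numpy as np

def verify(a, b):
    # Stand-in for the trained verification network: returns a score for the
    # probability that a and b belong to the same class (here: a toy similarity).
    return float(np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b))))

def one_shot_classify(test_image, support_set):
    # support_set: {class_label: single_example}, exactly one example per novel
    # class. The pairing with the highest verification score wins.
    scores = {label: verify(test_image, ex) for label, ex in support_set.items()}
    return max(scores, key=scores.get)

support = {"alpha": [0.0, 0.0], "beta": [5.0, 5.0]}
print(one_shot_classify([0.2, -0.1], support))  # -> alpha
```

The labels and vectors are illustrative; the point is only the argmax over pairwise verification scores.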

2. Related Work

Overall, research into one-shot learning algorithms is fairly immature and has received limited attention by the machine learning community. There are nevertheless a few key lines of work which precede this paper.

The seminal work towards one-shot learning dates back to the early 2000's with work by Li Fei-Fei et al. The authors developed a variational Bayesian framework for one-shot image classification using the premise that previously learned classes can be leveraged to help forecast future ones when very few examples are available from a given class (Fe-Fei et al., 2003; Fei-Fei et al., 2006). More recently, Lake et al. approached the problem of one-shot learning from the point of view of cognitive science, addressing one-shot learning for character recognition with a method called Hierarchical Bayesian Program Learning (HBPL) (2013). In a series of several papers, the authors modeled the process of drawing characters generatively to decompose the image into small pieces (Lake et al., 2011; 2012). The goal of HBPL is to determine a structural explanation for the observed pixels. However, inference under HBPL is difficult since the joint parameter space is very large, leading to an intractable integration problem.

Some researchers have considered other modalities or transfer learning approaches. Lake et al. have some very recent work which uses a generative Hierarchical Hidden Markov model for speech primitives combined with a Bayesian inference procedure to recognize new words by unknown speakers (2014). Maas and Kemp have some of the only published work using Bayesian networks to predict attributes for Ellis Island passenger data (2009). Wu and Dennis address one-shot learning in the context of path planning algorithms for robotic actuation (2012). Lim focuses on how to "borrow" examples from other classes in the training set by adapting a measure of how much each category should be weighted by each training exemplar in the loss function (2012). This idea can be useful for data


Page 12: (DL輪読)Matching Networks for One Shot Learning

Applying Siamese Networks ¤ One-shot learning using Siamese Networks [Koch+ 2015]

¤ Siamese Network [Bromley+ 1993] ¤ Apply a network (with shared weights) to each element of the pair ¤ Compute the distance between the two outputs and train it against the training pair's label (same class or not)


Figure 3. A simple 2 hidden layer siamese network for binary classification with logistic prediction p. The structure of the network is replicated across the top and bottom sections to form twin networks, with shared weight matrices at each layer.

sets where very few examples exist for some classes, providing a flexible and continuous means of incorporating inter-class information into the model.

3. Deep Siamese Networks for Image Verification

Siamese nets were first introduced in the early 1990s by Bromley and LeCun to solve signature verification as an image matching problem (Bromley et al., 1993). A siamese neural network consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest-level feature representation on each side (Figure 3). The parameters between the twin networks are tied. Weight tying guarantees that two extremely similar images could not possibly be mapped by their respective networks to very different locations in feature space because each network computes the same function. Also, the network is symmetric, so that whenever we present two distinct images to the twin networks, the top conjoining layer will compute the same metric as if we were to present the same two images but to the opposite twins.

In LeCun et al., the authors used a contrastive energy function which contained dual terms to decrease the energy of like pairs and increase the energy of unlike pairs (2005). However, in this paper we use the weighted L1 distance between the twin feature vectors h1 and h2 combined with a sigmoid activation, which maps onto the interval [0, 1]. Thus a cross-entropy objective is a natural choice for training the network. Note that in LeCun et al., they directly learned the similarity metric, which was implicitly defined

by the energy loss, whereas we fix the metric as specified above, following the approach in Facebook's DeepFace paper (Taigman et al., 2014).

Our best-performing models use multiple convolutional layers before the fully-connected layers and top-level energy function. Convolutional neural networks have achieved exceptional results in many large-scale computer vision applications, particularly in image recognition tasks (Bengio, 2009; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Srivastava, 2013).

Several factors make convolutional networks especially appealing. Local connectivity can greatly reduce the number of parameters in the model, which inherently provides some form of built-in regularization, although convolutional layers are computationally more expensive than standard nonlinearities. Also, the convolution operation used in these networks has a direct filtering interpretation, where each feature map is convolved against input features to identify patterns as groupings of pixels. Thus, the outputs of each convolutional layer correspond to important spatial features in the original input space and offer some robustness to simple transforms. Finally, very fast CUDA libraries are now available in order to build large convolutional networks without an unacceptable amount of training time (Mnih, 2009; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014).

We now detail both the structure of the siamese nets and the specifics of the learning algorithm used in our experiments.

3.1. Model

Our standard model is a siamese convolutional neural network with L layers, each with N_l units, where h_{1,l} represents the hidden vector in layer l for the first twin, and h_{2,l} denotes the same for the second twin. We use exclusively rectified linear (ReLU) units in the first L-2 layers and sigmoidal units in the remaining layers.

The model consists of a sequence of convolutional layers, each of which uses a single channel with filters of varying size and a fixed stride of 1. The number of convolutional filters is specified as a multiple of 16 to optimize performance. The network applies a ReLU activation function to the output feature maps, optionally followed by max-pooling with a filter size and stride of 2. Thus the kth filter map in each layer takes the following form:

a^(k)_{1,m} = max-pool(max(0, W^(k)_{l-1,l} ⋆ h_{1,(l-1)} + b_l), 2)
a^(k)_{2,m} = max-pool(max(0, W^(k)_{l-1,l} ⋆ h_{2,(l-1)} + b_l), 2)

where W_{l-1,l} is the 3-dimensional tensor representing the feature maps for layer l and we have taken ⋆ to be the valid convolutional operation corresponding to returning

[Presenter's annotation: the weights are shared]


Figure 4. Best convolutional architecture selected for verification task. Siamese twin is not depicted, but joins immediately after the 4096 unit fully-connected layer where the L1 component-wise distance between vectors is computed.

only those output units which were the result of complete overlap between each convolutional filter and the input feature maps.
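The filter-map equation above (valid convolution, then ReLU, then 2x2 max-pooling) can be sketched in plain NumPy; the input and filter values here are illustrative, not from the paper:

```python
import numpy as np

def valid_conv2d(x, w):
    # 'Valid' convolution: keep only positions where the filter fully overlaps x.
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def filter_map(x, w, b):
    # a = max-pool(max(0, W * x + b), 2): ReLU, then 2x2 max-pooling with stride 2.
    a = np.maximum(0.0, valid_conv2d(x, w) + b)
    H, W = a.shape
    return a[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((2, 2)) / 4.0   # a toy 2x2 averaging filter
print(filter_map(x, w, b=0.0).shape)  # -> (2, 2)
```

A 5x5 input with a 2x2 filter gives a 4x4 valid-convolution output, which the 2x2 pooling reduces to 2x2.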

The units in the final convolutional layer are flattened into a single vector. This convolutional layer is followed by a fully-connected layer, and then one more layer computing the induced distance metric between each siamese twin, which is given to a single sigmoidal output unit. More precisely, the prediction vector is given as p = σ(Σ_j α_j |h^(j)_{1,L-1} - h^(j)_{2,L-1}|), where σ is the sigmoidal activation function. This final layer induces a metric on the learned feature space of the (L-1)th hidden layer and scores the similarity between the two feature vectors. The α_j are additional parameters that are learned by the model during training, weighting the importance of the component-wise distance. This defines a final Lth fully-connected layer for the network which joins the two siamese twins.
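The prediction above, a sigmoid over a learned component-wise weighted L1 distance, is concretely (the α values below are illustrative):

```python
import numpy as np

def predict(h1, h2, alpha):
    # p = sigmoid(sum_j alpha_j * |h1_j - h2_j|): a learned, component-wise
    # weighted L1 distance between twin feature vectors, squashed to [0, 1].
    z = np.sum(alpha * np.abs(h1 - h2))
    return 1.0 / (1.0 + np.exp(-z))

h1 = np.array([0.2, 0.9, 0.5])
h2 = np.array([0.2, 0.9, 0.5])
alpha = np.array([1.0, -0.5, 2.0])
print(predict(h1, h2, alpha))  # identical twin features give z = 0, hence p = 0.5
```

Since α_j can be negative, the "metric" is fixed in form but its per-component weighting is learned during training.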

We depict one example above (Figure 4), which shows the largest version of our model that we considered. This network also gave the best result for any network on the verification task.

3.2. Learning

Loss function. Let M represent the minibatch size, where i indexes the ith minibatch. Now let y(x^(i)_1, x^(i)_2) be a length-M vector which contains the labels for the minibatch, where we assume y(x^(i)_1, x^(i)_2) = 1 whenever x_1 and x_2 are from the same character class and y(x^(i)_1, x^(i)_2) = 0 otherwise. We impose a regularized cross-entropy objective on our binary classifier of the following form:

L(x^(i)_1, x^(i)_2) = y(x^(i)_1, x^(i)_2) log p(x^(i)_1, x^(i)_2) + (1 - y(x^(i)_1, x^(i)_2)) log(1 - p(x^(i)_1, x^(i)_2)) + λ^T |w|^2

Optimization. This objective is combined with standard backpropagation algorithm, where the gradient is additive across the twin networks due to the tied weights. We fix a minibatch size of 128 with learning rate η_j, momentum µ_j, and L2 regularization weights λ_j defined layer-wise, so that our update rule at epoch T is as follows:

w^(T)_{kj}(x^(i)_1, x^(i)_2) = w^(T)_{kj} + Δw^(T)_{kj}(x^(i)_1, x^(i)_2) + 2λ_j |w_{kj}|
Δw^(T)_{kj}(x^(i)_1, x^(i)_2) = -η_j ∇w^(T)_{kj} + µ_j Δw^(T-1)_{kj}

where ∇w_{kj} is the partial derivative with respect to the weight between the jth neuron in some layer and the kth neuron in the successive layer.
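A per-weight scalar version of this momentum update can be sketched as follows (the gradient and hyperparameter values are hypothetical; the regularization term is written as printed in the excerpt, though a standard L2 penalty would subtract 2λw instead):

```python
import numpy as np

def momentum_step(w, delta_prev, grad, eta, mu, lam):
    # delta_w(T) = -eta * grad + mu * delta_w(T-1)   (momentum term)
    # w(T)       = w + delta_w(T) + 2 * lam * |w|    (as printed in the excerpt)
    delta = -eta * grad + mu * delta_prev
    w_new = w + delta + 2 * lam * np.abs(w)
    return w_new, delta

w, delta = 1.0, 0.0
for _ in range(3):  # a few steps with a constant gradient of 0.5, lam = 0
    w, delta = momentum_step(w, delta, grad=0.5, eta=0.1, mu=0.5, lam=0.0)
print(w)  # -> 0.7875
```

With momentum, each step accumulates a decaying fraction of the previous update, so the effective step size grows under a consistent gradient direction.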

Weight initialization. We initialized all network weights in the convolutional layers from a normal distribution with zero mean and a standard deviation of 10^-2. Biases were also initialized from a normal distribution, but with mean 0.5 and standard deviation 10^-2. In the fully-connected layers, the biases were initialized in the same way as the convolutional layers, but the weights were drawn from a much wider normal distribution with zero mean and standard deviation 2 × 10^-1.

Learning schedule. Although we allowed for a different learning rate for each layer, learning rates were decayed uniformly across the network by 1 percent per epoch, so that \(\eta_j^{(T)} = 0.99\,\eta_j^{(T-1)}\). We found that by annealing the learning rate, the network was able to converge to local minima more easily without getting stuck in the error surface. We fixed momentum to start at 0.5 in every layer, increasing linearly each epoch until reaching the value \(\mu_j\), the individual momentum term for the jth layer.
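The decay-and-ramp schedule can be sketched as follows; ramp_epochs, the length of the linear momentum ramp, is an assumed parameter (the text does not state it):

```python
def schedule(epoch, lr0, mu_final, ramp_epochs):
    """Per-epoch hyperparameters as described above: the learning rate
    decays by 1% per epoch; momentum ramps linearly from 0.5 up to
    mu_final over ramp_epochs epochs, then stays there."""
    lr = lr0 * (0.99 ** epoch)
    mu = min(mu_final, 0.5 + (mu_final - 0.5) * epoch / ramp_epochs)
    return lr, mu
```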

We trained each network for a maximum of 200 epochs, but monitored one-shot validation error on a set of 320 one-shot learning tasks generated randomly from the alphabets and drawers in the validation set. When the validation error did not decrease for 20 epochs, we stopped and used the parameters of the model at the best epoch according to the one-shot validation error. If the validation error continued to decrease for the entire learning schedule, we saved the final state of the model generated by this procedure.

Hyperparameter optimization. We used the beta version of Whetlab, a Bayesian optimization framework, to select hyperparameters.

Page 13: (DL輪読)Matching Networks for One Shot Learning

Approaches using memory networks: if training examples can be written directly into memory, then even a class seen only once (one-shot) can be handled by storing it and, at test time, retrieving the stored example closest to the test input.

→ the Neural Turing Machine

¤ Neural Turing Machine (NTM) [Graves+ 2014]: a neural network that models an external memory together with heads that can read from and write to it.

Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads from and writes to a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world.


3 Neural Turing Machines

A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank. Figure 1 presents a high-level diagram of the NTM architecture. Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix using selective read and write operations. By analogy to the Turing machine we refer to the network outputs that parametrise these operations as "heads."

Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. We achieved this by defining 'blurry' read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer). The degree of blurriness is determined by an attentional "focus" mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest. Because interaction with the memory is highly sparse, the NTM is biased towards storing data without interference. The memory location brought into attentional focus is determined by specialised outputs emitted by the heads. These outputs define a normalised weighting over the rows in the memory matrix (referred to as memory "locations"). Each weighting, one per read or write head, defines the degree to which the head reads or writes


Page 14: (DL輪読)Matching Networks for One Shot Learning

Memory Augmented Neural Network

¤ Training setup: training examples are presented sequentially in a random order; one such sequence is called an episode.

¤ Within an episode the labels are delayed: the model must answer for each example first, and the correct label is given at the next time step.

¤ By repeating episodes, the model learns to bind new examples to their labels quickly, i.e., it becomes able to do one-shot learning.


Task setup

• This whole sequence of steps is called an episode.
• At the start of an episode, the model can only guess the labels at random.
• Toward the end of an episode, the answer rate rises.
• Accuracy that rises quickly within an episode = good one-shot learning.

We want the model to learn the task "after seeing only a few examples of a character, immediately recognize it." The episode continues for 50 steps.

(Figure taken from Yusuke Watanabe's reading-group slides: http://www.slideshare.net/YusukeWatanabe3/metalearning-with-memory-augmented-neural-network)

Page 15: (DL輪読)Matching Networks for One Shot Learning

Matching Networks for One Shot Learning

Page 16: (DL輪読)Matching Networks for One Shot Learning

Approach of the proposed method: so that the model can do one-shot learning at test time, reproduce the one-shot learning setting during training as well.

¤ The one-shot learning problem setting: perform classification, but only one (or very few) labeled example is available per class.

→ Reproduce this same setting at training time.

Page 17: (DL輪読)Matching Networks for One Shot Learning

Flow of data selection and training. Data are sampled to match the one-shot learning setting:

1. From the dataset, sample one task, i.e., a label set L of a few classes with only a few labeled examples each.

2. From L, sample a support set S and a batch B to predict on.

¤ The goal of one-shot learning is, given \((\hat{x}, \hat{y}) \in B\) and \(S = \{(x_i, y_i)\}_{i=1}^{k}\), to learn the mapping \(\hat{x} \to \hat{y}\) from S; in other words, it suffices to learn \(P(\hat{y} \mid \hat{x}, S)\).

¤ Even when the task changes, only the support set changes (\(S \to S'\)), so predictions adapt to the new task without retraining.

¤ \(P(\hat{y} \mid \hat{x}, S)\) can be modeled and learned as a neural network.

case it may be beneficial to change the function with which we embed \(x_i\) – some evidence of this is discussed in Section 4. We use a bidirectional Long-Short Term Memory (LSTM) [8] to encode \(x_i\) in the context of the support set S, considered as a sequence (see appendix for a more precise definition).

The second issue can be fixed via an LSTM with read-attention over the whole set S, whose inputs are equal to x:

\[ f(x, S) = \mathrm{attLSTM}(f'(x), g(S), K) \]

where \(f'(x)\) are the features (e.g., derived from a CNN) which are input to the LSTM (constant at each time step). K is the fixed number of unrolling steps of the LSTM, and g(S) is the set over which we attend, embedded with g. This allows for the model to potentially ignore some elements in the support set S, and adds "depth" to the computation of attention (see appendix for more details).

2.2 Training Strategy

In the previous subsection we described Matching Networks which map a support set to a classification function, S → c(x). We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form \(P_{\theta}(\cdot \mid \hat{x}, S)\), noting that θ are the parameters of the model (i.e. of the embedding functions f and g described previously).

The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets S′ which contain classes never seen during training.

More specifically, let us define a task T as a distribution over possible label sets L. Typically we consider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set L sampled from a task T, \(L \sim T\), will typically have 5 to 25 examples.

To form an "episode" to compute gradients and update our model, we first sample L from T (e.g., L could be the label set {cats, dogs}). We then use L to sample the support set S and a batch B (i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch B conditioned on the support set S. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:

\[
\theta = \arg\max_{\theta} \; \mathbb{E}_{L \sim T} \left[ \mathbb{E}_{S \sim L,\, B \sim L} \left[ \sum_{(x,y) \in B} \log P_{\theta}(y \mid x, S) \right] \right] \tag{2}
\]

Training θ with eq. 2 yields a model which works well when sampling \(S' \sim T'\) from a different distribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has never seen due to its non-parametric nature. Obviously, as T′ diverges far from the T from which we sampled to learn θ, the model will not work – we belabor this point further in Section 4.1.2.
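The episode construction described above (sample a label set L from the task, then a support set S and a batch B from L) can be sketched as follows; the dataset layout and the default sizes are illustrative assumptions:

```python
import random

def sample_episode(dataset, n_classes=5, k_shot=1, batch_per_class=2):
    """Sample one meta-learning episode.
    dataset: dict mapping class label -> list of examples.
    Returns a support set S (k_shot examples per class) and a batch B."""
    label_set = random.sample(sorted(dataset), n_classes)
    support, batch = [], []
    for label in label_set:
        examples = random.sample(dataset[label], k_shot + batch_per_class)
        support += [(x, label) for x in examples[:k_shot]]
        batch += [(x, label) for x in examples[k_shot:]]
    return support, batch
```

The model is then trained to predict the labels in B conditioned on S, and a fresh episode is drawn for every update.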

3 Related Work

3.1 Memory Augmented Neural Networks

A recent surge of models which go beyond "static" classification of fixed vectors onto their classes has reshaped current research and industrial applications alike. This is most notable in the massive adoption of LSTMs [8] in a variety of tasks such as speech [7], translation [23, 2] or learning programs [4, 27]. A key component which allowed for more expressive models was the introduction of "content" based attention in [2], and "computer-like" architectures such as the Neural Turing Machine [4] or Memory Networks [29]. Our work takes the metalearning paradigm of [21], where an LSTM learnt to learn quickly from data presented sequentially, but we treat the data as a set. The one-shot learning task we defined on the Penn Treebank [15] relates to evaluation techniques and models presented in [6], and we discuss this in Section 4.

3.2 Metric Learning

As discussed in Section 2, there are many links between content based attention, kernel based nearest neighbor and metric learning [1]. The most relevant work is Neighborhood Component Analysis


Page 18: (DL輪読)Matching Networks for One Shot Learning

Matching Networks: the paper proposes Matching Networks as the neural network that learns \(P(\hat{y} \mid \hat{x}, S)\). At test time it can directly perform one-shot learning without any modification (end-to-end).

Figure 1: Matching Networks architecture

train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task.

Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on both ImageNet and small scale language modeling. We hope that our results will encourage others to work on this challenging problem.

We organized the paper by first defining and explaining our model whilst linking its several components to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups.

2 Model

Our non-parametric approach to solving one-shot learning is based on two components which we describe in the following subsections. First, our model architecture follows recent advances in neural networks augmented with memory (as discussed in Section 3). Given a (small) support set S, our model defines a function \(c_S\) (or classifier) for each S, i.e. a mapping \(S \to c_S(\cdot)\). Second, we employ a training strategy which is tailored for one-shot learning from the support set S.

2.1 Model Architecture

In recent years, many groups have investigated ways to augment neural network architectures with external memories and other components that make them more "computer-like". We draw inspiration from models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] and pointer networks [27].

In all these models, a neural attention mechanism, often fully differentiable, is defined to access (or read) a memory matrix which stores useful information to solve the task at hand. Typical uses of this include machine translation, speech recognition, or question answering. More generally, these architectures model P(B|A) where A and/or B can be a sequence (like in seq2seq models), or, more interestingly for us, a set [26].

Our contribution is to cast the problem of one-shot learning within the set-to-set framework [26]. The key point is that when trained, Matching Networks are able to produce sensible test labels for unobserved classes without any changes to the network. More precisely, we wish to map from a (small) support set of k examples of image-label pairs \(S = \{(x_i, y_i)\}_{i=1}^{k}\) to a classifier \(c_S(\hat{x})\) which, given a test example \(\hat{x}\), defines a probability distribution over outputs \(\hat{y}\). We define the mapping \(S \to c_S(\hat{x})\) to be \(P(\hat{y}|\hat{x}, S)\) where P is parameterised by a neural network. Thus, when given a


Support set S

Test example \(\hat{x}\)

Predicted label \(\hat{y}\)

Page 19: (DL輪読)Matching Networks for One Shot Learning

Interpretation of Matching Networks. Formalizing the matching network yields the equation below (eq. 1).

¤ Since a can be viewed as a kernel, (1) can be interpreted as a kernel density estimator.

¤ It can also be interpreted as a nearest-neighbor method.

¤ Viewed through the attention lens, it corresponds to the alignment model in neural machine translation [Bahdanau+ 2016]: a acts as an attention mechanism and the \(y_i\) act as memories bound to the corresponding \(x_i\).

new support set of examples S′ from which to one-shot learn, we simply use the parametric neural network defined by P to make predictions about the appropriate label \(\hat{y}\) for each test example \(\hat{x}\): \(P(\hat{y}|\hat{x}, S')\). In general, our predicted output class for a given input unseen example \(\hat{x}\) and a support set S becomes \(\arg\max_y P(y \mid \hat{x}, S)\).

Our model in its simplest form computes \(\hat{y}\) as follows:

\[
\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \tag{1}
\]

where \(x_i, y_i\) are the samples and labels from the support set \(S = \{(x_i, y_i)\}_{i=1}^{k}\), and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest \(x_i\) from \(\hat{x}\) according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the \(y_i\) act as memories bound to the corresponding \(x_i\). In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. However, unlike other attentional memory mechanisms [2], (1) is non-parametric in nature: as the support set size grows, so does the memory used. Hence the functional form defined by the classifier \(c_S(\hat{x})\) is very flexible and can adapt easily to any new support set.
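Eq. 1, combined with the softmax-over-cosine attention introduced in the next subsection, can be sketched directly on embedding vectors; a minimal pure-Python sketch that assumes the embedding functions f and g have already been applied:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def matching_predict(x_hat, support):
    """Eq. 1: y_hat = sum_i a(x_hat, x_i) y_i, with a the softmax over
    cosine similarities. support is a list of (embedding, one_hot_label)."""
    scores = [math.exp(cosine(x_hat, x_i)) for x_i, _ in support]
    z = sum(scores)
    n_classes = len(support[0][1])
    y_hat = [0.0] * n_classes
    for s, (_, y_i) in zip(scores, support):
        for c in range(n_classes):
            y_hat[c] += (s / z) * y_i[c]
    return y_hat
```

The output is a proper distribution over labels because the attention weights sum to one and each \(y_i\) is one-hot.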

2.1.1 The Attention Kernel

Equation 1 relies on choosing a(·,·), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance c, i.e.,

\[
a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}
\]

with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed \(\hat{x}\) and \(x_i\). In our experiments we shall see examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG [22] or Inception [24]) or a simple form word embedding for language tasks (see Section 4).

We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set S and sample to classify \(\hat{x}\), it is enough for \(\hat{x}\) to be sufficiently aligned with pairs \((x', y') \in S\) such that \(y' = y\) and misaligned with the rest. This kind of loss is also related to methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large margin nearest neighbor [28].

However, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot classification, and thus we expect it to perform better than its counterparts. Additionally, the loss is simple and differentiable so that one can find the optimal parameters in an "end-to-end" fashion.

2.1.2 Full Context Embeddings

The main novelty of our model lies in reinterpreting a well studied framework (neural networks with external memories) to do one-shot learning. Closely related to metric learning, the embedding functions f and g act as a lift to feature space X to achieve maximum accuracy through the classification function described in eq. 1.

Despite the fact that the classification strategy is fully conditioned on the whole support set through \(P(\cdot|\hat{x}, S)\), the embeddings on which we apply the cosine similarity to "attend", "point" or simply compute the nearest neighbor are myopic in the sense that each element \(x_i\) gets embedded by \(g(x_i)\) independently of other elements in the support set S. Furthermore, S should be able to modify how we embed the test image \(\hat{x}\) through f.

We propose embedding the elements of the set through a function which takes as input the full set S in addition to \(x_i\), i.e. g becomes \(g(x_i, S)\). Thus, as a function of the whole support set S, g can modify how to embed \(x_i\). This could be useful when some element \(x_j\) is very close to \(x_i\), in which


Page 20: (DL輪読)Matching Networks for One Shot Learning

Network details:

¤ a: softmax over cosine similarity.

¤ g: embedding of the support-set (training) examples – a bidirectional RNN.

¤ f: embedding of the test example – an attention LSTM.

¤ Here f′ and g′ are neural networks (VGG or Inception).


Appendix

A Model Description

In this section we fully specify the models which condition the embedding functions f and g on the whole support set S. Much previous work has fully described similar mechanisms, which is why we left the precise details for this appendix.

A.1 The Fully Conditional Embedding f

As described in section 2.1.2, the embedding function for an example \(\hat{x}\) in the batch B is as follows:

\[ f(\hat{x}, S) = \mathrm{attLSTM}(f'(\hat{x}), g(S), K) \]

where f′ is a neural network (e.g., VGG or Inception, as described in the main text). We define K to be the number of "processing" steps following work from [26] from their "Process" block. g(S) represents the embedding function g applied to each element \(x_i\) from the set S.

Thus, the state after k processing steps is as follows:

\[
\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \tag{3}
\]
\[
h_k = \hat{h}_k + f'(\hat{x}) \tag{4}
\]
\[
r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \tag{5}
\]
\[
a(h_{k-1}, g(x_i)) = \mathrm{softmax}(h_{k-1}^{T} g(x_i)) \tag{6}
\]

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention, and the softmax in eq. 6 normalizes w.r.t. \(g(x_i)\). The read-out \(r_{k-1}\) from g(S) is concatenated to \(h_{k-1}\). Since we do K steps of "reads", \(\mathrm{attLSTM}(f'(\hat{x}), g(S), K) = h_K\) where \(h_k\) is as described in eq. 3.
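The K-step read loop of eqs. 3–6 can be sketched as follows; a minimal pure-Python sketch in which the LSTM cell of eq. 3 is replaced by a simple mixing stub (a real implementation would use an LSTM whose recurrent state carries h concatenated with the read-out r):

```python
import math

def attention_readout(h, embedded_support):
    """Eqs. 5-6: content-based attention over the embedded support set."""
    logits = [sum(a * b for a, b in zip(h, g_xi)) for g_xi in embedded_support]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(embedded_support[0])
    return [sum((w / z) * g_xi[d] for w, g_xi in zip(weights, embedded_support))
            for d in range(dim)]

def att_fixed_point(f_x, embedded_support, K):
    """K 'processing' steps in the spirit of eqs. 3-4."""
    h = list(f_x)
    for _ in range(K):
        r = attention_readout(h, embedded_support)
        # stub for eq. 3: mix the previous state with the attention read-out
        h_hat = [0.5 * (a + b) for a, b in zip(h, r)]
        # eq. 4: skip connection from f'(x)
        h = [a + b for a, b in zip(h_hat, f_x)]
    return h
```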

A.2 The Fully Conditional Embedding g

In section 2.1.2 we described the encoding function for the elements in the support set S, \(g(x_i, S)\), as a bidirectional LSTM. More precisely, let \(g'(x_i)\) be a neural network (similar to f′ above, e.g. a VGG or Inception model). Then we define \(g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)\) with:

\[
\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})
\]
\[
\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})
\]

where, as above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for \(\overleftarrow{h}\) starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.

B ImageNet Class Splits

Here we define the two class splits used in our full ImageNet experiments – these classes were excluded for training during our one-shot experiments described in section 4.1.2.

\(L_{rand}\) = n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n01688243, n01729977, n01775062, n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n02095314, n02097130, n02097298,


Page 21: (DL輪読)Matching Networks for One Shot Learning

Set-to-set framework. In ordinary seq2seq, the order of the elements in a sequence is taken into account.

¤ In Matching Networks we do not want order to matter: the support set should be treated as a set, not a sequence.

¤ Order Matters: Sequence to sequence for sets [Vinyals+ 2015] (same first author as this paper) trains seq2seq models so that they do not take input order into account.

Published as a conference paper at ICLR 2016

All these empirical findings point to the same story: often for optimization purposes, the order in which input data is shown to the model has an impact on the learning performance.

Note that we can define an ordering which is independent of the input sequence or set X (e.g., always reversing the words in a translation task), but also an ordering which is input dependent (e.g., sorting the input points in the convex hull case). This distinction also applies in the discussion about output sequences and sets in Section 5.1.

Recent approaches which pushed the seq2seq paradigm further by adding memory and computation to these models allowed us to define a model which makes no assumptions about input ordering, whilst preserving the right properties which we just discussed: a memory that increases with the size of the set, and which is order invariant. In the next sections, we explain such a modification, which could also be seen as a special case of a Memory Network (Weston et al., 2015) or Neural Turing Machine (Graves et al., 2014) – with a computation flow as depicted in Figure 1.

4.2 ATTENTION MECHANISMS

Neural models with memories coupled to differentiable addressing mechanisms have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:

$$q_t = \mathrm{LSTM}(q^*_{t-1}) \quad (3)$$
$$e_{i,t} = f(m_i, q_t) \quad (4)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})} \quad (5)$$
$$r_t = \sum_i a_{i,t}\, m_i \quad (6)$$
$$q^*_t = [q_t\; r_t] \quad (7)$$

Figure 1: The Read-Process-and-Write model.

where $i$ indexes through each memory vector $m_i$ (typically equal to the cardinality of X), $q_t$ is a query vector which allows us to read $r_t$ from the memories, $f$ is a function that computes a single scalar from $m_i$ and $q_t$ (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. $q^*_t$ is the state which this LSTM evolves, and is formed by concatenating the query $q_t$ with the resulting attention readout $r_t$. $t$ is the index which indicates how many "processing steps" are being carried to compute the state to be fed to the decoder. Note that permuting $m_i$ and $m_{i'}$ has no effect on the read vector $r_t$.
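As a sanity check on the order-invariance claim, eqs. (4)-(6) can be sketched in plain numpy, using a dot product for $f$ as the text suggests (a toy illustration, not the paper's code):

```python
import numpy as np

def attention_read(memories, q):
    """One content-based read over a memory set (eqs. 4-6):
    dot-product scores e_i, softmax weights a_i, weighted-sum readout r."""
    e = np.array([m @ q for m in memories])         # e_{i,t} = f(m_i, q_t)
    a = np.exp(e - e.max())
    a = a / a.sum()                                 # a_{i,t}, eq. (5)
    return sum(w * m for w, m in zip(a, memories))  # r_t, eq. (6)

# Shuffling the memories leaves the readout unchanged (order invariance):
rng = np.random.default_rng(1)
mems = [rng.normal(size=5) for _ in range(4)]
q = rng.normal(size=5)
assert np.allclose(attention_read(mems, q), attention_read(mems[::-1], q))
```

This is exactly the property that makes the support set behave as a set: permuting the $m_i$ cannot change $r_t$.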

4.3 READ, PROCESS, WRITE

Our model, which naturally handles input sets, has three components (the exact equations and implementation will be released in an appendix prior to publication):

• A reading block, which simply embeds each element $x_i \in X$ using a small neural network onto a memory vector $m_i$ (the same neural network is used for all $i$).

• A process block, which is an LSTM without inputs or outputs performing T steps of computation over the memories $m_i$. This LSTM keeps updating its state by reading $m_i$ repeatedly using the attention mechanism described in the previous section. At the end of this block, its hidden state $q^*_T$ is an embedding which is permutation invariant to the inputs. See eqs. (3)-(7) for more details.


Page 22: (DL輪読)Matching Networks for One Shot Learning

Experimental setup
- The N-way k-shot learning task
  - A task commonly used to evaluate one-shot learning.
  - Only k labelled examples are available for each of N classes.
  - The N classes are held out from training: the model is trained on the remaining data, then must classify test examples into one of the N classes. A random-guess baseline scores 1/N.
  - In the fine-tuning condition the model is additionally trained on the N classes, but with so few examples there is a real risk of overfitting.
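The episode structure above can be made concrete with a small sampler. This is a minimal sketch under my own naming conventions, not code from the paper:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query=1, rng=None):
    """Sample one N-way k-shot episode from {class: [examples]}.
    Returns a support set of N*k labelled examples plus query examples
    from the same N classes; labels are episode-local integers 0..N-1."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query
```

Training on episodes drawn this way matches the test-time condition: the model always classifies queries against a small, freshly sampled support set.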

Page 23: (DL輪読)Matching Networks for One Shot Learning

Experiment 1: image classification task
- Dataset: Omniglot
  - 1623 classes of handwritten characters; 20 examples per class.
- Comparison methods
  - Pixels: nearest neighbor on raw pixel distance
  - Baseline: nearest neighbor on CNN features
  - MANN
  - Siamese network
- The N target classes are never seen during training (to keep the setting comparable with the other methods).
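The two simplest baselines reduce to a one-nearest-neighbor rule. A sketch of the "Pixels" variant (replacing raw pixels with CNN features gives the "Baseline" variant):

```python
import numpy as np

def one_shot_nn(support_images, support_labels, query_image):
    """'Pixels' baseline: label the query with the class of the nearest
    support example under raw-pixel Euclidean distance."""
    dists = [np.linalg.norm(query_image - s) for s in support_images]
    return support_labels[int(np.argmin(dists))]
```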

Page 24: (DL輪読)Matching Networks for One Shot Learning

Experiment 1: results
- The proposed method achieves the best results.
- Fine-tuning does not help the proposed method much.
- Caveat: the comparison does not include the results of Lake et al.'s work (as pointed out by Karpathy).
  - Lake et al. report 95.2% on 1-shot, 20-way classification [Lake+ 2011].

Page 25: (DL輪読)Matching Networks for One Shot Learning

Experiments 2 and onward
- (I ran out of energy to summarize these...)

Page 26: (DL輪読)Matching Networks for One Shot Learning

Other related work
- One-shot generalization through deep generative models: One-shot generalization [Rezende+ 2016]
  - Proposes a VAE augmented with attention mechanisms.
  - Shows examples of one-shot generation.

One-shot Generalization in Deep Generative Models

(a) Unconditional generative model. (b) One-step of the conditional generative model.
Figure 2. Stochastic computational graph showing conditional probabilities and computational steps for sequential generative models. A represents an attentional mechanism that uses function $f_w$ for writing and function $f_r$ for reading.

and our transition is specified as a long short-term memory network (LSTM, Hochreiter & Schmidhuber (1997)). We explicitly represent the creation of a set of hidden variables $c_t$ that is a hidden canvas of the model (equation (6)). The canvas function $f_c$ allows for many different transformations, and it is here where generative (writing) attention is used; we describe a number of choices for this function in section 3.2.3. The generated image (7) is sampled using an observation function $f_o(c; \theta_o)$ that maps the last hidden canvas $c_T$ to the parameters of the observation model. The set of all parameters of the generative model is $\theta = \{\theta_h, \theta_c, \theta_o\}$.

3.2.2. FREE ENERGY OBJECTIVE

Given the probabilistic model (3)-(7) we can obtain an objective function for inference and parameter learning using variational inference. By applying the variational principle, we obtain the free energy objective:

$$\log p(x) = \log \int p_\theta(x|z_{1:T})\, p(z_{1:T})\, dz_{1:T} \ge \mathcal{F}$$
$$\mathcal{F} = \mathbb{E}_{q(z_{1:T})}[\log p_\theta(x|z_{1:T})] - \sum_{t=1}^{T} \mathrm{KL}[q_\phi(z_t|z_{<t}, x) \,\|\, p(z_t)], \quad (8)$$

where $z_{<t}$ indicates the collection of all latent variables from step 1 to $t-1$. We can now optimize this objective function for the variational parameters $\phi$ and the model parameters $\theta$, by stochastic gradient descent using a mini-batch of data. As with other VAEs, we use a single sample of the latent variables generated from $q_\phi(z|x)$ when computing the Monte Carlo gradient. To complete our specification, we now specify the hidden-canvas functions $f_c$ and the approximate posterior distribution $q_\phi(z_t)$.

3.2.3. HIDDEN CANVAS FUNCTIONS

The canvas transition function $f_c(c_{t-1}, h_t; \theta_c)$ (6) updates the hidden canvas by first non-linearly transforming the current hidden state of the LSTM $h_t$ (using a function $f_w$) and fuses the result with the existing canvas $c_{t-1}$. In this work we use hidden canvases that have the same size as the original images, though they could be either larger or smaller in size and can have any number of channels (four in this paper). We consider two ways with which to update the hidden canvas:

Additive Canvas. As the name implies, an additive canvas updates the canvas by simply adding a transformation of the hidden state $f_w(h_t; \theta_c)$ to the previous canvas state $c_{t-1}$. This is a simple, yet effective (see results) update rule:

$$f_c(c_{t-1}, h_t; \theta_c) = c_{t-1} + f_w(h_t; \theta_c), \quad (9)$$

Gated Recurrent Canvas. The canvas function can be updated using a convolutional gated recurrent unit (CGRU) architecture (Kaiser & Sutskever, 2015), which provides a non-linear and recursive updating mechanism for the canvas and is a simplified version of convolutional LSTMs (further details of the CGRU are given in appendix B). The canvas update is:

$$f_c(c_{t-1}, h_t; \theta_c) = \mathrm{CGRU}(c_{t-1} + f_w(h_t; \theta_c)) \quad (10)$$

In both cases, the function $f_w(h_t; \theta_w)$ is a writing or generative attention function, that we implement as a spatial transformer; inputs to the spatial transformer are its affine parameters and a $10 \times 10$ image to be transformed, both of which are provided by the LSTM output.

The final phase of the generative process transforms the hidden canvas at the last time step $c_T$ into the parameters of the likelihood function using the output function $f_o(c; \theta_o)$. Since we use a hidden canvas that is the same size as the original images but that has a different number of filters, we implement the output function as a $1 \times 1$ convolution. When the hidden canvas has a different size, a convolutional network is used instead.
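The two canvas rules differ only in whether the accumulated state passes through a gated non-linearity; the additive case (eq. 9) is just a running sum. A toy numpy sketch (the real $f_w$ is a spatial-transformer write; here `writes` stands in for the $f_w$ outputs at each step):

```python
import numpy as np

def additive_canvas(writes, shape):
    """Additive canvas (eq. 9): c_t = c_{t-1} + f_w(h_t; theta_c),
    accumulated over T generation steps from a zero-initialised canvas."""
    canvas = np.zeros(shape)
    for w in writes:
        canvas = canvas + w
    return canvas
```

The gated-recurrent variant (eq. 10) would wrap each step's sum in a CGRU update instead of keeping it linear.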

3.2.4. DEPENDENT POSTERIOR INFERENCE

We use a structured posterior approximation that has an auto-regressive form, i.e. $q(z_t|z_{<t}, x)$. We implement this distribution as an inference network parameterized by a deep network. The specific form we use is:

$$\text{Sprite: } r_t = f_r(x, h_{t-1}; \phi_r) \quad (11)$$
$$\text{Sample: } z_t \sim \mathcal{N}(z_t \,|\, \mu(r_t, h_{t-1}; \phi_\mu), \sigma(r_t, h_{t-1}; \phi_\sigma)) \quad (12)$$

Figure 8. Unconditional samples for $52 \times 52$ omniglot (task 1). For a video of the generation process, see https://www.youtube.com/watch?v=HQEI2xfTgm4

Figure 9. Generating new exemplars of a given character for the weak generalization test (task 2a). The first row shows the test images and the next 10 are one-shot samples from the model.

3. Representative samples from a novel alphabet. This task corresponds to figure 7 in Lake et al. (2015), and conditions the model on anywhere between 1 to 10 samples of a novel alphabet and asks the model to generate new characters consistent with this novel alphabet. We show here the hardest form of this test, using only 1 context image. This test is highly subjective, but the model generations in figure 11 show that it is able to pick up common features and use them in the generations.

We have emphasized the usefulness of deep generative models as scalable, general-purpose tools for probabilistic reasoning that have the important property of one-shot generalization. But, these models do have limitations. We have already pointed to the need for reasonable amounts of data. Another important consideration is that, while our models can perform one-shot generalization, they do not perform one-shot learning. One-shot learning requires that a model is updated after the presentation of each new input, e.g., like the non-parametric models used by Lake et al. (2015) or Salakhutdinov et al. (2013). Parametric models such as ours require a gradient update of the parameters, which we do not do. Instead, our model performs a type of one-shot inference that during test time can perform inferential tasks on new data points, such as missing data completion, new exemplar generation, or analogical sampling, but does not learn from these points. This distinction between one-shot learning and inference is important and affects how such models can be used. We aim to extend our approach to the online and one-shot learning setting in future.

Figure 10. Generating new exemplars of a given character for the strong generalization test (task 2b,c), with models trained with different amounts of data. Left: samples from a model trained on a 30-20 train-test split; Middle: 40-10 split; Right: 45-5 split.

Figure 11. Generating new exemplars from a novel alphabet (task 3). The first row shows the test images, and the next 10 rows are one-shot samples generated by the model.

6. Conclusion

We have developed a new class of general-purpose models that have the ability to perform one-shot generalization, emulating an important characteristic of human cognition. Sequential generative models are natural extensions of variational auto-encoders and provide state-of-the-art models for deep density estimation and image generation. The models specify a sequential process over groups of latent variables that allows it to compute the probability of data points over a number of steps, using the principles of feedback and attention. The use of spatial attention mechanisms substantially improves the ability of the model to generalize. The spatial transformer is a highly flexible attention mechanism for both reading and writing, and is now our default mechanism for attention in generative models. We highlighted the one-shot generalization ability of the model over a range of tasks that showed that the model is able to generate compelling and diverse samples, having seen new examples just once. However there are limitations of this approach, e.g., still needing a reasonable amount of data to avoid overfitting, which we hope to address in future work.

Page 27: (DL輪読)Matching Networks for One Shot Learning

Summary
- The one-shot learning problem
  - Related to zero-shot learning, transfer learning, and similar settings.
- Prior work
  - Before deep learning, Bayesian-style approaches were the mainstream.
  - Within deep learning there are broadly three approaches:
    1. Metric learning
    2. Memory networks
    3. Combinations of memory networks and metric learning
- This paper proposes a method of type 3:
  - an approach that learns end-to-end with Matching Networks.
- Impressions
  - (Sorry, I did not manage to get a reproduction working in time.)
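At prediction time, the type-3 approach boils down to attention over the support set: $\hat{y} = \sum_i a(\hat{x}, x_i)\, y_i$, with $a$ a softmax over cosine similarities (eq. 1 of the paper). A numpy sketch, using arbitrary vectors in place of the learned $f$ and $g$ embedding networks:

```python
import numpy as np

def matching_net_predict(support_emb, support_labels_onehot, query_emb):
    """Class probabilities for a query as an attention-weighted sum of
    support labels: softmax over cosine similarities (eq. 1)."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    sims = np.array([cos(query_emb, s) for s in support_emb])
    a = np.exp(sims - sims.max())
    a = a / a.sum()                    # attention kernel a(x_hat, x_i)
    return a @ np.asarray(support_labels_onehot, dtype=float)
```

Because the prediction is a differentiable function of the embeddings, training can backpropagate through this entire computation, which is what makes the method end-to-end.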

Page 28: (DL輪読)Matching Networks for One Shot Learning

References
- Sites consulted
  - https://www.quora.com/How-is-one-shot-learning-different-from-deep-learning#
  - https://www.quora.com/What-is-the-difference-between-one-shot-learning-and-transfer-learning
  - Karpathy's notes: https://github.com/karpathy/paper-notes/blob/master/matching_networks.md
- Papers
  - (The papers and figures quoted are credited on the slides where they appear.)