A Survey on Meta-Learning
Xiang Li
Nanyang Technological University
September 10, 2019
What’s Not Included in This Survey?
- Meta-Learning for Reinforcement Learning
- Bayesian-Based Meta-Learning
Overview
- Problem Definition
  - Few-shot Learning
- Approaches
  - Non-Parametric Methods (Metric Learning)
  - Model-Based Methods (Black-Box Adaptation)
  - Optimization-Based Methods
  - Other Methods
- Summary
Problem Definition
- Over a task distribution $p(\mathcal{T})$:

$$\theta^* \leftarrow \arg\min_\theta \sum_{t \sim p(\mathcal{T})} \mathcal{L}_t(\theta)$$

- Common tasks:
  - Few-shot Learning
  - Regression
  - Reinforcement Learning
Few-shot Learning: Definition
- Training set $D_{meta\text{-}train} = \{(x_1, y_1), \ldots, (x_k, y_k)\}$
- Test set $D_{meta\text{-}test} = \{D_1, \ldots, D_n\}$, where $D_i = \{(x_{i1}, y_{i1}), \ldots, (x_{im}, y_{im})\}$
- N-way, K-shot problem (usually 5-way 5-shot or 5-way 1-shot)

[Figure (Ravi et al. '17): an episode is split into a support set (labels given) and a query set (labels to predict)]
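To make the episodic setup concrete, here is a minimal sketch of how an N-way K-shot episode could be sampled. The dictionary layout, function name, and query-set size are illustrative assumptions, not from the slides.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=5, q_queries=15):
    """Sample one N-way K-shot episode from a dict {class_label: [examples]}.

    Returns a support set (labels given to the learner) and a query set
    (labels the learner must predict).
    """
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + q_queries)
        # Labels are re-indexed 0..N-1 within the episode.
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```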
Few-shot Learning: Dataset
- Omniglot dataset (Lake et al. '15)
  - 1623 characters, 20 instances of each character
  - the "transpose" of MNIST: many classes with few examples each
- Mini-ImageNet
  - subset of ImageNet
- Other datasets: CIFAR, CUB, CelebA, and others
Non-Parametric Methods (Metric Learning)
- Idea: simply compare query images to the images in the support set in a feature space
- Challenges:
  - Feature space (compare in what space?)
  - Distance metric (compare with what metric?)
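As a strawman baseline for this idea, a nearest-neighbor classifier in an embedding space might look like the following sketch (PyTorch assumed; `f`, cosine similarity, and all names are illustrative, not from the slides):

```python
import torch
import torch.nn.functional as F

def nearest_neighbor_predict(f, support_x, support_y, query_x):
    """Classify each query as the label of its most similar support example.

    f: an embedding network (assumed); support_x: [S, ...]; support_y: [S];
    query_x: [Q, ...]. Cosine similarity stands in for the learned metric.
    """
    s_emb = F.normalize(f(support_x), dim=-1)   # [S, D]
    q_emb = F.normalize(f(query_x), dim=-1)     # [Q, D]
    sims = q_emb @ s_emb.t()                    # [Q, S] cosine similarities
    return support_y[sims.argmax(dim=-1)]       # label of nearest support example
```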
Siamese Network (Koch et al. ’15)
- Can we make the training conditions match the test conditions?
- Can the feature space be conditioned on the specific task?
Matching Networks for One Shot Learning (Vinyals et al. '16)
- Sampling strategy: test and training conditions must match, so sample "episodes" in each training batch
- Attention kernel:

$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i, \qquad a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$

- Full Context Embeddings (FCE):
  - $g_\theta$ is a bidirectional LSTM that encodes $x_i$ in the context of the support set $S$
  - $f_\theta$ is an LSTM that encodes $\hat{x}$ in the context of $S$
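A hedged sketch of the attention kernel above, using cosine similarity for $c$ and omitting the FCE LSTMs (PyTorch assumed; interfaces illustrative):

```python
import torch
import torch.nn.functional as F

def matching_net_predict(f, g, support_x, support_y, query_x, n_way):
    """Attention-kernel prediction in the style of Matching Networks.

    a(x, x_i) is a softmax over cosine similarities c(f(x), g(x_i));
    the prediction is the attention-weighted sum of one-hot support labels.
    f and g are assumed embedding networks.
    """
    q = F.normalize(f(query_x), dim=-1)              # [Q, D]
    s = F.normalize(g(support_x), dim=-1)            # [S, D]
    attn = F.softmax(q @ s.t(), dim=-1)              # [Q, S] = a(x, x_i)
    one_hot = F.one_hot(support_y, n_way).float()    # [S, N]
    return attn @ one_hot                            # [Q, N]: y = sum_i a(x, x_i) y_i
```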
Other Metric-Based Works
- Prototypical Networks for Few-shot Learning (Snell et al. '17) — see the sketch after this list:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i), \qquad p_\phi(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$$

- Learning to Compare: Relation Network for Few-Shot Learning (Sung et al. '18)
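A minimal sketch of the Prototypical Networks computation above, assuming a PyTorch embedding network `f` and squared Euclidean distance:

```python
import torch

def prototypical_predict(f, support_x, support_y, query_x, n_way):
    """Class prototypes are mean embeddings; class probabilities are a
    softmax over negative squared Euclidean distances to the prototypes."""
    s_emb, q_emb = f(support_x), f(query_x)              # [S, D], [Q, D]
    protos = torch.stack([s_emb[support_y == k].mean(0)  # c_k = mean of class-k embeddings
                          for k in range(n_way)])        # [N, D]
    dists = torch.cdist(q_emb, protos) ** 2              # [Q, N] squared distances
    return (-dists).softmax(dim=-1)                      # p(y = k | x)
```

Note that at test time this is entirely feed-forward, which is one reason metric-based methods are fast.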
Performance
[Tables: few-shot classification results on Omniglot and Mini-ImageNet]
Metric-Based Methods: Summary
- Meta knowledge: feature space & distance metric
- Task knowledge: none
- Advantages:
  - Simple and computationally fast
  - Entirely feed-forward
- Disadvantages:
  - Hard to scale to very large K
  - Limited to classification problems
Problem on Metric-Based Methods
How can we adapt to each specific task?
- Model-Based Methods
- Optimization-Based Methods
Model-Based Methods (Black-Box Adaptation)
- Idea:
  - Learn an adaptation $\phi_i$ for each task $i$
  - Train a network (with parameters $\theta$) to represent $p(\phi_i \mid D^{support}_i, \theta)$
Model-Based Methods (Black-Box Adaptation)
How do we aggregate the samples in the support set (i.e., what form does $f_\theta$ take)?
- Average (Prototypical Networks)
- LSTM (Matching Networks)
- Memory-Augmented Neural Networks
Neural Turing Machines (NTMs) (Graves et al. ’14)
- Attention-based
- Memory matrix $M$
- Read:

$$r \leftarrow \sum_i w(i)\, M(i), \qquad \sum_i w(i) = 1$$

- Write (erase, then add):

$$M_t(i) \leftarrow M_{t-1}(i)\,[1 - w(i)\, e], \qquad M_t(i) \leftarrow M_t(i) + w(i)\, a$$
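The read and erase/add write operations translate almost directly into code; a sketch (PyTorch assumed, computation of the addressing weights $w$ omitted):

```python
import torch

def ntm_read(M, w):
    """Read r = sum_i w(i) M(i); w is a normalized attention over rows of M.

    M: [N, D] memory matrix; w: [N] addressing weights summing to 1.
    """
    return w @ M                                # [D]

def ntm_write(M, w, erase, add):
    """Erase-then-add write: M(i) <- M(i) * (1 - w(i) e) + w(i) a.

    erase: [D] erase vector in (0, 1); add: [D] add vector.
    """
    M = M * (1 - w.unsqueeze(-1) * erase)       # attenuate addressed rows
    return M + w.unsqueeze(-1) * add            # write new content into them
```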
Neural Turing Machines (NTMs) (Graves et al. ’14)
Advantages:
- Large memory
- Addressable
Target: $p(\phi_i \mid D^{support}_i, \theta)$

What forms can $\phi_i$ take? It should:
- Contain task-specific information
- Contain only useful information

Solution 1: feature representations of the samples in $D^{support}$
Meta-Learning with Memory-Augmented Neural Networks (Santoro et al. '16)
- Read memory:

$$r_t = \sum_{i=1}^{N} w^r_t(i)\, M_t(i), \qquad \text{where } w^r_t(i) = \mathrm{softmax}\!\left(\frac{k_t \cdot M_t(i)}{\|k_t\| \cdot \|M_t(i)\|}\right)$$

- Least Recently Used Access (LRUA) for writing
- Training strategy: labels are presented with a one-time-step offset
Dynamic Few-Shot Visual Learning without Forgetting (Gidaris et al. '18)

Use cosine similarity in the classification layer instead of a dot product.

Generate the prototypical (classification-weight) vector of each novel class using:
- The average of the features in the support set (as in Prototypical Networks)
- Attention-based inference over a memory of base-class weights:
  - Base-class weights act like a dictionary, with a memory matrix $M$ and a key matrix $K$

$$w'_{att} = \frac{1}{N'} \sum_{i=1}^{N'} \sum_{b=1}^{K_{base}} \mathrm{Att}\!\left(\phi_q z'_i,\, k_b\right) \cdot w_b$$
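A sketch of the cosine-similarity classification layer, which scores a feature against (base or generated) class weight vectors; the scale factor is an assumed detail borrowed from common practice:

```python
import torch
import torch.nn.functional as F

def cosine_classifier(features, class_weights, scale=10.0):
    """Score classes by scaled cosine similarity instead of a dot product.

    features: [B, D]; class_weights: [C, D] (base-class weights or generated
    prototypes for novel classes). Both are L2-normalized, so the magnitude
    of the weight vectors no longer dominates the logits.
    """
    f = F.normalize(features, dim=-1)
    w = F.normalize(class_weights, dim=-1)
    return scale * (f @ w.t())                  # [B, C] logits
```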
Target: $p(\phi_i \mid D^{support}_i, \theta)$

What forms can $\phi_i$ take? It should:
- Contain task-specific information
- Contain only useful information

Solution 2: network weights
Meta Networks (Munkhdalai et al. '17)
- Slow weights + fast weights
- Components: a meta feature extractor $f_m$ (key embedding) with slow weights $M$; a base feature extractor $f_b$ with slow weights $B$; a meta weight generator $g_m$; a base weight generator $g_b$
- Meta-information: gradients
Meta Networks (Munkhdalai et al. '17)
- Generate fast weights for the meta feature extractor:
  - Sample $T$ examples $(x'_i, y'_i)$ from the support set
  - For each $i$: $\nabla_i \leftarrow \nabla_M \mathcal{L}_{embed}(f_m(M, x'_i), y'_i)$
  - $M^* \leftarrow g_m(\{\nabla_i\})$; the meta feature extractor $f_m$ then uses slow weights $M$ plus fast weights $M^*$
- Store per-sample base fast weights in an NTM-style memory:
  - For every sample $(x'_i, y'_i)$ in the support set:
    - $\nabla_i \leftarrow \nabla_B \mathcal{L}_{task}(f_b(B, x'_i), y'_i)$
    - $B^*_i \leftarrow g_b(\nabla_i)$ (fast weights for the base extractor)
    - Store $B^*_i$ in memory $U(i)$
    - $r'_i \leftarrow f_m(M, M^*, x'_i)$, stored in the key matrix $K(i)$
- Use fast weights from memory to extract features for query samples:
  - For every sample $(x_i, y_i)$ in the query set:
    - $r_i \leftarrow f_m(M, M^*, x_i)$ (key)
    - Access the memory with $r_i$ to get $W^*_i$ (fast weights for $f_b$)
    - Extract features $f_b(W, W^*_i, x_i)$ and accumulate the task loss
- Use the total loss to update the slow weights and the weight generators
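A heavily simplified sketch of the fast-weight machinery: per-example gradients are mapped to fast weights by a generator network and stored in a memory keyed by embeddings, then read back with soft attention. Layer-wise weight structure, LRUA, and other details from the paper are omitted; `f_base.embed` and `g_gen` are illustrative interfaces, not the paper's API.

```python
import torch
import torch.nn.functional as F

def build_fast_weight_memory(f_base, g_gen, loss_fn, support):
    """Per-example fast weights B*_i = g(grad_i), keyed by embeddings (sketch).

    f_base: base learner with slow weights; g_gen: weight-generator network
    mapping a flattened gradient to a fast-weight vector; support: list of
    (x_i, y_i). All interfaces are assumptions.
    """
    keys, values = [], []
    for x, y in support:
        loss = loss_fn(f_base(x), y)
        grads = torch.autograd.grad(loss, f_base.parameters())
        flat = torch.cat([g.flatten() for g in grads])
        values.append(g_gen(flat))          # fast weight from meta-information (gradient)
        keys.append(f_base.embed(x))        # key r'_i (assumed embedding method)
    return torch.stack(keys), torch.stack(values)

def read_fast_weight(keys, values, query_key):
    """Soft-attention read: fast weight for a query as a similarity-weighted mix."""
    attn = F.softmax(F.cosine_similarity(keys, query_key.unsqueeze(0), dim=-1), dim=0)
    return (attn.unsqueeze(-1) * values).sum(0)
```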
Meta Networks (Munkhdalai et al. ’17)
- Remaining question: can we find better meta-information? (In this work: gradients.)
Model-Based Methods: Summary
- Meta knowledge: slow weights
- Task knowledge: generated weights & support-set features
- Advantages:
  - Applicable to many learning problems
- Disadvantages:
  - Complicated (model and architecture are intertwined)
  - Difficult to optimize
Target: $p(\phi_i \mid D^{support}_i, \theta)$

Since generating weights ($\phi_i$) is computationally costly, why not update the existing weights $\theta$ instead?

Fine-tune!
Optimization-Based Methods
- Idea:
  - Fine-tune to the specific task
  - At test time, given a task $t$: $\theta_t \leftarrow g(\theta, D^{support})$, then predict with $y = f(\theta_t, x)$
- What can we improve in the fine-tuning process?
A Look into the Fine-Tuning Process (Gradient Descent)

$$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}$$

- Compute more effective "gradients"?
- Start from better initial parameters?
- Update only part of the weights?
Optimization as a Model for Few-Shot Learning (Ravi et al. '17)
- An observation:
  - Gradient descent: $\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t$
  - Cell update in an LSTM: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
  - If $f_t = 1$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, then $c_t$ can be treated as $\theta_t$
- Make $i_t$ and $f_t$ dynamic (learned):

$$i_t = \sigma\!\left(W_I \cdot \left[\nabla_{\theta_{t-1}}\mathcal{L}_t,\ \mathcal{L}_t,\ \theta_{t-1},\ i_{t-1}\right] + b_I\right)$$

$$f_t = \sigma\!\left(W_F \cdot \left[\nabla_{\theta_{t-1}}\mathcal{L}_t,\ \mathcal{L}_t,\ \theta_{t-1},\ f_{t-1}\right] + b_F\right)$$

- Weight sharing across parameter coordinates
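A sketch of the learned update rule with the weight-sharing trick (one gate network applied to every parameter coordinate); the paper's gradient/loss preprocessing is omitted and all shapes are assumptions:

```python
import torch
import torch.nn as nn

class LSTMMetaLearnerCell(nn.Module):
    """Learned update theta_t = f_t * theta_{t-1} - i_t * grad_t (sketch).

    A single pair of gate networks is shared across all parameter
    coordinates, mirroring the weight-sharing trick.
    """
    def __init__(self):
        super().__init__()
        self.gate_i = nn.Linear(4, 1)   # inputs: [grad, loss, theta, prev i_t]
        self.gate_f = nn.Linear(4, 1)   # inputs: [grad, loss, theta, prev f_t]

    def forward(self, theta, grad, loss, i_prev, f_prev):
        # theta, grad, i_prev, f_prev: [P]; the scalar loss is broadcast.
        loss_col = loss * torch.ones_like(theta)
        feats_i = torch.stack([grad, loss_col, theta, i_prev], dim=-1)  # [P, 4]
        feats_f = torch.stack([grad, loss_col, theta, f_prev], dim=-1)
        i_t = torch.sigmoid(self.gate_i(feats_i)).squeeze(-1)
        f_t = torch.sigmoid(self.gate_f(feats_f)).squeeze(-1)
        theta_next = f_t * theta - i_t * grad   # c_t = f_t * c_{t-1} + i_t * (-grad)
        return theta_next, i_t, f_t
```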
Optimization as a Model for Few-Shot Learning (Ravi et al. '17)

The method uses the loss ($\mathcal{L}_t$), the gradient ($\nabla_{\theta_{t-1}} \mathcal{L}_t$), the network weights ($\theta_{t-1}$), and the previous state ($i_{t-1}$, $f_{t-1}$) as meta-information.
A Look into the Fine-Tuning Process (Gradient Descent)

$$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}$$

- Compute more effective "gradients"?
- Start from better initial parameters?
- Update only part of the weights?
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. '17)
- Learn initial parameters that are easy to adapt to any task from the task distribution
  - Initial parameters: prior knowledge (meta knowledge)
  - "Easy": adaptable in one or a few gradient steps
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. '17)

Task-specific adaptation: $\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)$

Meta-knowledge update: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}(f_{\theta'_i})$
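A minimal second-order MAML step under these two equations, sketched with `torch.func.functional_call` (PyTorch >= 2.0 assumed; a single inner step and plain SGD on both loops):

```python
import torch
from torch.func import functional_call

def maml_step(model, tasks, loss_fn, inner_lr=0.01, outer_lr=0.001):
    """One (second-order) MAML meta-update -- a sketch.

    Each task supplies (support_x, support_y, query_x, query_y).
    """
    params = dict(model.named_parameters())
    meta_loss = 0.0
    for sx, sy, qx, qy in tasks:
        # Inner loop: task-specific adaptation  theta'_i = theta - alpha * grad
        support_loss = loss_fn(functional_call(model, params, (sx,)), sy)
        grads = torch.autograd.grad(support_loss, list(params.values()),
                                    create_graph=True)   # keep graph for 2nd order
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(params.items(), grads)}
        # Outer objective: query loss evaluated at the adapted parameters
        meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (qx,)), qy)
    # Meta-knowledge update: descend the summed query losses w.r.t. theta
    meta_grads = torch.autograd.grad(meta_loss, list(params.values()))
    with torch.no_grad():
        for p, g in zip(params.values(), meta_grads):
            p -= outer_lr * g
```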
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. '17)

Does such an optimal point in parameter space exist?
A Look into the Fine-Tuning Process (Gradient Descent)

$$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}$$

- Compute more effective "gradients"?
- Start from better initial parameters?
- Update only part of the weights?
Fast Context Adaptation via Meta-Learning (Zintgraf et al. '19)
- Meta-learned network parameters $\theta$ and task-specific context parameters $\phi$
- Dividing parameters into meta and task-specific parts is similar in spirit to the LSTM meta-learner (Ravi et al. '17) and Meta Networks (Munkhdalai et al. '17)
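A sketch of the CAVIA-style inner loop, where only a small context vector is adapted per task while the network weights stay fixed; the model's `(x, phi)` interface (e.g., concatenating the context to the input) is an assumption:

```python
import torch

def cavia_inner_loop(model, context_dim, support_x, support_y, loss_fn,
                     inner_lr=0.1, steps=2):
    """Adapt only the context parameters phi; network weights theta stay fixed.

    The model is assumed to take (x, phi) and condition on phi.
    """
    phi = torch.zeros(context_dim, requires_grad=True)   # reset to zero per task
    for _ in range(steps):
        loss = loss_fn(model(support_x, phi), support_y)
        (grad,) = torch.autograd.grad(loss, phi, create_graph=True)
        phi = phi - inner_lr * grad                      # differentiable update
    return phi                                           # task-specific parameters
```

Because $\phi$ is tiny compared to $\theta$, the inner loop is cheap and less prone to overfitting the few support examples.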
Performance

[Table: performance results]
Optimization-Based Methods: Summary
- Meta knowledge: outer-loop optimization
- Task knowledge: inner-loop optimization
- Advantages:
  - Applicable to many kinds of tasks
  - Model-agnostic
- Disadvantages:
  - Requires second-order optimization (mitigated by first-order MAML & Reptile)
Comparison
| | Metric-Based | Model-Based | Optimization-Based |
|---|---|---|---|
| Applicable tasks | Classification or verification | All | All |
| Applicable models | All feature extractors | Designed models | All backprop-based models |
| Computational cost | Low | High | High |
| Optimization | Easy | Hard | Hard |
| Meta information | Feature space & distance metric | Slow weights | Outer-loop optimized weights |
| Task information | None | Generated features & weights | Inner-loop optimized weights |
Other Methods
MetaGAN: An Adversarial Approach to Few-Shot Learning (Zhang et al. '18)

$$\mathcal{L}_D = \mathbb{E}_{x \sim Q^u_T}\, \log p_D(y \le N \mid x) + \mathbb{E}_{x \sim p^T_G}\, \log p_D(N + 1 \mid x)$$

Explanation:
- Enrich task samples with generated outliers (an extra fake class $N+1$)
- cat & car > cat & dog → better decision boundary
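One way the $(N{+}1)$-class discriminator objective could be written, sketched with cross-entropy (signs follow the usual minimized negative log-likelihood convention rather than the maximized log-probabilities above; shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def metagan_discriminator_loss(logits_real, labels_real, logits_fake, n_way):
    """Sketch of an (N+1)-class discriminator objective.

    Real task samples should fall into classes 0..N-1; generated samples
    into the extra fake class, which has index N. logits_*: [B, N+1].
    """
    real_loss = F.cross_entropy(logits_real, labels_real)           # p_D(y <= N | x)
    fake_labels = torch.full((logits_fake.size(0),), n_way).long()  # the fake class
    fake_loss = F.cross_entropy(logits_fake, fake_labels)           # p_D(N+1 | x)
    return real_loss + fake_loss
```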
A Simple Neural Attentive Meta-Learner (Mishra et al. '18)

Temporal generation:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Built from causal temporal convolutional layers; dilation (van den Oord et al. '16) grows the receptive field exponentially with depth.
A Simple Neural Attentive Meta-Learner (Mishra et al. '18)

The model directly parameterizes

$$p(y^{test} \mid x^{test}, X^{support}, Y^{support})$$

Temporal convolutions alone give only coarse access to previous inputs (the support set).
A Simple Neural Attentive Meta-Learner (Mishra et al. '18)

Attention is all you need! SNAIL interleaves dilated (temporal convolution) layers with attention layers: convolutions aggregate context over time, while attention gives pinpoint access to specific past inputs.
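Minimal sketches of the two building blocks, a dilated causal convolution and a causally masked attention layer (single head; PyTorch assumed, sizes illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal 1-D convolution with dilation: the output at time t sees
    only inputs at times <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation   # left-pad by (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                     # x: [B, C, T]
        return self.conv(F.pad(x, (self.pad, 0)))

class CausalAttention(nn.Module):
    """Single-head attention with a causal mask: pinpoint access to the past."""
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Linear(channels, 3 * channels)

    def forward(self, x):                     # x: [B, T, C]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        mask = torch.triu(torch.ones(x.size(1), x.size(1)), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))   # no access to the future
        return F.softmax(scores, dim=-1) @ v
```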
Summary: Common Ideas
- Knowledge design
  - Meta-knowledge
  - Task-specific knowledge (from the support set)
- Network design
  - Distinguish meta-knowledge from task-specific knowledge
  - Combine meta-knowledge with task-specific knowledge
- Training strategy
  - Joint training, or mimicking the test conditions?
  - An outer loop and an inner loop for updating different kinds of knowledge
Summary: Challenges
- Overfitting
  - Sample-level overfitting
  - Meta-overfitting
- Optimization
- Ambiguity
Thanks for listening!