Few-shot Learning: State of the Art
Israel Machine Vision Conference (transcript)
Joseph Shtok
IBM Research AI
The presentation is available at http://www.research.ibm.com/haifa/dept/imt/ist_dm.shtml
Problem statement
Using a large annotated offline dataset (offline training: dog, elephant, monkey, …), perform knowledge transfer and build a model for novel categories (lemur, rabbit, mongoose, …), each represented by just a few samples (online training). Depending on the given task, the model learned for the novel categories is a classifier (classification), a detector (detection), or a regressor (regression).
Why work on few-shot learning?
1. It brings deep learning closer to real-world business use cases.
• Companies hesitate to spend much time and money on annotated data for a solution that may not pay off.
• Relevant objects are continuously replaced with new ones; DL has to be agile.
2. It involves a range of exciting cutting-edge technologies:
Meta-learning methods
Networks generating networks
Data synthesizers
Semantic metric spaces
Graph neural networks
Neural Turing Machines
GANs
Few-shot learning: each category is represented by just a few examples; learn to perform classification, detection, or regression. Three main families of approaches:
• Meta-learning: learn a learning strategy that adjusts well to a new few-shot learning task.
• Data augmentation: synthesize more data from the novel classes to facilitate the regular learning.
• Metric learning: learn a 'semantic' embedding space using a distance loss function.
Meta-Learning
Standard learning: data instances → (train a learner on the data) → model.
Meta-learning: a set of tasks → (train a meta-learner to learn on each task) → a learning strategy. The meta-learner is task-agnostic and accumulates data knowledge; given the training data of a new task, it produces a task-specific learner, i.e., a model for that task's specific classes.
Recurrent meta-learners: Matching Networks, Vinyals et al., NIPS 2016
Distance-based classification, based on the similarity between the query and the support samples in the embedding space (an adaptive metric):

$\hat{y} = \sum_i a(\hat{x}, x_i)\, y_i, \qquad a(\hat{x}, x_i) = \mathrm{similarity}\big(f(\hat{x}, S),\, g(x_i, S)\big)$

where $f$ and $g$ are LSTM embeddings of the query $\hat{x}$ and support sample $x_i$, dependent on the support set $S$ (to be elaborated later; in the paper, $a$ is a softmax over cosine similarities).
• The embedding space is class-agnostic.
• The LSTM attention mechanism adjusts the embedding to the task.
(Figure reprinted from Vinyals et al., 2016.)
Concept of episodes: reproduce the test conditions during the training.
• N new categories
• M training examples per category
• one query example from the {1..N} categories
• Typically N = 5, M = 1 or 5.
(A code sketch of episode sampling follows below.)
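To make the episodic protocol concrete, here is a minimal sketch of N-way, M-shot episode sampling; the dataset layout and function names are illustrative assumptions, not from the talk.

```python
import random

def sample_episode(dataset, n_way=5, m_shot=1, n_query=1):
    """Sample one few-shot episode: N classes, M support + q query items each.

    `dataset` is assumed to be a dict {class_label: [samples]}; this layout
    is an illustrative assumption.
    """
    classes = random.sample(list(dataset.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        items = random.sample(dataset[cls], m_shot + n_query)
        # Labels are re-indexed 0..N-1 within the episode, as in the
        # episodic training protocol of Vinyals et al., 2016.
        support += [(x, episode_label) for x in items[:m_shot]]
        query += [(x, episode_label) for x in items[m_shot:]]
    return support, query
```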
Memory-augmented neural networks (MANN), Santoro et al., ICML 2016
• A Neural Turing Machine is a differentiable MANN.
• Learn to predict the distribution $p(y_t \mid x_t, D_{1:t-1}; \theta)$, where $D_{1:t-1}$ is the history of the episode.
• Explicitly store the support samples in the external memory.
Method | miniImageNet 1-shot / 5-shot accuracy (%)
Matching networks | 43.56 / 55.31
Optimizers
Optimize the learner to perform well after fine-tuning on the task data, done by a single (or a few) gradient-descent step(s).
MAML (Model-Agnostic Meta-Learning), Finn et al., ICML 2017
Standard objective (task-specific, for task T): $\min_\theta \mathcal{L}_T(\theta)$, learned via the update $\theta' = \theta - \alpha \nabla_\theta \mathcal{L}_T(\theta)$.
Meta-objective (across tasks): $\min_\theta \sum_{T \sim p(\mathcal{T})} \mathcal{L}_T(\theta')$, learned via the update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T \sim p(\mathcal{T})} \mathcal{L}_T(\theta')$.
(Figure reprinted from Li et al., 2017.)
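A minimal sketch of the MAML inner/outer loop in PyTorch; the model and the task-batch format are assumptions for illustration. The key point is that the inner update is built with create_graph=True, so the outer gradient differentiates through it (second-order MAML).

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_meta_step(model, tasks, meta_opt, alpha=0.01):
    """One MAML meta-update. Each task is assumed to be a pair
    ((support_x, support_y), (query_x, query_y)) of tensors."""
    names = [n for n, _ in model.named_parameters()]
    params = dict(model.named_parameters())
    meta_loss = 0.0
    for (xs, ys), (xq, yq) in tasks:
        # Inner step: theta' = theta - alpha * grad_theta L_T(theta).
        loss = F.cross_entropy(functional_call(model, params, (xs,)), ys)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        fast = {n: params[n] - alpha * g for n, g in zip(names, grads)}
        # Outer (meta) loss: query loss evaluated at the adapted theta'.
        meta_loss = meta_loss + F.cross_entropy(functional_call(model, fast, (xq,)), yq)
    meta_opt.zero_grad()
    meta_loss.backward()   # beta is the learning rate inside meta_opt
    meta_opt.step()
    return meta_loss.item()
```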
Meta-SGD, Li et al., 2017
"Interestingly, the learning process can continue forever, thus enabling life-long learning, and at any moment, the meta-learner can be applied to learn a learner for any new task."
Render α as a vector of the same size as θ. Training objective: minimize the loss $\mathcal{L}_T(\theta')$ with respect to both θ (the weights initialization) and α (the update direction and scale), across the tasks.
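Relative to the MAML sketch above, Meta-SGD changes only the inner update: α becomes a learned per-parameter tensor, optimized jointly with θ. A fragment under the same assumptions as the previous sketch:

```python
# Learnable per-parameter step sizes (the initial value 0.01 is arbitrary).
alphas = {n: torch.full_like(p, 0.01, requires_grad=True)
          for n, p in model.named_parameters()}
meta_opt = torch.optim.Adam(list(model.parameters()) + list(alphas.values()), lr=1e-3)

# Inside the episode loop, the inner update becomes elementwise:
# theta' = theta - alpha (elementwise product) grad
fast = {n: params[n] - alphas[n] * g for n, g in zip(names, grads)}
```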
MethodminiImageNet classification
accuracy 1/5 shot
Matching networks 43.56 / 55.31
MAML 48.70 / 63.11
Meta-SGD 54.24 / 70.86
Optimizers
LEO, Rusu et al., NeurIPS 2018
Latent Embedding Optimization: move the optimization problem from the high-dimensional space of weights θ to a low-dimensional latent space, for robust meta-learning.
Learn a generative distribution of model parameters θ by learning a stochastic latent space with an information bottleneck. A relation network encodes distances between elements of the support set for a new task.

$\theta^* = \arg\min_\theta \, \mathbb{E}_{T \sim p(\mathcal{T})}\big[\mathcal{L}_T(\theta')\big], \qquad \theta' = g(z'), \quad z' = z - \alpha \nabla_z \mathcal{L}_T(\theta_z), \quad z = h(D_T), \ \theta_z = g(z)$

where $h$ is the stochastic encoder and $g$ the parameter decoder.
Pipeline: data instances → stochastic encoder → latent representation z → parameter generator → model parameters θ.
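A minimal sketch of LEO's inner loop, under simplifying assumptions (a plain encoder/decoder pair and a linear classifier head; the real architecture adds a relation network and a stochastic bottleneck):

```python
import torch
import torch.nn.functional as F

def leo_inner_loop(encoder, decoder, support_x, support_y, alpha=0.1, steps=5):
    """Adapt the low-dimensional latent code z instead of the weights.

    `encoder` maps a support set to a latent code z; `decoder` maps z to
    classifier weights. Both are illustrative placeholder modules.
    """
    z = encoder(support_x, support_y)       # z = h(D_T)
    for _ in range(steps):
        w = decoder(z)                      # theta_z = g(z)
        logits = support_x @ w.t()          # linear classifier, for brevity
        loss = F.cross_entropy(logits, support_y)
        (grad_z,) = torch.autograd.grad(loss, z, create_graph=True)
        z = z - alpha * grad_z              # z' = z - alpha * grad_z L_T
    return decoder(z)                       # adapted parameters theta'
```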
Method | miniImageNet 1-shot / 5-shot accuracy (%)
Matching networks | 43.56 / 55.31
MAML | 48.70 / 63.11
Meta-SGD | 54.24 / 70.86
LEO | 61.76 / 77.59
Metric Learning
Offline training: data instances → deep embedding model.
Training: achieve good class distributions for the offline categories (class 1, class 2, class 3, …).
Inference: nearest neighbour in the embedding space. For new task data (classes A, B, C), a query q is classified by comparing its distances d(q, A), d(q, B), d(q, C) to the novel-class samples in the semantic embedding space.
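A minimal sketch of the inference step: embed the support samples and the query, then classify by nearest neighbour in the embedding space. The `embed` function stands in for the offline-trained embedding network; the NumPy layout is an assumption.

```python
import numpy as np

def nearest_neighbour_classify(embed, support_x, support_y, query_x):
    """Classify each query by its nearest support sample in embedding space."""
    s = embed(support_x)                        # (n_support, d) features
    q = embed(query_x)                          # (n_query, d) features
    # Pairwise Euclidean distances between queries and support embeddings.
    dists = np.linalg.norm(q[:, None, :] - s[None, :, :], axis=-1)
    return support_y[np.argmin(dists, axis=1)]  # label of the nearest sample
```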
Metric Learning: Relation Networks, Sung et al., CVPR 2018
Use the Siamese-networks principle:
• Concatenate the embeddings of the query and support samples.
• The relation module is trained to produce a score of 1 for the correct class and 0 for the others.
• Extends to zero-shot learning by replacing support embeddings with semantic features.
(Figure replicated from Sung et al., Learning to Compare: Relation Network for Few-Shot Learning, CVPR 2018.)
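A sketch of the relation-module idea: a small learned network scores concatenated (support, query) embedding pairs; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Scores a concatenated (support, query) embedding pair in [0, 1]."""
    def __init__(self, embed_dim=64, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, support_emb, query_emb):
        # Concatenate along the feature axis; the score is trained toward
        # 1 for same-class pairs and 0 for different-class pairs.
        pair = torch.cat([support_emb, query_emb], dim=-1)
        return self.net(pair).squeeze(-1)
```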
Method | miniImageNet 1-shot / 5-shot accuracy (%)
Matching networks | 43.56 / 55.31
MAML | 48.70 / 63.11
Relation networks | 50.44 / 65.32
Meta-SGD | 54.24 / 70.86
LEO | 61.76 / 77.59
Metric Learning
Matching Networks, Vinyals et al., NIPS 2016
Objective: maximize the log-likelihood $\sum_{(x, y)} \log p_\theta(y \mid x, S)$ of the non-parametric softmax classifier, with

$p_\theta(y \mid \hat{x}, S) = \mathrm{softmax}\big(\cos\big(f(\hat{x}, S),\, g(x_i, S)\big)\big)$

Prototypical Networks, Snell et al., 2016
Each category is represented by a single prototype $c_k$: its mean (embedded) sample. Objective: train with the cross-entropy on the prototype-based probability:

$p_\theta(y = k \mid \hat{x}, C) = \mathrm{softmax}\big(-d_{L_2}\big(f(\hat{x}), c_k\big)\big)$
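A minimal sketch of a prototypical-network episode loss in the spirit of Snell et al. (the `embed` network and the episodic tensor format are assumptions carried over from the sampler above):

```python
import torch
import torch.nn.functional as F

def prototypical_loss(embed, support_x, support_y, query_x, query_y, n_way):
    """Prototype-based episode loss: softmax over negative distances."""
    s, q = embed(support_x), embed(query_x)
    # One prototype per class: the mean of its support embeddings.
    prototypes = torch.stack(
        [s[support_y == k].mean(dim=0) for k in range(n_way)])
    # Logits are negative squared Euclidean distances to the prototypes.
    logits = -torch.cdist(q, prototypes) ** 2
    return F.cross_entropy(logits, query_y)
```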
Method | miniImageNet 1-shot / 5-shot accuracy (%)
Matching networks | 43.56 / 55.31
MAML | 48.70 / 63.11
Relation networks | 50.44 / 65.32
Prototypical networks | 49.42 / 68.20
Meta-SGD | 54.24 / 70.86
LEO | 61.76 / 77.59
Metric Learning
Large-Margin Meta-Learning, Wang et al., 2018
Regularize the cross-entropy (CE) based learning with a large-margin triplet objective:

$\ell = \ell_{CE} + \sum_{\text{triplets}} \big|\, d_p - d_n + \alpha \,\big|_+$

where $d_p$ is the distance from a sample to a positive example and $d_n$ its distance to a (hard) negative; α is the margin. (Figure: distances for the triplet loss.)
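A minimal sketch of the triplet-margin term added to the CE loss (the margin value is an arbitrary illustrative choice for the slide's α):

```python
import torch
import torch.nn.functional as F

def triplet_margin_term(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet regularizer |d_p - d_n + margin|_+ over a batch
    of embedding vectors."""
    d_pos = F.pairwise_distance(anchor, positive)   # d_p
    d_neg = F.pairwise_distance(anchor, negative)   # d_n
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```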
RepMet: Few-shot detection, Karlinsky et al., CVPR 2019
• Equip a standard detector with a distance-based classifier.
• Introduce class representatives $R_{ij}$ as model weights.
• Learn an embedding space using the objective

$\ell = \ell_{CE} + \big|\, \min_j d(E, R_{ij}) - \min_{i' \neq i,\, j} d(E, R_{i'j}) + \alpha \,\big|_+$

where $E$ is the embedding of a sample of class $i$.
Method | miniImageNet 1-shot / 5-shot accuracy (%)
Matching networks | 43.56 / 55.31
MAML | 48.70 / 63.11
Relation networks | 50.44 / 65.32
Prototypical networks | 49.42 / 68.20
Large-margin | 51.08 / 67.57
Meta-SGD | 54.24 / 70.86
LEO | 61.76 / 77.59
Sample synthesis
Offline stage: offline data instances → train a synthesizer sampling from the class distribution → synthesizer model (data knowledge). The synthesizer can be optimized directly for classification.
On new task data: a few data instances of the novel classes → synthesizer model → many data instances → train a model → task model.
Low-Shot Learning from Imaginary Data, Wang et al., 2018
• The classifier is differentiable w.r.t. the training data.
• In each episode, the backpropagated gradient updates the synthesizer.
• The synthesizer is a part of the classifier pipeline, trained end-to-end.
(Figure reprinted from Wang et al., 2018.)

Method | ImageNet top-5 accuracy (%), 1/5-shot: Novel | Base+Novel
Wang et al., 2018 | 43.3 / 68.4 | 55.8 / 71.1
More augmentation approaches: Δ-encoder, Schwartz et al., NeurIPS 2018
• Use a variant of an autoencoder to capture the intra-class difference (the 'delta') between two class samples in the latent space.
• Transfer class distributions from the training classes to the novel classes.
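A sketch of the Δ-encoder idea: encode the difference between two same-class samples, then apply a sampled delta to a novel-class reference to synthesize a new example. Module shapes and names are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DeltaEncoder(nn.Module):
    """Reconstructs a target feature from a same-class reference plus a
    low-dimensional 'delta' code (a sketch in the spirit of Schwartz et al.)."""
    def __init__(self, feat_dim=512, delta_dim=16):
        super().__init__()
        self.encode = nn.Linear(2 * feat_dim, delta_dim)          # (target, ref) -> delta
        self.decode = nn.Linear(feat_dim + delta_dim, feat_dim)   # (ref, delta) -> target

    def forward(self, target, reference):
        delta = self.encode(torch.cat([target, reference], dim=-1))
        return self.decode(torch.cat([reference, delta], dim=-1))

    def synthesize(self, base_target, base_reference, novel_reference):
        # Extract a delta from a base-class pair and apply it to a
        # novel-class reference to produce a synthetic novel-class feature.
        delta = self.encode(torch.cat([base_target, base_reference], dim=-1))
        return self.decode(torch.cat([novel_reference, delta], dim=-1))
```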
Semantic Feature Augmentation in Few-shot Learning, Chen et al., 2018
• Synthesize samples by adding noise in the autoencoder's bottleneck space.
• Make it a semantic space, using word embeddings or visual attributes for the objective's fidelity term.
Synthesis pipeline: new-class reference → encoder → latent code z plus a sampled delta (from a sampled target/reference pair) → decoder → synthesized new-class example.
Method | miniImageNet 1-shot / 5-shot accuracy (%)
Matching networks | 43.56 / 55.31
MAML | 48.70 / 63.11
Relation networks | 50.44 / 65.32
Prototypical networks | 49.42 / 68.20
Large-margin | 51.08 / 67.57
Meta-SGD | 54.24 / 70.86
Semantic feature aug. | 58.12 / 76.92
Δ-encoder | 59.9 / 66.7
LEO | 61.76 / 77.59
Augmentation with GANs: Covariance-Preserving Adversarial Augmentation Networks, Gao et al., NeurIPS 2018
• Act in feature space. Generate samples for novel categories from offline (base) categories, selected by proximity of samples.
• Discriminate by classification: a real image must be assigned its correct class ('cat', 'puma', …), a synthetic image the label 'fake'.
• Cycle-consistency constraint between the two generators.
• Preserve the category distribution during the base → novel category transfer. Objective: penalize the difference between the two covariance matrices via the Ky Fan m-norm, i.e., the sum of the singular values of the m-truncated SVD.
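The Ky Fan m-norm used in the covariance-matching term is just the sum of the m largest singular values; a minimal NumPy sketch (the penalty wiring is illustrative):

```python
import numpy as np

def ky_fan_m_norm(a, m):
    """Ky Fan m-norm: the sum of the m largest singular values of a matrix."""
    s = np.linalg.svd(a, compute_uv=False)  # singular values, descending
    return s[:m].sum()

def covariance_penalty(feats_base, feats_novel, m=10):
    """Penalize the difference between base- and novel-class feature
    covariances, as in the covariance-preserving objective."""
    cov_diff = np.cov(feats_base, rowvar=False) - np.cov(feats_novel, rowvar=False)
    return ky_fan_m_norm(cov_diff, m)
```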
Method | ImageNet top-5 accuracy (%), 1/5-shot: Novel | Base+Novel
Wang et al., 2018 | 43.3 / 68.4 | 55.8 / 71.1
Gao et al., 2018 | 48.4 / 70.2 | 58.5 / 73.5
My personal view on the evolution of machine learning:
• Classic ML: one dataset, one task, one heavy training.
• Few-shot ML: heavy offline training, then easy learning on similar tasks.
• Developing ML: continuous life-long learning on various tasks.
Thank you
The presentation is available at http://www.research.ibm.com/haifa/dept/imt/ist_dm.shtml