Multi-Task & Meta-Learning Basics (cs330.stanford.edu/slides/cs330_lec2.pdf)
TRANSCRIPT
[Page 1]
CS 330
Multi-Task & Meta-Learning Basics
[Page 2]
Logistics
Homework 1 posted today, due Wednesday, October 9
Fill out paper preferences by tomorrow.
TensorFlow review session tomorrow, 4:30 pm in Gates B03
[Page 3]
Plan for Today
Multi-Task Learning - Models & training - Challenges - Case study of real-world multi-task learning
— short break —
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches (topic of Homework 1!)
[Page 4]
Multi-Task Learning Basics
[Page 5]
Some notation
What is a task? (more formally this time)
Single-task learning [supervised]: 𝒟 = {(x, y)_k}, min_θ ℒ(θ, 𝒟)
Typical loss: negative log likelihood: ℒ(θ, 𝒟) = −𝔼_{(x,y)∼𝒟}[log f_θ(y | x)]
(x: input, y: label, f_θ(y | x): model, θ: model parameters)
A task: 𝒯_i ≜ {p_i(x), p_i(y | x), ℒ_i}, where p_i(x), p_i(y | x) are data-generating distributions
Corresponding datasets: 𝒟_i^tr, 𝒟_i^test (will use 𝒟_i as shorthand for 𝒟_i^tr)
[Figure: example outputs: classification labels (tiger, tiger, cat, lynx, cat) and a regression target (length of paper)]
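The negative log likelihood above can be computed empirically as a mean over the dataset; a toy sketch (the function name and its inputs are illustrative, not from the slides):

```python
import math

def nll_loss(probs, labels):
    """L(theta, D) = -E_{(x,y)~D}[ log f_theta(y | x) ], as an empirical mean.

    probs[k][y] plays the role of f_theta(y | x_k); labels[k] is y_k.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Two datapoints, each with probability 0.5 on the correct label:
loss = nll_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1])  # -> log 2 ≈ 0.6931
```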
[Page 6]
Examples of Tasks
A task: 𝒯_i ≜ {p_i(x), p_i(y | x), ℒ_i} (data-generating distributions)
Corresponding datasets: 𝒟_i^tr, 𝒟_i^test (will use 𝒟_i as shorthand for 𝒟_i^tr)
Multi-task classification: ℒ_i same across all tasks
e.g. per-language handwriting recognition, personalized spam filter
Multi-label learning: ℒ_i, p_i(x) same across all tasks
e.g. CelebA attribute recognition, scene understanding
When might ℒ_i vary across tasks?
- mixed discrete, continuous labels across tasks
- if you care more about one task than another
[Page 7]
Model: f_θ(y | x) with parameters θ becomes f_θ(y | x, z_i), where z_i is a task descriptor,
e.g. a one-hot encoding of the task index, or whatever meta-data you have:
- personalization: user features/attributes
- language description of the task
- formal specifications of the task
Objective: min_θ ∑_{i=1}^T ℒ_i(θ, 𝒟_i)
A model decision and an algorithm decision:
- How should we condition on z_i?
- How to optimize our objective?
[Figure: example tasks for one input: length of paper, paper review, summary of paper]
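As a toy sketch of conditioning on a one-hot task index (the weights and names here are illustrative, not from the slides):

```python
def one_hot(i, T):
    """One-hot task descriptor z_i for task i out of T tasks."""
    return [1.0 if j == i else 0.0 for j in range(T)]

def f_theta(x, z):
    # Toy linear "network" on the concatenated input [x; z_i]; the weights
    # on the z_i entries act as learned per-task biases.
    w_x, w_z = 2.0, [0.5, -0.5]
    return w_x * x + sum(w * z_j for w, z_j in zip(w_z, z))

y_task0 = f_theta(1.0, one_hot(0, 2))  # 2*1 + 0.5 = 2.5
y_task1 = f_theta(1.0, one_hot(1, 2))  # 2*1 - 0.5 = 1.5
```

The same input produces different outputs per task, purely through the descriptor z_i.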
[Page 8]
Conditioning on the task
Question: How should you condition on the task in order to share as little as possible?
Let's assume z_i is the task index.
[Page 9]
Conditioning on the task
One extreme: independent networks x → y_1, …, x → y_T with no shared parameters, with z_i selecting the output:
y = ∑_j 1(z_i = j) y_j
This is multiplicative gating → independent training within a single network, with no shared parameters!
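The gating formula above can be written out directly (a toy sketch):

```python
def gated_output(head_outputs, task_index):
    """y = sum_j 1(z_i = j) * y_j: the indicator picks the active task's head."""
    return sum(y_j * (1.0 if j == task_index else 0.0)
               for j, y_j in enumerate(head_outputs))

heads = [10.0, 20.0, 30.0]              # outputs y_1..y_T of T independent heads
y = gated_output(heads, task_index=2)   # -> 30.0
```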
[Page 10]
The other extreme
Concatenate z_i with the input and/or activations.
All parameters are shared, except the parameters directly following z_i.
[Page 11]
An Alternative View on the Multi-Task Objective
Split θ into shared parameters θ^sh and task-specific parameters θ^i.
Then, our objective is: min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i)
Choosing how to condition on z_i is equivalent to choosing how & where to share parameters.
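The split objective can be sketched with scalar parameters, where each task shares a slope but keeps its own bias (a toy illustration; all names and values are mine, not from the slides):

```python
def multitask_objective(theta_sh, theta_tasks, datasets, task_loss):
    """sum_i L_i({theta_sh, theta_i}, D_i) with a shared/task-specific split."""
    return sum(task_loss(theta_sh, theta_i, D_i)
               for theta_i, D_i in zip(theta_tasks, datasets))

def squared_loss(theta_sh, theta_i, D_i):
    # Shared slope theta_sh, per-task bias theta_i (both scalars).
    return sum((theta_sh * x + theta_i - y) ** 2 for x, y in D_i)

datasets = [[(1.0, 3.0)], [(1.0, 1.0)]]   # one (x, y) pair per task
# Slope 2.0 shared, biases +1 and -1 fit both tasks exactly:
total = multitask_objective(2.0, [1.0, -1.0], datasets, squared_loss)  # -> 0.0
```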
[Page 12]
Conditioning: Some Common Choices
1. Concatenation-based conditioning: concatenate z_i with the input to a layer
2. Additive conditioning: add a transformation of z_i to the layer input
These are actually the same!
Diagram sources: distill.pub/2018/feature-wise-transformations/
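The equivalence is easy to verify for a single linear layer: a linear map on the concatenation [x; z] decomposes as W_x x + W_z z, so concatenation-based conditioning is additive conditioning. A small check in plain Python (all weights illustrative):

```python
def linear(W, v):
    # Matrix-vector product with plain lists.
    return [sum(w * u for w, u in zip(row, v)) for row in W]

x = [1.0, 2.0]            # layer input
z = [0.0, 1.0]            # one-hot task descriptor
W_x = [[1.0, 2.0]]        # weights applied to x
W_z = [[3.0, 4.0]]        # weights applied to z

# Concatenation-based: a single linear layer on [x; z]
y_concat = linear([W_x[0] + W_z[0]], x + z)

# Additive: transform x and z separately, then add the results
y_add = [a + b for a, b in zip(linear(W_x, x), linear(W_z, z))]
```

Both paths produce the same output, here [9.0].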
[Page 13]
Conditioning: Some Common Choices
3. Multi-head architecture (Ruder '17)
4. Multiplicative conditioning
Why might multiplicative conditioning be a good idea?
- more expressive
- recall: multiplicative gating
Multiplicative conditioning generalizes independent networks and independent heads.
Diagram sources: distill.pub/2018/feature-wise-transformations/
[Page 14]
Conditioning: More Complex Choices
Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert ‘16 Multi-Task Attention Network. Liu, Johns, Davison ‘18
Deep Relation Networks. Long, Wang ‘15
Sluice Networks. Ruder, Bingel, Augenstein, Sogaard ‘17
[Page 15]
Conditioning Choices
Unfortunately, these design decisions are like neural network architecture tuning:
- problem dependent
- largely guided by intuition or knowledge of the problem
- currently more of an art than a science
[Page 16]
Optimizing the objective
Objective: min_θ ∑_{i=1}^T ℒ_i(θ, 𝒟_i)
Basic version:
1. Sample a mini-batch of tasks: ℬ ∼ {𝒯_i}
2. Sample a mini-batch of datapoints for each task: 𝒟_i^b ∼ 𝒟_i
3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = ∑_{𝒯_k∈ℬ} ℒ_k(θ, 𝒟_k^b)
4. Backpropagate the loss to compute the gradient ∇_θ ℒ̂
5. Apply the gradient with your favorite neural net optimizer (e.g. Adam)
Note: this ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: for regression problems, make sure your task labels are on the same scale!
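The five steps above can be sketched as a tiny training loop. This is a toy version with a single scalar parameter and a hand-written gradient (plain SGD standing in for "your favorite optimizer"; all names are illustrative):

```python
import random

def sq_grad(theta, D):
    # d/dtheta of sum (theta*x - y)^2 for a scalar linear model y_hat = theta*x.
    return sum(2 * (theta * x - y) * x for x, y in D)

def train_step(theta, datasets, grads, lr=0.1, task_batch=2, data_batch=4):
    """One step of the basic multi-task recipe, for a shared scalar parameter."""
    # 1. Sample a mini-batch of tasks B ~ {T_i}
    batch = random.sample(range(len(datasets)), k=task_batch)
    grad = 0.0
    for i in batch:
        # 2. Sample a mini-batch of datapoints D_i^b ~ D_i
        D_b = random.sample(datasets[i], k=min(data_batch, len(datasets[i])))
        # 3.-4. Accumulate the gradient of L_hat(theta, B) = sum_k L_k(theta, D_k^b)
        grad += grads[i](theta, D_b)
    # 5. Apply the gradient
    return theta - lr * grad

# Two toy regression tasks that agree on theta = 2:
datasets = [[(1.0, 2.0)] * 4, [(1.0, 2.0)] * 4]
theta = 0.0
for _ in range(200):
    theta = train_step(theta, datasets, [sq_grad, sq_grad])
# theta converges to 2.0
```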
[Page 17]
Challenges
[Page 18]
Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work the best.
[Results: Multi-Task CIFAR-100, comparing state-of-the-art multi-task approaches]
Why?
- optimization challenges: caused by cross-task interference; tasks may learn at different rates
- limited representational capacity: multi-task networks often need to be much larger than their single-task counterparts
[Page 19]
It's not just a binary decision!
"Soft parameter sharing":
min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i) + ∑_{t'=1}^T ∥θ^t − θ^{t'}∥
If you have negative transfer, share less across tasks.
+ allows for more fluid degrees of parameter sharing
- yet another set of design decisions / hyperparameters
[Diagram: per-task networks x → y_1, …, x → y_T with constrained (soft-shared) weights]
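The soft-sharing penalty term can be illustrated with scalar per-task parameters. This toy sketch sums the distance over all ordered task pairs, which is one possible reading of the slide's formula:

```python
def soft_sharing_penalty(thetas):
    """Sum over task pairs of ||theta_t - theta_t'|| (scalar parameters here)."""
    return sum(abs(t - tp) for t in thetas for tp in thetas)

# Tasks whose parameters drift apart pay a larger penalty:
penalty = soft_sharing_penalty([1.0, 2.0, 4.0])  # -> 12.0
```

Adding this penalty to the summed task losses pulls the per-task parameters toward each other without forcing them to be identical.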
[Page 20]
Challenge #2: Overfitting
You may not be sharing enough!
Multi-task learning <-> a form of regularization
Solution: Share more.
[Page 21]
Case study
Goal: Make recommendations for YouTube
[Page 22]
Case study
Goal: make recommendations for YouTube
Conflicting objectives:
- videos that users will rate highly
- videos that users will share
- videos that users will watch
Implicit bias caused by feedback: the user may have watched a video because it was recommended!
[Page 23]
Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos
2. Rank the candidates
3. Serve the top-ranking videos to the user
Candidate videos: pool videos from multiple candidate generation algorithms
- matching topics of the query video
- videos most frequently watched with the query video
- and others
Ranking: the central topic of this paper
[Page 24]
The Ranking Problem
Input: query video, candidate video, user & context features
Model output: engagement and satisfaction with the candidate video
Engagement:
- binary classification tasks, like clicks
- regression tasks for time spent
Satisfaction:
- binary classification tasks, like clicking "like"
- regression tasks, such as rating
Weighted combination of engagement & satisfaction predictions → ranking score (score weights manually tuned)
[Page 25]
The Architecture
Basic option: the "Shared-Bottom Model" (i.e. the multi-head architecture)
→ can harm learning when correlation between tasks is low
[Page 26]
The Architecture
Instead: use a form of soft parameter sharing, "Multi-gate Mixture-of-Experts (MMoE)", built from expert neural networks.
Allow different parts of the network to "specialize":
- Decide which experts to use for input x, task k: g^k(x) = softmax(W_{g^k} x)
- Compute features from the selected experts: f^k(x) = ∑_i g^k(x)_i f_i(x)
- Compute the output: y_k = h^k(f^k(x))
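A minimal sketch of an MMoE-style forward pass, using toy scalar experts, gates, and towers (illustrative only, not the paper's TensorFlow implementation):

```python
import math

def softmax(v):
    m = max(v)                      # shift for numerical stability
    exps = [math.exp(u - m) for u in v]
    s = sum(exps)
    return [e / s for e in exps]

def mmoe_forward(x, experts, gates, towers):
    """Per-task output y_k = h^k( sum_i g^k(x)_i * f_i(x) )."""
    expert_outs = [f(x) for f in experts]      # shared experts f_i(x)
    outputs = []
    for gate, tower in zip(gates, towers):     # one gate + tower per task
        weights = softmax(gate(x))             # task-specific gate g^k(x)
        mixed = sum(w * e for w, e in zip(weights, expert_outs))
        outputs.append(tower(mixed))           # task-specific tower h^k
    return outputs

# Toy instantiation: two scalar experts; extreme gate logits so each task
# effectively selects a single expert.
experts = [lambda x: x, lambda x: 2 * x]
gates = [lambda x: [1000.0, 0.0], lambda x: [0.0, 1000.0]]
towers = [lambda v: v, lambda v: v]
outs = mmoe_forward(3.0, experts, gates, towers)  # ≈ [3.0, 6.0]
```

With softer gate logits the tasks blend the experts instead of hard-selecting them, which is where the "specialization" with sharing comes from.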
[Page 27]
Experiments
Set-up:
- Implementation in TensorFlow, on TPUs
- Train in temporal order, running training continuously to consume newly arriving data
- Offline AUC & squared-error metrics
- Online A/B testing in comparison to the production system: live metrics based on time spent, survey responses, rate of dismissals
- Model computational efficiency matters
Results:
- Found a 20% chance of gating polarization during distributed training → use dropout on the experts
[Page 28]
Plan for Today
Multi-Task Learning - Models & training - Challenges - Case study of real-world multi-task learning
— short break —
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches (topic of Homework 1!)
[Page 29]
Meta-Learning Basics
[Page 30]
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep neural network model that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
➢ This view makes it easier to implement meta-learning algorithms
Probabilistic view
➢ Extract prior information from a set of (meta-training) tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and (small) training set to infer most likely posterior parameters
➢ This view makes it easier to understand meta-learning algorithms
[Page 31]
Problem definitions
Standard supervised learning: max_θ log p(𝒟 | θ) + log p(θ), where 𝒟 = {(x_k, y_k)} is the training data, x the input (e.g., image), y the label, and θ the model parameters; log p(𝒟 | θ) is the data likelihood, and log p(θ) acts as a regularizer (e.g., weight decay).
What is wrong with this?
- The most powerful models typically require large amounts of labeled data
- Labeled data for some tasks may be very limited
[Page 32]
Problem definitions
Image adapted from Ravi & Larochelle
[Page 33]
The meta-learning problem
Given data from a set of meta-training tasks, learn something that enables efficient learning of new tasks: this is the meta-learning problem.
[Page 34]
A Quick Example
[Figure: a few-shot episode: a small training set, plus a test input whose test label the model must predict]
[Page 35]
How do we train this thing?
Key idea: “our training procedure is based on a simple machine learning principle: test and train conditions must match”
Vinyals et al., Matching Networks for One-Shot Learning
[Figure: a training episode, including a test input and its test label]
[Page 36]
How do we train this thing?
Key idea: “our training procedure is based on a simple machine learning principle: test and train conditions must match”
Vinyals et al., Matching Networks for One-Shot Learning
[Figure: (meta) training-time episodes include test inputs with their test labels; at (meta) test-time the test labels ("???") are unknown]
[Page 37]
Reserve a test set for each task!
Key idea: “our training procedure is based on a simple machine learning principle: test and train conditions must match”
Vinyals et al., Matching Networks for One-Shot Learning
(meta) training-time
[Page 38]
The complete meta-learning optimization
unobserved at meta-test time
[Page 39]
Some meta-learning terminology
- (meta-training) task, (meta-test) task
- support (set), query (set)
- shot (i.e., k-shot, 5-shot: the number of labeled support examples per class)
image credit: Ravi & Larochelle '17
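Using this terminology, building one (meta-training) episode amounts to splitting a task's data into a support set and a query set; a toy sketch (the function name and set sizes are illustrative):

```python
import random

def make_episode(task_data, k_shot=2, n_query=2, seed=0):
    """Split one task's labeled data into a k-shot support set and a query set."""
    rng = random.Random(seed)
    data = list(task_data)
    rng.shuffle(data)
    support = data[:k_shot]                  # what the learner adapts on
    query = data[k_shot:k_shot + n_query]    # what the adaptation is scored on
    return support, query

task = [(float(i), i % 2) for i in range(10)]   # toy (input, label) pairs
support, query = make_episode(task)
```

The support and query sets are disjoint, matching the "reserve a test set for each task" idea from the earlier slides.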
[Page 40]
Closely related problem settings
[Page 41]
Plan for Today
Multi-Task Learning - Models & training - Challenges - Case study of real-world multi-task learning
— short break —
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches (topic of Homework 1!)
Rest will be covered next time.
[Page 42]
Reminders
Homework 1 posted today, due Wednesday, October 9
Fill out paper preferences by tomorrow.
TensorFlow review session tomorrow, 4:30 pm in Gates B03