One Model To Learn Them All
Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit
CS546 Course Presentation, Shruti Bhargava (shrutib2)
Advised by: Prof. Julia Hockenmaier


Page 1: One Model To Learn Them All - University of Illinois

One Model To Learn Them All

Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit

CS546 Course Presentation
Shruti Bhargava (shrutib2)

Advised by: Prof. Julia Hockenmaier

Page 2:

Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets
➢ Training details
➢ Performance Evaluation
➢ Key contributions / Limitations

Page 3:

Motivation

What is your favourite fruit?
Apple /ˈapəl/

Image Modality - Text Modality - Audio Modality
Write? Draw? Speak?

1. Process the question and think of an answer
2. Convey the answer to me

Page 4:

Motivation
➢ Humans reason about concepts independent of input/output modality
➢ Humans are able to reuse conceptual knowledge in different tasks

Page 5:

Understanding the task
➢ Multimodal Learning: single task, different domains
   E.g. Visual Question Answering - Input: Images + Text, Output: Text
➢ Multitask Learning: multiple tasks, mostly same domain
   E.g. Translation + Parsing
➢ This work = Multimodal + Multitask

Page 6:

Question addressed: Can one unified model solve tasks across multiple domains?

Page 7:

Multiple Tasks/Domains, One Model - the MultiModel


Page 9:

MultiModel Architecture
➢ Modality Nets

➢ Encoder-Decoder

➢ I/O Mixer

Page 10:

MultiModel: Input → Output
➢ Modality Net: domain-specific input → unified representation

➢ Encoder: unified input representations → encoded input

➢ I/O Mixer: encoded input ⇌ previous outputs

➢ Decoder: decodes (input + mixture) → output representation

➢ Modality Net: unified representation → domain-specific output
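The five stages above compose into one forward pass. As a minimal sketch (every component below is a hypothetical identity-style stand-in, not the paper's code; only the order of composition is the point):

```python
# Toy sketch of the MultiModel forward pass. Each function is a
# hypothetical stand-in that tags its input, so the data flow is visible.

def input_modality_net(raw_input):
    # domain-specific input -> unified representation
    return [("unified", x) for x in raw_input]

def encoder(unified):
    # unified input representation -> encoded input
    return [("encoded",) + u for u in unified]

def io_mixer(encoded, previous_outputs):
    # mixes the encoded input with previously produced outputs
    return encoded + [("mixed",) + p for p in previous_outputs]

def decoder(encoded, mixture):
    # decodes the mixture -> output representation
    # (the encoded input is already folded into the mixture in this toy)
    return [("decoded",) + m for m in mixture]

def output_modality_net(decoded):
    # unified output representation -> domain-specific output
    return [d[-1] for d in decoded]

def multimodel_step(raw_input, previous_outputs):
    unified = input_modality_net(raw_input)
    encoded = encoder(unified)
    mixture = io_mixer(encoded, previous_outputs)
    decoded = decoder(encoded, mixture)
    return output_modality_net(decoded)
```

The modality nets sit only at the boundaries; everything between them is shared across tasks.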

Page 11:

MultiModel: Input → Output

[Diagram: Input → Modality Nets → Output]

Page 12:

MultiModel: Modality Nets
Domain-specific Representation ↔ Unified Representation

4 modality nets - one net per domain:
➢ Language
➢ Image
➢ Audio
➢ Categorical (output only)

Page 13:

Modality Nets: Language Modality

Input Net and Output Net - [details omitted in transcript]

Input tokenized using 8k subword units
➢ Acts as an open vocabulary - e.g. [ad|mi|ral]
➢ Accounts for rare words

See details on vocabulary construction here.
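A toy illustration of what subword splitting buys: a greedy longest-match tokenizer over a fixed piece inventory can cover any word, including rare ones like "admiral". The vocabulary and algorithm here are invented for illustration; the paper's actual 8k subword vocabulary is constructed differently.

```python
# Toy greedy longest-match subword tokenizer (illustrative only).

def subword_tokenize(word, vocab):
    """Split `word` greedily into the longest vocab pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])             # fall back to a single char
            i += 1
    return pieces

vocab = {"ad", "mi", "ral", "the", "quick"}
print(subword_tokenize("admiral", vocab))      # -> ['ad', 'mi', 'ral']
```

Because the character-level fallback always succeeds, no word is ever out of vocabulary.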

Page 14:

MultiModel: Domain Agnostic Body

Page 15:

MultiModel: Domain Agnostic Body

Input → Encoder → I/O Mixer → Decoder

Page 16:

MultiModel: Building Blocks
Combines 3 state-of-the-art blocks:

➢ Convolutional: SOTA for images

➢ Attention: SOTA in language understanding

➢ Mixture-of-Experts (MoE): studied only for language

Page 17:

Building Block: ConvBlock

Depthwise Separable Convolutions
➢ convolution on each feature channel
➢ pointwise convolution for desired depth

Layer Normalisation
➢ statistics computed for a layer (per sample)

See details on layer normalisation and separable convolutions.
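The two pieces above can be sketched in NumPy. This is an assumed minimal formulation (1-D, no padding, no learned scale/shift in the normalisation), not the paper's exact ConvBlock:

```python
# Sketch of a ConvBlock's two ingredients: a depthwise separable
# convolution (per-channel spatial filter, then a 1x1 pointwise mix)
# and layer normalisation (statistics per position, across features).
import numpy as np

def depthwise_separable_conv1d(x, depth_filters, point_weights):
    """x: (length, channels); depth_filters: (k, channels);
    point_weights: (channels, out_channels)."""
    k, channels = depth_filters.shape
    length = x.shape[0]
    out = np.zeros((length - k + 1, channels))
    for c in range(channels):                  # one spatial filter per channel
        for t in range(length - k + 1):
            out[t, c] = x[t:t + k, c] @ depth_filters[:, c]
    return out @ point_weights                 # 1x1 pointwise mix of channels

def layer_norm(x, eps=1e-6):
    """Normalise each position over its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.randn(10, 4)                     # length 10, 4 channels
y = depthwise_separable_conv1d(x, np.ones((3, 4)) / 3, np.eye(4))
print(layer_norm(y).shape)                     # (8, 4)
```

The separable factorisation is the point: a k×C depthwise pass plus a C×C' pointwise pass replaces a full k×C×C' convolution.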

Page 18:

Building Block: Attention

See Details on the attention block here.
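The attention block builds on scaled dot-product attention. A minimal NumPy sketch of that generic operation (the standard formulation, not necessarily this model's exact multi-head block):

```python
# Scaled dot-product attention: each query attends over all keys and
# returns a softmax-weighted sum of the values.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (n, d); k, v: (m, d). Returns (n, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values

q = np.random.randn(2, 8)
k = np.random.randn(5, 8)
v = np.random.randn(5, 8)
print(scaled_dot_product_attention(q, k, v).shape)   # (2, 8)
```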

Page 19:

Building Block: Mixture of Experts

Sparsely-gated mixture-of-experts layer

➢ Experts: feed-forward neural networks

➢ Selection: trainable gating network

➢ Known booster for language tasks

See Details on the MoE block here.
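A minimal NumPy sketch of sparse gating: score all experts, run only the top-k, and combine their outputs with renormalised gate weights. Experts here are single linear maps for brevity and the counts are toy values (the paper uses feed-forward experts, up to 240 of them with k = 4):

```python
# Sparsely-gated mixture-of-experts layer (toy version).
import numpy as np

def moe_forward(x, expert_weights, gate_weights, k=2):
    """x: (d,); expert_weights: list of (d, d) matrices (linear 'experts');
    gate_weights: (d, n_experts)."""
    scores = x @ gate_weights                  # gating network logits
    top = np.argsort(scores)[-k:]              # indices of the top-k experts
    gate = np.exp(scores[top])
    gate /= gate.sum()                         # renormalise over the top-k
    out = np.zeros_like(x)
    for g, i in zip(gate, top):                # only k experts are evaluated
        out += g * (x @ expert_weights[i])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
experts = [rng.standard_normal((8, 8)) for _ in range(6)]
gates = rng.standard_normal((8, 6))
print(moe_forward(x, experts, gates).shape)    # (8,)
```

The sparsity is what makes very large expert counts affordable: compute grows with k, not with the total number of experts.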

Page 20:

Structurally similar to ByteNet - read here


Page 22:

Datasets/Tasks
➢ WSJ speech

➢ WSJ parsing

➢ ImageNet

➢ COCO image-captioning

➢ WMT English-German

➢ WMT German-English

➢ WMT English-French

➢ WMT German-French


Page 24:

Training Details
➢ A task token (e.g. To-English or To-Parse-Tree) is given to the decoder; an embedding vector is learned for each task token.
➢ Mixture-of-experts block:
   ● 240 experts for joint training, 60 for single-task training
   ● Gating selects 4
➢ Adam optimizer with gradient clipping
➢ Experiments on all tasks use the same hyperparameter values
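The task-token mechanism in the first bullet can be sketched as follows; the names and dimensions are illustrative, not the paper's:

```python
# Task-token conditioning: each task gets a learned embedding that is
# prepended to the decoder input, so one shared model knows which task
# to perform. Embeddings are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
task_embeddings = {                            # learned, one per task
    "To-English": rng.standard_normal(d_model),
    "To-German": rng.standard_normal(d_model),
    "To-Parse-Tree": rng.standard_normal(d_model),
}

def prepend_task_token(decoder_inputs, task):
    """decoder_inputs: (steps, d_model) -> (steps + 1, d_model)."""
    return np.vstack([task_embeddings[task], decoder_inputs])

steps = np.zeros((5, d_model))
print(prepend_task_token(steps, "To-English").shape)   # (6, 16)
```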


Page 26:

Experiments
➢ MultiModel vs state-of-the-art?
➢ Does simultaneous training on 8 problems help?
➢ Do blocks specialising in one domain help or harm the others?

Page 27:

Results
1. MultiModel vs state-of-the-art?

Page 28:

Results
2. Does simultaneous training help?

Page 29:

Results
3. Do blocks specialising in one domain help or harm the others?

MoE and Attention are the language-specialised blocks.


Page 31:

Key Contributions
➢ First model performing large-scale tasks across multiple domains
➢ Sets a blueprint for potential future AI (broadly applicable)
➢ Designs a multi-modal architecture with blocks from diverse modalities
➢ Demonstrates transfer learning across domains

Page 32:

Limitations
➢ Falls short of SOTA - the last few percentage points, as models approach the ceiling, are the most crucial part
➢ Incomplete experimentation - hyperparameters not tuned
➢ Incomplete results reported - only for some tasks
➢ Could be less robust to adversarial attacks

Page 33:

References
➢ https://venturebeat.com/2017/06/19/google-advances-ai-with-one-model-to-learn-them-all/
➢ https://aidangomez.ca/multitask.pdf
➢ https://blog.acolyer.org/2018/01/12/one-model-to-learn-them-all/
➢ Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
➢ Chollet, François. "Xception: Deep learning with depthwise separable convolutions." arXiv preprint (2016).
➢ Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

Page 34:

Thank You!

Page 35:

Modality Nets
➢ Image Modality Net - analogous to Xception entry flow; uses residual convolution blocks
➢ Categorical Modality Net - analogous to Xception exit flow; global average pooling after conv layers
➢ Audio Modality Net - similar to Image Modality Net