
Page 1: Large Pretrained Models - nlp.cs.hku.hk

Lingpeng Kong

Department of Computer Science, The University of Hong Kong

Many materials from Stanford CS224n with special thanks!

Large Pretrained Models

COMP3361 — Week 9

Page 2: Large Pretrained Models - nlp.cs.hku.hk

Pretrained Models in the Past Four Years

Microsoft Research Blog. Oct 6, 2021.


Page 4: Large Pretrained Models - nlp.cs.hku.hk

Pretrained Models are Expensive

One single training run: an estimated 552 metric tons of carbon dioxide (roughly the annual emissions of 120 cars) and about $12 million in compute.

Page 5: Large Pretrained Models - nlp.cs.hku.hk

Pretraining and Contextualized Word Representations

[Diagram: a Transformer over the sentence pair “I feel like eating …” [SEP] “What … you want ?”, with some words replaced by [MASK]; the MLM head predicts the masked words and the NSP head reads [CLS].]

$$E_{p(x_i, \hat{x}_i)}\left[ p(x_i \mid \hat{x}_i) \right]$$

Page 6: Large Pretrained Models - nlp.cs.hku.hk

Pretraining and Contextualized Word Representations

Jurassic Park lacks the emotional unity of Spielberg’s classics .

Neural Network Encoder (LSTMs, Transformers, etc.)

contextualized word representation

Implicit linguistic knowledge

Page 7: Large Pretrained Models - nlp.cs.hku.hk

Pretraining and Fine-tuning

Jurassic Park lacks the emotional unity of Spielberg’s classics .

Neural Network Encoder (LSTMs, Transformers, etc.)

hundreds of millions of parameters

MLP Layer — hundreds of parameters

Page 8: Large Pretrained Models - nlp.cs.hku.hk

Key Elements in BERT

Masked Language Modeling (MLM), Next Sentence Prediction (NSP) — pretraining objective

Transformer — neural representation learner

Bidirectional encoder — type of architecture

Page 9: Large Pretrained Models - nlp.cs.hku.hk

Neural Representation Learners

LSTM: ELMo

Transformer: BERT, GPT-2, GPT-3, BART, T5, XLNet

Page 10: Large Pretrained Models - nlp.cs.hku.hk

Why Transformers?

[Diagram: a stack of computing blocks (1, 2, …, i), each containing an FFN; the representation $x_t$ entering a block becomes $x_{t+1}$ after it, and every position can be processed in parallel.]

Page 11: Large Pretrained Models - nlp.cs.hku.hk

Why Transformers?

<latexit sha1_base64="yKta//3FQy1vQ31n+yNYJMH8Inw=">AAACTXicbVBNSyNBEO3J+pl1NerRy2AQ9iBhZhHXi+iyBz0qGBWSEHp6apImPd293TVidpi/spc96J8Rj+IP8SZiz8SDXwUNj/eq6lW/SAtuMQjuvdqXqemZ2bn5+teFb4tLjeWVU6syw6DNlFDmPKIWBJfQRo4CzrUBmkYCzqLR71I/uwBjuZInONbQS+lA8oQzio7qN1a6CJdY7ckNxEX+q+g3mkErqMr/CMIX0Ny70f/v6rt/jvrLXrMbK5alIJEJam0nDDT2cmqQMwFFvZtZ0JSN6AA6Dkqagu3llWnhbzgm9hNl3JPoV+zriZym1o7TyHWmFIf2vVaSn2mdDJOdXs6lzhAkmxglmfBR+WUSfswNMBRjBygz3N3qsyE1lKHL640L8tHfyS9KJHhkqBnnCcdNrSwvg+RysBkDU6aK1ba0uyZVRg9LgVHBirqLNXwf4kdw+qMVbre2joPm/gGZ1BxZI+vkOwnJT7JPDskRaRNGLsk/ckWuvVvvwXv0niatNe9lZpW8qdrsM07ruTM=</latexit>

A

<latexit sha1_base64="Dpg4A5h/OyqHOrXGvQTH1K0B4Xg=">AAACVXicbZDfShwxFMYz4//Vumt7pzehiyBFlxmR1kuhF/ZSwVXp7rJksmd2w2aSkJwp3Q4DPk1v9W1KH8JHEMzMeuG/A4GP7zvJOfklRgqHUfQ/CBcWl5ZXVtca6xsfNputrY+XTueWQ5drqe11whxIoaCLAiVcGwssSyRcJdPvVX71C6wTWl3gzMAgY2MlUsEZemvY2u4j/Mb6nSKROZTFZFjgQVyWdNhqR52oLvpWxE+ifbJ7f3yqv/w8G24F7f5I8zwDhVwy53pxZHBQMIuCSygb/dyBYXzKxtDzUrEM3KCoh5d01zsjmmrrj0Jau89vFCxzbpYlvjNjOHGvs8p8L+vlmB4PCqFMjqD4fFCaS4qaVkToSFjgKGdeMG6F35XyCbOMo+f2YgqK6Z/5LyolRWKZnRWpwH2jnaiACjXeHwHXtsbrOsZvk2lrJlXAmeRlw2ONX0N8Ky4PO/HXztG553tK5rVKdshnskdi8o2ckB/kjHQJJzfkL7kld8G/4CFcDJfnrWHwdOcTeVFh8xE6+bl4</latexit>

ht�1

<latexit sha1_base64="x6yT+powiPzANGgKm18RcFeQf+k=">AAACY3icbVDLThsxFHWmLwh9BMoOVRo1QkIVjWaqqrBE6gKWIBFAJFF0x3MnseKxLftORbDmF/o13bb/0Q/osv+AJ2HB60qWj8+51z4+mZHCUZL8bUXPnr94+Wpltb32+s3bd531jTOnK8uxz7XU9iIDh1Io7JMgiRfGIpSZxPNs9r3Rz3+gdUKrU5obHJUwUaIQHChQ487OkPCKFvd4bUFNsPbDTMvczcuw+at67OlzWtfjTjfpJYuKH4P0FnQPtv/vH+pPl8fj9VZ3mGtelaiIS3BukCaGRh4sCS6xbg8rhwb4DCY4CFBBiW7kF1bqeDsweVxoG5aieMHenfBQusZi6CyBpu6h1pBPaYOKiv2RF8pUhIovHyoqGZOOm3ziXFjkJOcBALcieI35FCxwCinee4XE7Hr5iwZJkVmwc18I2jXaiSZeoSa7OfKQa3NyPRPclNqaaSNwkLxuh1jThyE+Bmdfeum33teTkO8hW9YK22If2Q5L2R47YEfsmPUZZz/ZL/ab/Wn9i9aijWhz2Rq1bmfes3sVfbgBfIC/VA==</latexit>xt�1

<latexit sha1_base64="yKta//3FQy1vQ31n+yNYJMH8Inw=">AAACTXicbVBNSyNBEO3J+pl1NerRy2AQ9iBhZhHXi+iyBz0qGBWSEHp6apImPd293TVidpi/spc96J8Rj+IP8SZiz8SDXwUNj/eq6lW/SAtuMQjuvdqXqemZ2bn5+teFb4tLjeWVU6syw6DNlFDmPKIWBJfQRo4CzrUBmkYCzqLR71I/uwBjuZInONbQS+lA8oQzio7qN1a6CJdY7ckNxEX+q+g3mkErqMr/CMIX0Ny70f/v6rt/jvrLXrMbK5alIJEJam0nDDT2cmqQMwFFvZtZ0JSN6AA6Dkqagu3llWnhbzgm9hNl3JPoV+zriZym1o7TyHWmFIf2vVaSn2mdDJOdXs6lzhAkmxglmfBR+WUSfswNMBRjBygz3N3qsyE1lKHL640L8tHfyS9KJHhkqBnnCcdNrSwvg+RysBkDU6aK1ba0uyZVRg9LgVHBirqLNXwf4kdw+qMVbre2joPm/gGZ1BxZI+vkOwnJT7JPDskRaRNGLsk/ckWuvVvvwXv0niatNe9lZpW8qdrsM07ruTM=</latexit>

A

<latexit sha1_base64="nrkEd8t8hkoI/4c7l9PSpLwa7iI=">AAACVXicbZDfShwxFMYz4//Vdtf2Tm+CiyCtLDMirZdCL+ylgqvi7rJksmd2w2aSkJwp3Q4DPk1v9W1KH8JHEMzMeuG/A4GP7zvJOfklRgqHUfQ/CBcWl5ZXVtca6xsfPjZbm58unM4thy7XUturhDmQQkEXBUq4MhZYlki4TKY/qvzyF1gntDrHmYFBxsZKpIIz9NawtdVH+I31O0UicyiLybDAr3FZ0mGrHXWiuuhbET+J9vHu/dGJ/nJ9OtwM2v2R5nkGCrlkzvXiyOCgYBYFl1A2+rkDw/iUjaHnpWIZuEFRDy/prndGNNXWH4W0dp/fKFjm3CxLfGfGcOJeZ5X5XtbLMT0aFEKZHEHx+aA0lxQ1rYjQkbDAUc68YNwKvyvlE2YZR8/txRQU0z/zX1RKisQyOytSgftGO1EBFWq8PwKubY3XdYzfJtPWTKqAM8nLhscav4b4VlwcdOJvncMzz/eEzGuVbJMdskdi8p0ck5/klHQJJzfkL7kld8G/4CFcDJfnrWHwdOczeVFh8xE3Rbl2</latexit>

ht+1

<latexit sha1_base64="OM7nXNLBGIOCvjRbmmfWlkpbQ84=">AAACY3icbVDLThsxFHWmLwh9BMoOVRo1QkItimaqqrBE6gKWIBFAJFF0x3MnseKxLftORbDmF/o13bb/0Q/osv+AJ2HB60qWj8+51z4+mZHCUZL8bUXPnr94+Wpltb32+s3bd531jTOnK8uxz7XU9iIDh1Io7JMgiRfGIpSZxPNs9r3Rz3+gdUKrU5obHJUwUaIQHChQ487OkPCKFvd4bUFNsPbDTMvczcuw+at67OlzWtfjTjfpJYuKH4P0FnQPtv/vH+pPl8fj9VZ3mGtelaiIS3BukCaGRh4sCS6xbg8rhwb4DCY4CFBBiW7kF1bqeDsweVxoG5aieMHenfBQusZi6CyBpu6h1pBPaYOKiv2RF8pUhIovHyoqGZOOm3ziXFjkJOcBALcieI35FCxwCinee4XE7Hr5iwZJkVmwc18I2jXaiSZeoSa7OfKQa3NyPRPclNqaaSNwkLxuh1jThyE+Bmdfeum33teTkO8hW9YK22If2Q5L2R47YEfsmPUZZz/ZL/ab/Wn9i9aijWhz2Rq1bmfes3sVfbgBeM6/Ug==</latexit>xt+1

<latexit sha1_base64="iCteJoXG2XVarc3QTnSciZoEvs0=">AAACUXicbZDPbhMxEMZnt/wpaYGUSlzKwSJC4lBFuxWiHCtxaI9FatpKSYi8zmxixWtb9mxFWPbYF+kVbr3zEpx4FG54Nz3QlpEsffq+sWf8y6ySnpLkdxSvPXj46PH6k87G5tNnz7tbL069KZ3AgTDKuPOMe1RS44AkKTy3DnmRKTzLFh+b/OwCnZdGn9DS4rjgMy1zKTgFa9J9OSL8Qu07VaZKrKv5hGo26faSftIWuy/SG9E7eHW5+fl6++fxZCvqjaZGlAVqEop7P0wTS+OKO5JCYd0ZlR4tFws+w2GQmhfox1U7uGZvgjNluXHhaGKt+++NihfeL4ssdBac5v5u1pj/y4Yl5R/GldS2JNRiNSgvFSPDGhpsKh0KUssguHAy7MrEnDsuKDC7NYXk4uvqF41SMnPcLatc0q41XjYwpZ7tTlEY16L1fRu2KYyz8yYQXIm6E7CmdyHeF6d7/fR9/92nwPcQVrUOO/Aa3kIK+3AAR3AMAxDwDa7gO/yIfkV/YojjVWsc3dzZhlsVb/wFGc64BQ==</latexit>

ht

<latexit sha1_base64="swEwItEFPj9KIPdwlEYzj9ErqPI=">AAACYXicbVBNbxMxEHWWQtsUaFKOvawaUXGoot2qgh44VOIAEpcgkbYoiaJZ72xixWtb9ixqsPYfcOC39Fr+CGf+CN6kB/oxkuXn92bs55cZKRwlyZ9W9GTj6bPNre32zvMXL3c73b1zpyvLcci11PYyA4dSKBySIImXxiKUmcSLbPGh0S++o3VCq6+0NDgpYaZEIThQoKadwzHhFa3u8dqCmmHtx5mWuVuWYfNX9dRTXU87vaSfrCp+CNJb0Ds7+Dbo/vr8fjDttnrjXPOqREVcgnOjNDE08WBJcIl1e1w5NMAXMMNRgApKdBO/MlLHrwOTx4W2YSmKV+z/Ex5K1xgMnSXQ3N3XGvIxbVRRcTrxQpmKUPH1Q0UlY9Jxk06cC4uc5DIA4FYErzGfgwVOIcM7r5BY/Fj/okFSZBbs0heCjox2oglXqNlRjjyk2pxc3wQ3pbZm3ggcJK/bIdb0fogPwflxP33bP/kS8v3I1rXF9tkBe8NS9o6dsU9swIaMs5/smt2w362/0XbUifbWrVHrduYVu1PR/j9yOb5R</latexit>xt

<latexit sha1_base64="yKta//3FQy1vQ31n+yNYJMH8Inw=">AAACTXicbVBNSyNBEO3J+pl1NerRy2AQ9iBhZhHXi+iyBz0qGBWSEHp6apImPd293TVidpi/spc96J8Rj+IP8SZiz8SDXwUNj/eq6lW/SAtuMQjuvdqXqemZ2bn5+teFb4tLjeWVU6syw6DNlFDmPKIWBJfQRo4CzrUBmkYCzqLR71I/uwBjuZInONbQS+lA8oQzio7qN1a6CJdY7ckNxEX+q+g3mkErqMr/CMIX0Ny70f/v6rt/jvrLXrMbK5alIJEJam0nDDT2cmqQMwFFvZtZ0JSN6AA6Dkqagu3llWnhbzgm9hNl3JPoV+zriZym1o7TyHWmFIf2vVaSn2mdDJOdXs6lzhAkmxglmfBR+WUSfswNMBRjBygz3N3qsyE1lKHL640L8tHfyS9KJHhkqBnnCcdNrSwvg+RysBkDU6aK1ba0uyZVRg9LgVHBirqLNXwf4kdw+qMVbre2joPm/gGZ1BxZI+vkOwnJT7JPDskRaRNGLsk/ckWuvVvvwXv0niatNe9lZpW8qdrsM07ruTM=</latexit>

A

Page 12: Large Pretrained Models - nlp.cs.hku.hk

Why Transformers?

self-attention

Direct pairwise interaction between any two tokens in the sequence.
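
To make this concrete, here is a minimal single-head self-attention sketch in PyTorch (dimensions and weight names are illustrative, not from the lecture): the scores for every token pair come out of a single matrix product, with no sequential dependence between positions.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (seq_len, d).

    Q, K, V are linear projections of the same input; the (seq_len, seq_len)
    score matrix gives a direct interaction between every pair of tokens.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise token-token scores
    return F.softmax(scores, dim=-1) @ v     # weighted mix of all positions

seq_len, d = 5, 16
x = torch.randn(seq_len, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (5, 16): one step, fully parallel
```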

Page 13: Large Pretrained Models - nlp.cs.hku.hk

Pretraining Objective

training instance (MLM):

x: I feel like eating <MASK> today. What <MASK> you want to eat?

y: noodles, do

training instance (NSP):

x: I feel like eating <MASK> today. ||| What <MASK> you want to eat?

y: True
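
A small sketch of how such MLM instances can be generated (whitespace tokenization and the bare masking rule are simplifications; BERT additionally applies 80/10/10 replacement rules):

```python
import random

def make_mlm_instance(tokens, mask_rate=0.15, mask_token="<MASK>"):
    """Randomly mask tokens; return corrupted input x and targets y.

    BERT masks ~15% of positions (its further 80/10/10 replacement
    scheme is omitted here for brevity).
    """
    x, y = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            x.append(mask_token)
            y.append(tok)  # the model must recover this token
        else:
            x.append(tok)
    return x, y

sent = "I feel like eating noodles today .".split()
x, y = make_mlm_instance(sent)
print(x, y)  # e.g. ['I', 'feel', 'like', 'eating', '<MASK>', ...] ['noodles']
```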

Page 14: Large Pretrained Models - nlp.cs.hku.hk

Pretraining Objective

What makes a good pretraining objective?

1. No human labeling should be involved.

2. It should lead to good representations. (How, and why?)

Page 15: Large Pretrained Models - nlp.cs.hku.hk

Mutual Information

$$I(A, B) = H(A) - H(A \mid B) = H(B) - H(B \mid A).$$

Goal of Training:

$$I(A, B) \geq \mathbb{E}_{p(a,b)}\!\left[ f_\theta(a, b) - \mathbb{E}_{q(\tilde{B})}\!\left[ \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right] \right] + \log |\tilde{B}|.$$

Taking $\tilde{B}$ to be the full candidate set $B$ recovers the Cross Entropy (Softmax) objective:

$$\mathbb{E}_{p(a,b)}\!\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in B} \exp f_\theta(a, \tilde{b}) \right],$$

with a dot-product critic

$$f_\theta(a, b) = g_\psi(b)^\top g_\omega(a), \qquad \theta = \{\omega, \psi\}.$$

InfoNCE (Logeswaran & Lee, 2018; van den Oord et al., 2019)
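
As a sketch, this bound with in-batch negatives is literally a softmax cross-entropy; the pairing of rows and columns below is an illustrative choice, not the lecture's exact setup:

```python
import torch
import torch.nn.functional as F

def info_nce(g_a, g_b):
    """InfoNCE with in-batch negatives.

    g_a, g_b: (batch, dim) encodings of the two views, so that
    f(a_i, b_j) = g_a[i] @ g_b[j] and row i's positive is column i.
    """
    scores = g_a @ g_b.T                     # (batch, batch) critic values
    targets = torch.arange(g_a.shape[0])     # positive pair on the diagonal
    return F.cross_entropy(scores, targets)  # -E[f(a,b) - log sum exp f]

g_a, g_b = torch.randn(8, 64), torch.randn(8, 64)
loss = info_nce(g_a, g_b)
```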

Page 16: Large Pretrained Models - nlp.cs.hku.hk

Mutual Information

$$\mathbb{E}_{p(a,b)}\!\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in B} \exp f_\theta(a, \tilde{b}) \right]$$

Cross Entropy (Softmax)

[Diagram: a view $a$ is scored against candidate views $b$ (e.g. the words “Hope”, “Fear”, …) under the softmax.]

Page 17: Large Pretrained Models - nlp.cs.hku.hk

Masked Language Modeling

View $a$ — corrupted context of word $i$, encoded by $g_\omega(a)$

View $b$ — word $i$ itself, encoded by $g_\psi(b)$

[Diagram: a Transformer over the masked sentence “What [MASK] you want ?”; the representation at a masked position is scored against the target word “do”.]

Page 18: Large Pretrained Models - nlp.cs.hku.hk

Next Sentence Prediction

[Diagram: a Transformer over “[CLS] I feel like eating ramen [SEP] What do you want ? [SEP]” feeding the NSP head.]

Binary Classification — “local” NCE (Gutmann and Hyvärinen, 2012)

“global” NCE: the same sentence is instead scored against $|\tilde{B}|$ candidate pairings, each encoded by its own Transformer pass, with normalization across all of them.

[Diagram: $|\tilde{B}|$ Transformer copies, one per candidate [CLS] … [SEP] pair.]

Page 19: Large Pretrained Models - nlp.cs.hku.hk

Connections with Computer Vision

Deep InfoMax (DIM; Hjelm et al., 2019)

Page 20: Large Pretrained Models - nlp.cs.hku.hk

Type of Architecture

Encoders, Decoders, Encoder-Decoders

Parameters are what we get from the pretraining process.

Pros for the “encoders” architecture:

Gets bidirectional context.

Easy to use in language understanding tasks!

Page 21: Large Pretrained Models - nlp.cs.hku.hk

BERT for Understanding

BERT

<CLS> This must be the greatest movie ever !

Positive / Negative
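
A rough sketch of this fine-tuning setup (`encoder` stands in for any pretrained BERT-style module returning per-token states; names and dimensions are placeholder assumptions):

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Pretrained encoder + tiny MLP head on the <CLS> position."""

    def __init__(self, encoder, hidden_dim=768, num_classes=2):
        super().__init__()
        self.encoder = encoder  # pretrained: hundreds of millions of params
        self.head = nn.Linear(hidden_dim, num_classes)  # the cheap part

    def forward(self, token_ids):
        states = self.encoder(token_ids)  # (batch, seq_len, hidden_dim)
        cls = states[:, 0]                # representation of <CLS>
        return self.head(cls)             # logits: positive / negative

# Fine-tuning step sketch: loss = nn.CrossEntropyLoss()(model(x), labels)
```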

Page 22: Large Pretrained Models - nlp.cs.hku.hk

BERT for Generation

[Diagram: BERT is fed a sequence of nine <MASK> tokens and predicts the first word, “What”.]

Page 23: Large Pretrained Models - nlp.cs.hku.hk

BERT for Generation

[Diagram: the first <MASK> is replaced with “What”, the sequence is fed to BERT again, and the next word “do” is predicted.]

The input has changed, so the representations of every position must be recomputed at each step.

Not a very good idea…

Page 24: Large Pretrained Models - nlp.cs.hku.hk

Pretrained Models

— pretraining objective, e.g. $E_{p(x_i, \hat{x}_i)}\left[ p(x_i \mid \hat{x}_i) \right]$

— neural representation learner

— type of architecture

Page 25: Large Pretrained Models - nlp.cs.hku.hk

GPT (Generative Pretrained Transformer)

Radford et al., 2018

Decoders

Page 26: Large Pretrained Models - nlp.cs.hku.hk

Transformer as Decoder

Example: “Happy mid autumn festival”

Need to prevent attention to future words: causal attention.

[Diagram: inputs “<s> Happy mid autumn” predict targets “Happy mid autumn festival”; each position attends only to itself and earlier positions.]
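
A minimal sketch of how causal attention is typically implemented (illustrative, not the lecture's code): future positions receive $-\infty$ scores before the softmax, so their attention weights are exactly zero.

```python
import torch
import torch.nn.functional as F

seq_len = 4  # <s> Happy mid autumn
scores = torch.randn(seq_len, seq_len)  # raw attention scores

# Upper-triangular mask: position t may not look at positions > t.
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = F.softmax(scores, dim=-1)  # zero weight on future positions
print(weights)  # row t has nonzero entries only at columns <= t
```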

Page 27: Large Pretrained Models - nlp.cs.hku.hk

GPT (Generative Pretrained Transformer)

Radford et al., 2018

[Animation: the Transformer maps the previous context, through an embedding lookup table, to a distribution over the next word. GIF credit: Lena Voita]

Page 28: Large Pretrained Models - nlp.cs.hku.hk

GPT for Understanding

GPT

This must be the greatest movie ever !

Positive / Negative

Page 29: Large Pretrained Models - nlp.cs.hku.hk

GPT for Generation

GPT

This must be the greatest movie

ever

Page 30: Large Pretrained Models - nlp.cs.hku.hk

GPT for Generation

GPT

This must be the greatest movie ever

!

Just “grow” the transformer!
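
A sketch of that left-to-right loop (`gpt_logits` is a hypothetical stand-in for any decoder-only model; greedy argmax is just one decoding choice):

```python
import torch

def generate(gpt_logits, prompt_ids, num_steps, eos_id=None):
    """Greedy autoregressive decoding.

    gpt_logits(ids) -> (len(ids), vocab) next-token logits; a stand-in
    for a decoder-only Transformer.
    """
    ids = list(prompt_ids)
    for _ in range(num_steps):
        logits = gpt_logits(torch.tensor(ids))  # (seq_len, vocab)
        next_id = int(logits[-1].argmax())      # most likely next token
        ids.append(next_id)                     # "grow" the sequence
        if next_id == eos_id:
            break
    return ids
```

Because of causal attention, the states of earlier positions never change, which is why the sequence can simply keep growing (and why implementations cache those states).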

Page 31: Large Pretrained Models - nlp.cs.hku.hk

T5 (Text-to-Text Transfer Transformer)

Raffel et al., 2020

Encoder-Decoders

Page 32: Large Pretrained Models - nlp.cs.hku.hk

T5 (Text-to-Text Transfer Transformer)

Raffel et al., 2020

Original text: Thank you for inviting me to your party last week.

Input: Thank you <X> me to your party <Y> week.

Target: <X> for inviting <Y> last <Z>
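
A simplified sketch of this span-corruption preprocessing (single short spans and whitespace tokens are simplifications of T5's actual scheme):

```python
import random

SENTINELS = ["<X>", "<Y>"]  # T5 uses <extra_id_0>, <extra_id_1>, ...

def span_corrupt(tokens, corrupt_rate=0.15):
    """Replace random spans with sentinels; target lists the dropped spans."""
    inp, tgt, s, i = [], [], 0, 0
    while i < len(tokens):
        if random.random() < corrupt_rate and s < len(SENTINELS):
            span_len = random.choice([1, 2])  # short random spans
            tgt += [SENTINELS[s]] + tokens[i:i + span_len]
            inp.append(SENTINELS[s])
            s += 1
            i += span_len
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append("<Z>")  # T5 closes the target with a final sentinel
    return " ".join(inp), " ".join(tgt)

text = "Thank you for inviting me to your party last week .".split()
x, y = span_corrupt(text)
print(x)  # e.g. Thank you <X> me to your party <Y> week .
print(y)  # e.g. <X> for inviting <Y> last <Z>
```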

Page 33: Large Pretrained Models - nlp.cs.hku.hk

T5 (Text-to-Text Transfer Transformer)

Raffel et al., 2020


Page 35: Large Pretrained Models - nlp.cs.hku.hk

ELMo (Embeddings from Language Models)

Encoders

Bidirectional Language Model

Peters et al., 2018


Page 37: Large Pretrained Models - nlp.cs.hku.hk

BART (Denoising Sequence-to-Sequence Pre-training)

Lewis et al., 2019

Encoder-Decoders


Page 39: Large Pretrained Models - nlp.cs.hku.hk

InfoWord

Kong et al., 2019

[Diagram: one Transformer encodes the full sentence as a global view; Transformer copies encode local views (spans), and pairs are contrasted as “Real” vs “Fake”.]