TRANSCRIPT
Training Neural Networks with Mixed Precision
MICHAEL CARILLI
CHRISTIAN SAROFEEN
MICHAEL RUBERRY
BEN BARSDELL
THIS TALK
Using mixed precision and Volta, your networks can:
1. run 2-4x faster
2. use half the memory
3. be just as powerful
with no architecture change.
TALK OVERVIEW
1. Introduction to Mixed Precision Training
2. Mixed Precision Example in PyTorch
3. Appendix: Mixed Precision Example in TensorFlow
REFERENCES
• Paulius Micikevicius's talk "Training Neural Networks with Mixed
Precision: Theory and Practice" (GTC 2018, S8923).
• "Mixed Precision Training" (ICLR 2018).
• "Mixed-Precision Training of Deep Neural Networks" (NVIDIA Parallel Forall blog).
• "Training with Mixed Precision" (NVIDIA User Guide).
(Further Reading)
SINGLE VS HALF PRECISION
FP32: 32-bit precision, 1x compute throughput, 1x memory throughput, 1x size
FP16 + Volta: 16-bit precision, 8x compute throughput, 2x memory throughput, 1/2x size
MIXED PRECISION APPROACH
Problem → solution:
• Imprecise weight updates → "master" weights in FP32
• Gradients may underflow → loss (gradient) scaling
• Reductions (sums, etc.) lose precision → accumulate in FP32
SUCCESS STORIES: SPEED
• NVIDIA Sentiment Analysis: 4.5x speedup*
• FAIRSeq: 4x speedup
• GNMT: 2x speedup
• ResNet-152: 2.2x speedup
*single Volta, mixed precision vs. pure FP32 (PyTorch and TensorFlow models)
SUCCESS STORIES: ACCURACY
ILSVRC12 classification top-1 accuracy*

Model                      FP32     Mixed Precision**
AlexNet                    56.77%   56.93%
VGG-D                      65.40%   65.43%
GoogLeNet (Inception v1)   68.33%   68.43%
Inception v2               70.03%   70.02%
Inception v3               73.85%   74.13%
ResNet-50                  75.92%   76.04%

Mixed precision can match FP32 with no change in hyperparameters.

*Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training" (ICLR 2018)
**Same hyperparameters and learning rate schedule as FP32.
SUCCESS STORIES: ACCURACY
Progressive Growing of GANs: generates 1024x1024 face images
http://research.nvidia.com/publication/2017-10_Progressive-Growing-of
• No perceptible difference between FP32 and mixed-precision training.
• Loss scaling: separate scaling factors for the generator and discriminator (you are training 2 networks).
• Automatic loss scaling greatly simplified training; gradient statistics shift drastically when the image resolution is increased.
THIS TALK (REPRISE)
Using mixed precision and Volta, your networks can:
1. run 2-4x faster
2. use half the memory
3. be just as powerful
with no architecture change.
TENSOR CORE PERFORMANCE TIPS
• Convolutions: batch size, input channels, and output channels should be multiples of 8.
• GEMMs: for A x B where A has size (N, M) and B has size (M, K), N, M, and K should be multiples of 8.
• Fully connected layers are GEMMs: batch size, input features, and output features should be multiples of 8.
Libraries (cuDNN, cuBLAS) are optimized for Tensor Cores. (A padding sketch follows below.)
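As an illustration of the multiples-of-8 rule, here is a minimal sketch; pad8 is a hypothetical helper, not from the slides, and real data would need zero-padding up to the padded sizes:

import torch

def pad8(n):
    # Round n up to the next multiple of 8 (hypothetical helper).
    return ((n + 7) // 8) * 8

bsz, d_in, d_out = 250, 500, 1000            # 250 and 500 are not multiples of 8
bsz, d_in = pad8(bsz), pad8(d_in)            # -> 256, 504
layer = torch.nn.Linear(d_in, pad8(d_out)).cuda().half()
x = torch.randn(bsz, d_in).cuda().half()     # zero-pad real inputs up to d_in
y = layer(x)                                 # GEMM dims now satisfy the Tensor Core rule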
TENSOR CORE PERFORMANCE TIPS
How can I make sure Tensor Cores were used? Run one iteration with nvprof and look for "884" kernels:
import torch
import torch.nn

# `in` is a Python keyword, so use explicit names for the dimensions.
bsz, in_features, out_features = 256, 1024, 2048
tensor = torch.randn(bsz, in_features).cuda().half()
layer = torch.nn.Linear(in_features, out_features).cuda().half()
layer(tensor)

Running with $ nvprof python test.py:
...
37.024us 1 37.024us 37.024us 37.024us volta_fp16_s884gemm_fp16…
TENSOR CORE PERFORMANCE TIPS
If your data and layer sizes are constant each iteration, try
import torch
torch.backends.cudnn.benchmark = True
...
This enables cuDNN's autotuner. On the first iteration it tries many algorithms, then uses the fastest one in later iterations.
See https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936
PYTORCH EXAMPLE
A SIMPLE NETWORK
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda()
y = Variable(torch.randn(N, D_out)).cuda()
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
CONVERT TO FP16
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This may be all you need to do!
Sometimes, though, you need to use mixed precision.
MIXED PRECISION APPROACH
Problem: imprecise weight updates → solution: "master" weights in FP32.
IMPRECISE WEIGHT UPDATES
When update/param < 2^-11, updates have no effect. (FP16 has a 10-bit mantissa, so for a parameter near 1.0, increments smaller than about 2^-11 round away to nothing.)

1 + 0.0001 = ?

FP32:
param = torch.cuda.FloatTensor([1.0])
print(param + 0.0001)    # 1.0001

FP16:
param = torch.cuda.HalfTensor([1.0])
print(param + 0.0001)    # 1
IMPRECISE WEIGHT UPDATES
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()    # small FP16 weight updates may be lost here
MIXED SOLUTION: FP32 MASTER WEIGHTS
[Diagram: FP16 weights → 1. forward pass → FP16 loss → backward → FP16 gradients → copy → FP32 master gradients → 4. apply → FP32 master weights → 5. copy → FP16 weights]
MIXED SOLUTION: FP32 MASTER WEIGHTS (Helper Functions)

def prep_param_lists(model):
    model_params = [p for p in model.parameters() if p.requires_grad]
    master_params = [p.clone().detach().float() for p in model_params]
    for p in master_params:
        p.requires_grad = True
    return model_params, master_params
Note: model<->master param and gradient copies act on .data members; they are not recorded by PyTorch's autograd system.
MIXED SOLUTION: FP32 MASTER WEIGHTS
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
model_params, master_params = prep_param_lists(model)
optimizer = torch.optim.SGD(master_params, lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    model.zero_grad()
    loss.backward()
    optimizer.step()    # the optimizer updates the FP32 master params

Zero the FP16 grads by calling model.zero_grad() instead of optimizer.zero_grad().
MIXED SOLUTION: FP32 MASTER WEIGHTS (Helper Functions)

Copy the FP32 master weights to the FP16 weights:

def master_params_to_model_params(model_params, master_params):
    for model, master in zip(model_params, master_params):
        model.data.copy_(master.data)
MIXED SOLUTION: FP32 MASTER WEIGHTS
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
model_params, master_params = prep_param_lists(model)
optimizer = torch.optim.SGD(master_params, lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    model.zero_grad()
    loss.backward()
    optimizer.step()
    master_params_to_model_params(model_params, master_params)    # copy FP32 master weights -> FP16 weights
MIXED SOLUTION: FP32 MASTER WEIGHTS (Helper Functions)

Copy the FP16 gradients to the FP32 master gradients:

def model_grads_to_master_grads(model_params, master_params):
    for model, master in zip(model_params, master_params):
        if master.grad is None:
            master.grad = Variable(master.data.new(*master.data.size()))
        master.grad.data.copy_(model.grad.data)
MIXED SOLUTION: FP32 MASTER WEIGHTS
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
model_params, master_params = prep_param_lists(model)
optimizer = torch.optim.SGD(master_params, lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    model.zero_grad()
    loss.backward()
    model_grads_to_master_grads(model_params, master_params)      # copy FP16 grads -> FP32 master grads
    optimizer.step()
    master_params_to_model_params(model_params, master_params)    # copy FP32 master weights -> FP16 weights
MIXED SOLUTION: FP32 MASTER WEIGHTS
[Diagram: the complete master-weights flow, as shown earlier.]
MIXED PRECISION APPROACH
Problem: gradients may underflow → solution: loss (gradient) scaling.
GRADIENTS MAY UNDERFLOW
N, D_in, D_out = 64, 1024, 512
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
model_params, master_params = prep_param_lists(model)
optimizer = torch.optim.SGD(master_params, lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    model.zero_grad()
    loss.backward()
    model_grads_to_master_grads(model_params, master_params)
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
[Figure: histograms of activation, weight, weight-gradient, and activation-gradient magnitudes, plotted against the range representable in FP16 (~40 powers of 2). Gradients are small and don't use much of the FP16 range: ~15 powers of 2 of the FP16 range go unused by gradients.]
Loss Scaling
1. Multiply the loss by some constant S.
2. Call backward() on the scaled loss. By the chain rule, the gradients are also scaled by S. This preserves small gradient values.
3. Unscale the gradients before the update step().
MIXED SOLUTION: LOSS (GRADIENT) SCALING

N, D_in, D_out = 64, 1024, 512
scale_factor = 128.0
...

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    scaled_loss = scale_factor * loss.float()    # gradients are now rescaled to be representable
    model.zero_grad()
    scaled_loss.backward()
    model_grads_to_master_grads(model_params, master_params)
    for param in master_params:
        param.grad.data.mul_(1.0 / scale_factor)  # the FP32 master gradients must be "descaled"
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
MASTER WEIGHTS
[Diagram: the master-weights flow again, without loss scaling, for comparison with the next slide.]
MASTER WEIGHTS + LOSS SCALING
[Diagram: FP16 weights → 1. forward pass → FP32 loss → 2. loss scaling → scaled FP32 loss → backward → scaled FP16 gradients → copy → scaled FP32 gradients → 5. remove scale (+clip, etc.) → FP32 gradients → 6. apply → FP32 master weights → 7. copy → FP16 weights]
LOSS SCALING FAQ
1. Does loss scaling require retuning the learning rate?
No. Loss scaling is orthogonal to the learning rate, so changing the loss scale should not require retuning other hyperparameters. (Gradients are scaled by S during backward and unscaled by 1/S before the update, so the applied update is unchanged; see the numeric check after this FAQ.)
2. Can the loss scale be adjusted each iteration?
Yes. For example, use a larger loss scale later in training, when gradients are smaller.
Dynamic loss scaling adjusts the loss scale automatically (more on that later).
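To see the claim in question 1 concretely, here is a small check (a sketch using the modern tensor API rather than Variable): scaling the loss by S scales dL/dx by S via the chain rule, and unscaling restores the original gradient.

import torch

S = 128.0
x = torch.randn(10, requires_grad=True)

(x ** 2).sum().backward()
g = x.grad.clone()                 # unscaled gradient

x.grad.zero_()
(S * (x ** 2).sum()).backward()    # scaled loss
x.grad.mul_(1.0 / S)               # unscale before the update
print(torch.allclose(g, x.grad))   # True: the applied update is unchanged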
MIXED PRECISION APPROACH
Problem: reductions (sums, etc.) need full precision → solution: accumulate in FP32.
REDUCTIONS MAY OVERFLOW
Reductions like sum() can overflow if a value > 65,504 (the FP16 max) is produced. Behavior may depend on the PyTorch version. In PyTorch 0.5:

a = torch.cuda.HalfTensor(4094).fill_(16.0)
a.sum()    # 65,504 (4094 * 16 is exactly the FP16 max)

b = torch.cuda.HalfTensor(4095).fill_(16.0)
b.sum()    # inf
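A simple workaround when you only need the reduced value, as a sketch: cast to FP32 first, so accumulation happens in full precision.

b = torch.cuda.HalfTensor(4095).fill_(16.0)
b.float().sum()    # 65,520.0, accumulated in FP32 instead of overflowing to inf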
REDUCTIONS MAY OVERFLOW

N, D_in, D_out = 64, 1024, 512
scale_factor = 128.0
...

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)    # mse_loss is a reduction
    scaled_loss = scale_factor * loss.float()
    model.zero_grad()
    scaled_loss.backward()
    model_grads_to_master_grads(model_params, master_params)
    for param in master_params:
        param.grad.data.mul_(1.0 / scale_factor)
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
MIXED SOLUTION: ACCUMULATE IN FP32

N, D_in, D_out = 64, 1024, 512
scale_factor = 128.0
...

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())    # loss is now FP32 (model grads are still FP16)
    scaled_loss = scale_factor * loss
    model.zero_grad()
    scaled_loss.backward()
    model_grads_to_master_grads(model_params, master_params)
    for param in master_params:
        param.grad.data.mul_(1.0 / scale_factor)
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
OTHER REDUCTIONS
BatchNorm involves a reduction, so it may also need to be done in FP32:
def BN_convert_float(module):
    # Recursively convert BatchNorm modules (and only those) back to FP32.
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module.float()
    for child in module.children():
        BN_convert_float(child)
    return module
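Typical usage, as a sketch (the model here is illustrative; this relies on batch norm accepting FP16 inputs with FP32 parameters, which held for the cuDNN versions of this era):

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
).cuda().half()
model = BN_convert_float(model)    # conv runs in FP16; BN params and statistics stay FP32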
MIXED PRECISION APPROACH
Problem → solution:
• Imprecise weight updates → "master" weights in FP32
• Gradients may underflow → loss (gradient) scaling
• Reductions (sums, etc.) lose precision → accumulate in FP32
ADDITIONAL CONSIDERATIONS
Checkpointing (a sketch follows below):
1. Save the master weights.
2. Save the gradient scale factor.
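A minimal sketch of what that could look like with the helpers above; the dict keys and filename are illustrative, not a prescribed format:

checkpoint = {
    'master_params': [p.data for p in master_params],   # FP32 master weights
    'scale_factor': scale_factor,
    'optimizer': optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pt')

# To restore: reload the FP32 master weights, then refresh the FP16 model copy.
checkpoint = torch.load('checkpoint.pt')
for master, saved in zip(master_params, checkpoint['master_params']):
    master.data.copy_(saved)
scale_factor = checkpoint['scale_factor']
optimizer.load_state_dict(checkpoint['optimizer'])
master_params_to_model_params(model_params, master_params)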
ADDITIONAL CONSIDERATIONS
Dynamic loss scaling
Prevent gradient UNDERflow by using the highest loss scale S that does not cause gradient OVERflow:
1. Start with a large loss scale (e.g., S = 2^32).
2. After each iteration, check whether the gradients overflowed (NaN or +/- Inf).
3. If the gradients overflowed, discard that iteration by skipping optimizer.step().
4. If the gradients overflowed, reduce S for the next iteration (e.g., S = S/2).
5. If N (e.g., 1000) iterations pass with no overflow, increase S again (S = S*2).

A minimal sketch of this recipe follows below.
More detail: https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler
Example: https://github.com/NVIDIA/apex/blob/ea93767d22818bdd88ae738a8c7cf62b49a8fdaf/apex/fp16_utils/loss_scaler.py#L134-L186
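The sketch below reuses the model, optimizer, and helper functions from the PyTorch example earlier; it is an illustration of the recipe, not APEX's actual implementation (see the links above for that):

def grads_overflowed(params):
    # Step 2: check the FP32 master grads for NaN or Inf.
    for p in params:
        s = float(p.grad.data.abs().sum())
        if s == float('inf') or s != s:    # Inf, or NaN (NaN != NaN)
            return True
    return False

loss_scale = 2.0 ** 32    # 1. start with a large loss scale
good_iters, N = 0, 1000

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())
    model.zero_grad()
    (loss * loss_scale).backward()
    model_grads_to_master_grads(model_params, master_params)
    if grads_overflowed(master_params):
        loss_scale /= 2       # 4. reduce S for the next iteration...
        good_iters = 0
        continue              # 3. ...and skip optimizer.step() this iteration
    for param in master_params:
        param.grad.data.mul_(1.0 / loss_scale)
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
    good_iters += 1
    if good_iters >= N:       # 5. N clean iterations: try a larger scale again
        loss_scale *= 2
        good_iters = 0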
NVIDIA MIXED PRECISION TOOLS

APEX: A PyTorch Extension
www.github.com/nvidia/apex
Documentation: https://nvidia.github.io/apex/fp16_utils.html

• All utility functions in this talk (model_grads_to_master_grads, etc.)
• FP16_Optimizer: optimizer wrapper that automatically manages loss scaling + master params (usage sketch below)
  • Closure-safe
  • Option to automatically manage dynamic loss scaling
  • Compatible with PyTorch distributed training
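As a sketch of the intended usage, per the FP16_Optimizer documentation above (check the docs for the exact API and options):

from apex.fp16_utils import FP16_Optimizer

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())
    optimizer.zero_grad()
    optimizer.backward(loss)    # replaces loss.backward(); applies loss scaling internally
    optimizer.step()            # maintains master params; skips the update on overflow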
TENSORFLOW EXAMPLE
TENSORFLOW MIXED PRECISION SUPPORT
TensorFlow supports mixed precision using the tf.float32 and tf.float16 data types:
• Reduce memory and bandwidth by using float16 tensors
• Improve compute performance by using float16 matmuls and convolutions
• Maintain training accuracy by using mixed precision
MIXED PRECISION TRAINING: CONVERSION STEPS
1. Convert the model to the float16 data type
2. Use float32 for certain ops to avoid overflow or underflow
   • Reductions (e.g., norm, softmax)
   • Range-expanding math functions (e.g., exp, pow)
3. Use float32 for weight storage to avoid imprecise training updates
4. Apply loss scaling to avoid gradient underflow
EXAMPLE (PART 1)
Simple CNN model:

import tensorflow as tf
import numpy as np

def build_forward_model(inputs):
    _, _, h, w = inputs.get_shape().as_list()
    top_layer = inputs
    top_layer = tf.layers.conv2d(top_layer, 64, 7, use_bias=False,
                                 data_format='channels_first', padding='SAME')
    top_layer = tf.contrib.layers.batch_norm(top_layer, data_format='NCHW', fused=True)
    top_layer = tf.layers.max_pooling2d(top_layer, 2, 2, data_format='channels_first')
    top_layer = tf.reshape(top_layer, (-1, 64 * (h // 2) * (w // 2)))
    top_layer = tf.layers.dense(top_layer, 128, activation=tf.nn.relu)
    return top_layer
Notes on the model above:
• Convolutions support Tensor Cores
• Matrix multiplications support Tensor Cores
• The majority of TF ops support the float16 data type
• Batch norm supports mixed precision
EXAMPLE (PART 2)
def build_training_model(inputs, labels, nlabel):
    top_layer = build_forward_model(inputs)
    logits = tf.layers.dense(top_layer, nlabel, activation=None)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    gradvars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(gradvars)
    return inputs, labels, loss, train_op

(We'll come back to this.)
EXAMPLE (PART 3)
Training boilerplate:

nchan, height, width, nlabel = 3, 224, 224, 1000
inputs = tf.placeholder(tf.float32, (None, nchan, height, width))
labels = tf.placeholder(tf.int32, (None,))
inputs, labels, loss, train_op = build_training_model(inputs, labels, nlabel)
batch_size = 128
sess = tf.Session()
inputs_np = np.random.random(size=(batch_size, nchan, height, width)).astype(np.float32)
labels_np = np.random.randint(nlabel, size=(batch_size,)).astype(np.int32)
sess.run(tf.global_variables_initializer())
for step in range(20):
    loss_np, _ = sess.run([loss, train_op],
                          {inputs: inputs_np, labels: labels_np})
    print("Loss =", loss_np)
ORIGINAL GRAPH
[Diagram: input → forward model → loss → backward model → weight gradients → update → weights, all in float32.]
STEP 1: CONVERSION TO FP16
def build_training_model(inputs, labels, nlabel):
    top_layer = build_forward_model(inputs)
    logits = tf.layers.dense(top_layer, nlabel, activation=None)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    gradvars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(gradvars)
    return inputs, labels, loss, train_op
With the cast added:
def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)
    top_layer = build_forward_model(inputs)
    logits = tf.layers.dense(top_layer, nlabel, activation=None)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    gradvars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(gradvars)
    return inputs, labels, loss, train_op
With that single cast, the gradients and variables are float16 too.
[Diagram: the input is cast to fp16, so the forward model, loss, backward model, weights, and weight gradients all run in float16.]
STEP 2: USE FP32 TO COMPUTE THE LOSS
def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)
    top_layer = build_forward_model(inputs)
    logits = tf.layers.dense(top_layer, nlabel, activation=None)
    logits = tf.cast(logits, tf.float32)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    gradvars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(gradvars)
    return inputs, labels, loss, train_op
[Diagram: the logits are cast back to fp32 so the loss is computed in float32; the rest of the model stays float16.]
NEW GRAPH
[Diagram: as above, with the float32 loss gradient cast back to fp16 for the float16 backward model.]
STEP 3: FP32 MASTER WEIGHTS
Helper function:

def float32_variable_storage_getter(getter, name, shape=None, dtype=None,
                                    initializer=None, regularizer=None,
                                    trainable=True, *args, **kwargs):
    """Custom variable getter that forces trainable variables to be stored in
    float32 precision and then casts them to the training precision."""
    storage_dtype = tf.float32 if trainable else dtype
    variable = getter(name, shape, dtype=storage_dtype,
                      initializer=initializer, regularizer=regularizer,
                      trainable=trainable, *args, **kwargs)
    if trainable and dtype != tf.float32:
        variable = tf.cast(variable, dtype)
    return variable
Use of the helper function:

def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)
    with tf.variable_scope('fp32_vars', custom_getter=float32_variable_storage_getter):
        top_layer = build_forward_model(inputs)
        logits = tf.layers.dense(top_layer, nlabel, activation=None)
    logits = tf.cast(logits, tf.float32)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    gradvars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(gradvars)
    return inputs, labels, loss, train_op
With the custom getter in place, the gradients and variables are now float32, but the gradient computations still use float16.
[Diagram: weights are stored in float32 and cast to fp16 for the forward and backward models; the weight gradients are cast back to fp32 and the update is applied to the float32 weights.]
STEP 4: LOSS (GRADIENT) SCALING
Avoid gradient underflow:

def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)
    with tf.variable_scope('fp32_vars', custom_getter=float32_variable_storage_getter):
        top_layer = build_forward_model(inputs)
        logits = tf.layers.dense(top_layer, nlabel, activation=None)
    logits = tf.cast(logits, tf.float32)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    loss_scale = 128.0  # Value may need tuning depending on the model
    gradients, variables = zip(*optimizer.compute_gradients(loss * loss_scale))
    gradients = [grad / loss_scale for grad in gradients]
    train_op = optimizer.apply_gradients(zip(gradients, variables))
    return inputs, labels, loss, train_op
Multiplying the loss by loss_scale raises the exponent during the backward pass in float16; dividing the gradients (now float32) by loss_scale returns them to the correct exponent.
[Diagram: the float32 loss is multiplied by loss_scale before the backward pass, and the weight gradients are divided by loss_scale before the update.]
OPTIONAL EXTRA: GRADIENT CLIPPING
def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)
    with tf.variable_scope('fp32_vars', custom_getter=float32_variable_storage_getter):
        top_layer = build_forward_model(inputs)
        logits = tf.layers.dense(top_layer, nlabel, activation=None)
    logits = tf.cast(logits, tf.float32)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    loss_scale = 128.0  # Value may need tuning depending on the model
    gradients, variables = zip(*optimizer.compute_gradients(loss * loss_scale))
    gradients = [grad / loss_scale for grad in gradients]
    gradients, _ = tf.clip_by_global_norm(gradients, 5.0)  # clip after removing the loss scale
    train_op = optimizer.apply_gradients(zip(gradients, variables))
    return inputs, labels, loss, train_op
ALL TOGETHER
The final build_training_model above combines all four steps: cast the inputs to float16, compute the loss in float32, store master weights in float32 via the custom getter, and apply loss scaling (plus optional gradient clipping).
CONCLUSIONS
Mixed precision training is now well supported by deep learning frameworks
Conversion requires < 10 lines of code for most training scripts
For more info and frameworks, see our Mixed Precision Training guide:
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html