checkfreq - cs.utexas.edu

44
CheckFreq Frequent, Fine-Grained DNN Checkpointing Jayashree Mohan, Amar Phanishayee, Vijay Chidambaram 1

Upload: others

Post on 14-Jan-2022

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CheckFreq - cs.utexas.edu

CheckFreqFrequent, Fine-Grained DNN Checkpointing

Jayashree Mohan, Amar Phanishayee, Vijay Chidambaram

1

Page 2: CheckFreq - cs.utexas.edu

Deep Neural Networks ( DNNs )

• Widely used for a variety of tasks

Image Classification

Cat Dog

Language Translation 2

Duck

Dog

Dog

Text To Speech

Object detection

Page 3: CheckFreq - cs.utexas.edu

DNN Training

Input Dataset

Minibatch Forward pass Backward pass Weight update

Iteration

Model

Wi, bi

Epoch = One complete pass over the dataset

Wi, bi

Learned model parameters

nepochs

DNN training is compute-intensive and time-consuming!

3

Page 4: CheckFreq - cs.utexas.edu

DNN Checkpointing

GPU time

Epoch boundaries

In memoryModel – random

weightsModel – learned

weights

In memoryMigration/Crash

Wasted GPU cycles

Any interruption can wipe out the model parameters learned so far in memory, restarting this expensive process!

4

Page 5: CheckFreq - cs.utexas.edu

DNN Checkpointing

GPU time

Epoch boundaries

In memoryModel – random

weightsModel – learned

weights

In memory

Checkpoint

Migration/Crash

Wasted GPU cycles

• Learned model parameters are written to persistent storage every so often during training for fault-tolerance:• The VMs may migrate, expire, or crash (e.g., spot instances), jobs may migrate (e.g.,

shared GPU clusters)

5

Page 6: CheckFreq - cs.utexas.edu

State of DNN Checkpointing Today

• Synchronous checkpoints => Large checkpoint stalls

• Manual checkpointing frequency => Typically performed at epoch boundaries

• But epoch times are increasing due to higher computational complexity of models and increasing dataset sizes

• Frequent interruptions : for e.g. preemptions in low-cost spot VMs

Need fine-grained, iteration-level checkpointing 6

Page 7: CheckFreq - cs.utexas.edu

Challenges for fine-grained checkpointing

Checkpointing frequency

How often to checkpoint?

Checkpointstalls

How to minimize the cost of a checkpoint?

Data invariant

How to resume correctly from a checkpoint?

• Every epoch processes all the items in the dataset exactly once, in a random shuffled order• Must hold when training resumes after an interruption in the middle of an epoch

7

Page 8: CheckFreq - cs.utexas.edu

Challenges for fine-grained checkpointing

Checkpointing frequency

Checkpointstalls

Data invariant

How often to checkpoint?

How to minimize the cost of a checkpoint?

How to resume correctly from a checkpoint?

Our work addresses these challenges to provide an automated, frequent checkpointing framework for DNN training

8

Page 9: CheckFreq - cs.utexas.edu

CheckFreq

• Fine-grained, automated checkpointing framework for DNN training

• Strikes a balance between low overhead and high frequency of checkpointing => new checkpointing policy and mechanism

• Exploits the DNN computational model to perform pipelined in-memory snapshots, GPU-based snapshots, and adaptive tuning of checkpointing frequency

• CheckFreq reduces the recovery time for popular DNNs from hours to seconds during job interruptions

Source code : https://github.com/msr-fiddle/CheckFreq

9

Page 10: CheckFreq - cs.utexas.edu

Outline

• Background and Motivation

• CheckFreq – Design• Checkpointing Mechanism

• Checkpointing Policy

• Evaluation

10

Page 11: CheckFreq - cs.utexas.edu

CheckFreq Design

How to perform correct, low-cost checkpointing?

Mechanism Policy

When to checkpoint?

2-phase DNN-aware checkpointing

Resumable data iterator

Systematic online profiling

Adaptive rate tuning

Low checkpoint stalls

Maintain data invariant

Initial checkpointing frequency

Manages interference from other jobs11

Page 12: CheckFreq - cs.utexas.edu

CheckFreq Design

How to perform correct, low-cost checkpointing?

Mechanism Policy

When to checkpoint?

2-phase DNN-aware checkpointing

Resumable data iterator

Systematic online profiling

Adaptive rate tuning

Low checkpoint stalls

Maintain data invariant

Initial checkpointing frequency

Manages interference from other jobs

Recovery Guarantees

12

Page 13: CheckFreq - cs.utexas.edu

Outline

• Background and Motivation

• CheckFreq – Design• Checkpointing Mechanism

• Checkpointing Policy

• Evaluation

13

Page 14: CheckFreq - cs.utexas.edu

2-Phase Checkpointing

• Synchronous checkpointing introduces checkpoint stalls => Runtime overhead

• Low-cost checkpointing mechanism that is split into a pipelined snapshot() and persist() phase

Snapshot() : Serialize and copy into a user-space buffer

Persist() : Write out the serialized contents to disk

14

Page 15: CheckFreq - cs.utexas.edu

Example

• Consider a policy that checkpoints every three iterations.

15

Page 16: CheckFreq - cs.utexas.edu

Weight update

Forward pass

Backward pass

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Example1 1 1Training (GPU)

16

Page 17: CheckFreq - cs.utexas.edu

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

1 1 1

11

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Example

17

Page 18: CheckFreq - cs.utexas.edu

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

1 1 1

11

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Example

18

Page 19: CheckFreq - cs.utexas.edu

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Example

19

Page 20: CheckFreq - cs.utexas.edu

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Training (GPU)

Checkpoint (CPU)

(b) Only persist() pipelining

1 1 1

Example

20

Page 21: CheckFreq - cs.utexas.edu

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Training (GPU)

Checkpoint (CPU)

(b) Only persist() pipelining

1 1 1

1

Example

21

Page 22: CheckFreq - cs.utexas.edu

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Training (GPU)

Checkpoint (CPU)

(b) Only persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

5 5 5 6 6 6

Example

22

Page 23: CheckFreq - cs.utexas.edu

1 1 1

1

2 2 2Training (GPU)

Checkpoint (CPU)

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

(c) Snapshot() and persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Training (GPU)

Checkpoint (CPU)

(b) Only persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

5 5 5 6 6 6

Example

23

Page 24: CheckFreq - cs.utexas.edu

1 1 1 3 3 3

11

2 2 2 4 4 4Training (GPU)

Checkpoint (CPU)

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

(c) Snapshot() and persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Training (GPU)

Checkpoint (CPU)

(b) Only persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

5 5 5 6 6 6

Example

24

Page 25: CheckFreq - cs.utexas.edu

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

5 5 5 6 6 6Training (GPU)

Checkpoint (CPU)

Weight update

Snapshotting

Disk IO

Forward pass

Backward pass

Checkpoint stall

(c) Snapshot() and persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

Training (GPU)

Checkpoint (CPU)

(a) Baseline : Synchronous checkpointing

Training (GPU)

Checkpoint (CPU)

(b) Only persist() pipelining

1 1 1 3 3 3

11

2 2 2 4 4 4

4 4

5 5 5 6 6 6

Example

25

Page 26: CheckFreq - cs.utexas.edu

GPU-optimized Snapshots

• Cost of serialization and snapshot() is upto 10x lower when done on the GPU

• To further reduce the checkpoint cost, CheckFreq snapshots on the GPU, and asynchronously writes it to CPU memory if it profiles spare memory on the GPU

• If GPU memory is fully utilized, it falls back to pipelined, CPU-side snapshots

26

Page 27: CheckFreq - cs.utexas.edu

Outline

• Background and Motivation

• CheckFreq – Design• Checkpointing Mechanism

• Checkpointing Policy

• Evaluation

27

Page 28: CheckFreq - cs.utexas.edu

Checkpointing policy

• Determines when to initiate a checkpoint

• Checkpoints every k iterations, such that • the cost of one checkpoint can be amortized over k iterations

• Runtime overhead introduced due to checkpointing is within a small user-given percentage of the actual compute time (say 5%)

28

Page 29: CheckFreq - cs.utexas.edu

Systematic Online Profiling

• CheckFreq’s data iterator automatically profiles several iteration-level and checkpoint-specific metrics

Iteration timeTime for weight

updateTime for GPU

snapshot()Time for CPU

snapshot()

Available disk throughput Checkpoint size

Peak GPU memory util

Total GPU memory

Algorithmically determines the checkpointing frequency such that: • Overhead due to checkpoint stalls is within the user-given limit

29

Page 30: CheckFreq - cs.utexas.edu

Outline

• Background and Motivation

• CheckFreq – Design• Checkpointing Mechanism

• Checkpointing Policy

• Evaluation

30

Page 31: CheckFreq - cs.utexas.edu

Experimental Setup

• Checkfreq is integrated with PyTorch• Uses the state-of-the-art NVIDIA DALI data loading library to support

resumability

• Experiments are performed on two different servers from an internal GPU cluster at Microsoft

1. Conf-Volta : Server with eight V100 GPUs (32GiB), with a SSD

2. Conf-Pascal : Server with eight 1080Ti GPUs (11GiB), with a HDD

31

Page 32: CheckFreq - cs.utexas.edu

Models and Experiments

• We evaluate CheckFreq on 7 different DNNs : • ResNet18, ResNet50, ResNext101, DenseNet121, VGG16, InceptionV3 on Imagenet-1k

• Bert-Large pretraining on Wikipedia & BookCorpus dataset

• Experiments to evaluate:

Accuracy implications of data invariant

Checkpoint stalls

Recovery Time

Breakdown of benefits due to pipelining

Adaptive frequency tuning

End-to-end training with interruptions

32

Page 33: CheckFreq - cs.utexas.edu

Models and Experiments

• We evaluate CheckFreq on 7 different DNNs : • ResNet18, ResNet50, ResNext101, DenseNet121, VGG16, InceptionV3 on Imagenet-1k

• Bert-Large pretraining on Wikipedia & BookCorpus dataset

• Experiments to evaluate:

Accuracy implications of data invariant

Checkpoint stalls

Recovery Time

Breakdown of benefits due to pipelining

Adaptive frequency tuning

End-to-end training with interruptions

33

Page 34: CheckFreq - cs.utexas.edu

CheckFreq reduces checkpoint stalls

• Train VGG16 for 300 iterations on Conf-Volta• Checkpointing mechanisms :

• Synchronous• Persist() pipelining only• CheckFreq - Persist() and snapshot() pipelining

• Checkpointing frequency : 15 iterations

34

Page 35: CheckFreq - cs.utexas.edu

CheckFreq reduces checkpoint stalls

0

1

2

3

4

5

1 9

17

25

33

41

49

57

65

73

81

89

97

10

5

11

3

12

1

12

9

13

7

14

5

15

3

16

1

16

9

17

7

18

5

19

3

20

1

20

9

21

7

22

5

23

3

24

1

24

9

25

7

26

5

27

3

28

1

28

9

29

7

Tim

e f

or

a it

era

tio

n (

s)

Iteration #

Synchronous

35

Page 36: CheckFreq - cs.utexas.edu

CheckFreq reduces checkpoint stalls

0

1

2

3

4

5

1 9

17

25

33

41

49

57

65

73

81

89

97

10

5

11

3

12

1

12

9

13

7

14

5

15

3

16

1

16

9

17

7

18

5

19

3

20

1

20

9

21

7

22

5

23

3

24

1

24

9

25

7

26

5

27

3

28

1

28

9

29

7

Tim

e f

or

a it

era

tio

n (

s)

Iteration #

Synchronous

36

Page 37: CheckFreq - cs.utexas.edu

CheckFreq reduces checkpoint stalls

• Performing asynchronous IO reduces checkpoint cost by 2x but still results in significant stalls

0

1

2

3

4

5

1 9

17

25

33

41

49

57

65

73

81

89

97

10

5

11

3

12

1

12

9

13

7

14

5

15

3

16

1

16

9

17

7

18

5

19

3

20

1

20

9

21

7

22

5

23

3

24

1

24

9

25

7

26

5

27

3

28

1

28

9

29

7

Tim

e f

or

a it

era

tio

n (

s)

Iteration #

Synchronous Persist() pipelining

37

Page 38: CheckFreq - cs.utexas.edu

CheckFreq reduces checkpoint stalls

• CheckFreq further reduces stalls by carefully pipelining checkpointing with compute

0

1

2

3

4

5

1 9

17

25

33

41

49

57

65

73

81

89

97

10

5

11

3

12

1

12

9

13

7

14

5

15

3

16

1

16

9

17

7

18

5

19

3

20

1

20

9

21

7

22

5

23

3

24

1

24

9

25

7

26

5

27

3

28

1

28

9

29

7

Tim

e f

or

a it

era

tio

n (

s)

Iteration #

Synchronous Persist() pipelining CheckFreq

38

Page 39: CheckFreq - cs.utexas.edu

Overall Training Overhead

0

10

20

30

40

50

60

70

80

Res50 ResNext Res18 Inception VGG16 DenseNet BERT

Pe

rce

nt

Ove

rhe

adBaseline CheckFreq

39

Page 40: CheckFreq - cs.utexas.edu

Overall Training Overhead

• When the baseline checkpointing mechanism is performed at a frequency chosen by CheckFreq, it introduces 20 – 70% overhead in training time

0

10

20

30

40

50

60

70

80

Res50 ResNext Res18 Inception VGG16 DenseNet BERT

Pe

rce

nt

Ove

rhe

adBaseline CheckFreq

40

Page 41: CheckFreq - cs.utexas.edu

CheckFreq lowers recovery time

ModelEpoch-based

(s)CheckFreq

(s)

Res18

Res50

VGG16

ResNext

DenseNet

Inception

BERT

• Recovery time : Time spent by the model to recover to the same state as it was before interruption

41

Page 42: CheckFreq - cs.utexas.edu

CheckFreq lowers recovery time

ModelEpoch-based

(s)CheckFreq

(s)

Res18 840 5

Res50 2100 24

VGG16 5700 25

ResNext 7080 32

DenseNet 2340 7

Inception 3000 27

BERT 4920 85

• CheckFreq reduces recovery time during an interruption from hours to seconds

• Recovery time : Time spent by the model to recover to the same state as it was before interruption

42

Page 43: CheckFreq - cs.utexas.edu

Conclusion

• CheckFreq provides an automatic, fine-grained checkpointing framework for DNN training

• CheckFreq allows frequent checkpointing while incurring a low cost

• When the job is interrupted, CheckFreq reduces recovery time for popular DNNs from hours to seconds

43

Page 44: CheckFreq - cs.utexas.edu

Thank you!

Contact : [email protected]

Source code : https://github.com/msr-fiddle/CheckFreq

44