offload annotations: bringing heterogeneous computing to … · offload annotations: bringing...
TRANSCRIPT
![Page 1: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/1.jpg)
Offload Annotations: Bringing Heterogeneous Computing to
Existing Libraries and WorkloadsGina Yuan, Shoumik Palkar, Deepak Narayanan, Matei Zaharia
Stanford UniversityUSENIX ATC 2020 (July 15-17)
![Page 2: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/2.jpg)
Background: Hardware Commoditization
2
NVIDIA GPU
![Page 3: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/3.jpg)
Background: CPUs vs. GPUs
Core Core
Core CoreControl
Cache
Memory Memory
4-way parallelism512GB memory
1000-ways parallelism!16GB memory
CPUs GPUs
(PCI-E)
Costly data transfers!
3
![Page 4: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/4.jpg)
Background: Data Science on the CPU
+(CPU)
4Popular Python data science libraries for the CPU.
![Page 5: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/5.jpg)
NVIDIA GPU
Trend: Data Science on the GPU
+
5NEW Python data science libraries for the GPU.
(cuDF, cuML, etc.)
Lots of parallel data!
![Page 6: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/6.jpg)
Trend: CPU Libraries vs. GPU Librarieshttps://github.com/rapidsai/cudf https://cupy.chainer.org/
https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html
https://github.com/rapidsai/cuml
6
![Page 7: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/7.jpg)
Trend: CPU Libraries vs. GPU Librarieshttps://github.com/rapidsai/cudf https://cupy.chainer.org/
https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html
https://github.com/rapidsai/cuml
Are GPU libraries as straightforward to use as they seem?
7
![Page 8: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/8.jpg)
Motivating Example
8
cuML
![Page 9: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/9.jpg)
cumlcuml
Motivating Example
Missing Functions
9
![Page 10: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/10.jpg)
cumlcuml
Motivating Example
Missing Functions
Manual Data Transfers
X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
X_test = transfer(X_test, GPU)
result = transfer(result, CPU)
10
![Page 11: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/11.jpg)
cumlcuml
Motivating Example
Missing Functions
Manual Data Transfers
Small GPU Memory
X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
X_test[i,j]=transfer(X_test[i,j], GPU)
result[i,j]=transfer(result[i,j], CPU)
for (i,j) in split(X_test):
[i,j] [i,j])[i,j] [i,j])
[i,j] [i,j])
11
![Page 12: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/12.jpg)
cumlcuml
Motivating Example
Missing Functions
Manual Data Transfers
Small GPU Memory
Scheduling
X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
X_test[i,j]=transfer(X_test[i,j], GPU)
result[i,j]=transfer(result[i,j], CPU)
for (i,j) in split(X_test):
[i,j] [i,j])[i,j] [i,j])
[i,j] [i,j])
??????
12
![Page 13: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/13.jpg)
Solution: Offload AnnotationsThe annotator writes offload annotations (OAs) for CPU libraries. An end user imports the annotated library instead of the CPU library. Our runtime, Bach, automatically schedules data transfers and pages computation.
13
![Page 14: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/14.jpg)
With less developer effort:1. Match handwritten GPU performance
Goals
14
![Page 15: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/15.jpg)
With less developer effort:1. Match handwritten GPU performance2. Scale to data sizes larger than GPU memory
Goals
15
![Page 16: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/16.jpg)
With less developer effort:1. Match handwritten GPU performance2. Scale to data sizes larger than GPU memory3. Beat CPU performance
Goals
16
![Page 17: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/17.jpg)
multiply = @oa(func=torch.mul)(np.multiply)sqrt = @oa(func=torch.sqrt)(np.sqrt)
Step 1: Annotator – Function Annotations
17
GPU library CPU library
![Page 18: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/18.jpg)
multiply = @oa(func=torch.mul)(np.multiply)sqrt = @oa(func=torch.sqrt)(np.sqrt)
Step 1: Annotator – Function Annotations
18
GPU library CPU library
corresponding functions
![Page 19: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/19.jpg)
inputs outputs
arg = (NdArrayType(),)args = (NdArrayType(), NdArrayType())ret = NdArrayType()
multiply = @oa(args, ret, func=torch.mul)(np.multiply)sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)
Step 1: Annotator – Function Annotations
19
![Page 20: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/20.jpg)
arg = (NdArrayType(),)args = (NdArrayType(), NdArrayType())ret = NdArrayType()
multiply = @oa(args, ret, func=torch.mul)(np.multiply)sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)ones = @oa_alloc(ret, func=torch.ones)(np.ones)
Step 1: Annotator – Allocation Annotations
20
Allocations only have a return type.
![Page 21: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/21.jpg)
arg = (NdArrayType(),)args = (NdArrayType(), NdArrayType())ret = NdArrayType()
multiply = @oa(args, ret, func=torch.mul)(np.multiply)sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)ones = @oa_alloc(ret, func=torch.ones)(np.ones)
Step 1: Annotator – Allocation Annotations
21
What’s in an offload split type?
"offload split type"
![Page 22: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/22.jpg)
API Description
device(value) Which device the value is on.
to(value, device) Transfers [value] to [device].offloading
API
Step 1: Annotator – Offload Split Type
22
![Page 23: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/23.jpg)
API Description
device(value) Which device the value is on.
to(value, device) Transfers [value] to [device].offloading
API
Step 1: Annotator – Offload Split Type
23
API Implementation for NdArrayType()
device(value) ...if isinstance(value, torch.Tensor): ...
to(value, device) ...value.to(torch.device('cpu')).numpy()
![Page 24: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/24.jpg)
API Descriptionsize(value) Number of elements in the value.split(start, end, value) Splits a value to enable paging.
merge(values) Merges split values.
splitting API[Mozart
SOSP ‘19](optional)
Step 1: Annotator – Offload Split Type
24
![Page 25: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/25.jpg)
API Descriptionsize(value) Number of elements in the value.split(start, end, value) Splits a value to enable paging.
merge(values) Merges split values.
splitting API[Mozart
SOSP ‘19](optional)
Step 1: Annotator – Offload Split Type
25
API Implementation for NdArrayType()size(value) return value.shape[-1]split(start, end, value) return value[start, end]merge(values) return np.concatenate(values)
![Page 26: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/26.jpg)
Step 1: Annotator – Offload Split Type
26
DataFrameType() ModelType()NdArrayType()
![Page 27: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/27.jpg)
import numpy as np# Allocatea = np.ones(size, dtype='float64')b = np.ones(size, dtype='float64’)# Computenp.arcsin(a, out=a)np.multiply(a, b, out=b)np.sqrt(b, out=b)
Step 2: End User
(Simple, somewhat dumb, Python program.)
27
End User ≠Annotator
![Page 28: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/28.jpg)
import bach.numpy as np# Allocatea = np.ones(size, dtype='float64')b = np.ones(size, dtype='float64’)# Computenp.arcsin(a, out=a)np.multiply(a, b, out=b)np.sqrt(b, out=b)
Import the annotated library instead of the CPU library.
Step 2: End User
28
![Page 29: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/29.jpg)
import bach.numpy as np# Allocatea = np.ones(size, dtype='float64')b = np.ones(size, dtype='float64’)# Computenp.arcsin(a, out=a)np.multiply(a, b, out=b)np.sqrt(b, out=b)
np.evaluate()
Step 2: End User
Explicitly materialize lazy values with included evaluate() function.
29
![Page 30: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/30.jpg)
Generate a lazy computation graph and do a topological sort.
Step 3: Runtime - Scheduling
?
?
?
?
?
1
2
4
5
3
np.sqrt(b)
12
45
3
a = np.ones()np.arcsin(a)
b = np.ones()np.multiply(a,b)
= Allocation
= CPU= GPU
30
Bach
![Page 31: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/31.jpg)
Assign functions to the CPU/GPU based on whether a GPU library implementation is provided in the annotation.
Step 3: Runtime - Scheduling
?
?
?
1
2
4
5
3
12
45
3
= Allocation
= CPU= GPU
No.
GPU library imple-
mentation provided?
Yes.
Yes. np.sqrt(b)
a = np.ones()np.arcsin(a)
b = np.ones()np.multiply(a,b)
31
Bach
![Page 32: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/32.jpg)
Assign allocations to the CPU/GPU so they are on the same device as the first function that uses the data.
Step 3: Runtime - Scheduling
?
?
?
1
2
4
5
3
12
45
3
= Allocation
= CPU= GPU
np.sqrt(b)
a = np.ones()np.arcsin(a)
b = np.ones()np.multiply(a,b)
32
Bach
![Page 33: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/33.jpg)
Automatically transfer data using the offloading API.
Step 3: Runtime – Offloading API
?
?
?
1
2
4
5
3
12
45
3
-- transfer to CPU --
-- transfer to GPU --
= Allocation
= CPU= GPU
np.sqrt(b)
a = np.ones()np.arcsin(a)
b = np.ones()np.multiply(a,b)
33
Bach
![Page 34: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/34.jpg)
np.sqrt(b)
a = np.ones()np.arcsin(a)
b = np.ones()np.multiply(a,b)
Automatically page large datasets using the splitting API.
Step 3: Runtime – Splitting API
?
?
?
1
2
4
5
3
12
45
3
Split
Merge
-- transfer to CPU --
-- transfer to GPU --
Data Size = 2^28
= Allocation
= CPU= GPU
34
Bach
![Page 35: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/35.jpg)
Naive cost-benefit analysis between data transfer and computation cost.
Step 3: Runtime – Scheduling Heuristics
?
?
?
?
1
2
4
5
3
= Allocation
= CPU= GPU
GPU Compute + Data Transfer
CPU Compute
Data Size = 2^28
(optional)
35
![Page 36: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/36.jpg)
?
?
?
?
1
2
4
5
3
= Allocation
= CPU= GPU
GPU Compute + Data Transfer
CPU Compute
Data Size = 2^10
36
Naive cost-benefit analysis between data transfer and computation cost.
Step 3: Runtime – Scheduling Heuristics (optional)
![Page 37: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/37.jpg)
Naive implementations of cost estimators.
Estimated Cost
Data Size
CPU Compute GPU Compute
+ Data Transfer
2^10 2^28
CPU GPU
37
Step 3: Runtime – Scheduling Heuristics (optional)
![Page 38: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/38.jpg)
Evaluation4 library integrations and 8 data science and ML workloads.
38
![Page 39: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/39.jpg)
~130 LOC per library including offloading / splitting APIs and function annotations.
Integration Experience
39
![Page 40: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/40.jpg)
Speedup: max 1200x, median 6.3x.
Evaluation: Summary
40
![Page 41: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/41.jpg)
Evaluation: Summary
41
![Page 42: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/42.jpg)
With less developer effort, Bach can:1. Match handwritten GPU
performance
Evaluation: Summary
42
![Page 43: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/43.jpg)
With less developer effort, Bach can:1. Match handwritten GPU
performance2. Scale to data sizes larger than
GPU memory
Evaluation: Summary
43
![Page 44: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/44.jpg)
With less developer effort, Bach can:1. Match handwritten GPU
performance2. Scale to data sizes larger than
GPU memory3. Beat CPU performance
Evaluation: Summary
44
2.3x
6.8x
![Page 45: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/45.jpg)
Crime Index saves time by eliminating the initial data transfer, while the allocation still fits in GPU memory.
In-Depth Evaluation: Allocations
45
4.6x
1.1x
![Page 46: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/46.jpg)
At smaller data sizes, TSVD schedules all computation on the CPU.
In-Depth Evaluation: Heuristics
46
11x
![Page 47: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/47.jpg)
[Motivating Example]The "fit" phase dominates the runtime until the "predict" phase can split/page data into the GPU.
In-Depth Evaluation: Splitting/Paging Datasets
47
6.2x
2.0x
![Page 48: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/48.jpg)
Evaluation: Summary
1.7x
6.9x 0.81x5.7x
1200x 6.8x
4.6x
11x
Max Speedup
48
![Page 49: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/49.jpg)
Evaluation: Summary
1.7x
6.9x 0.81x5.7x
1200x 6.8x
4.6x
11x
Max Speedup
49
![Page 50: Offload Annotations: Bringing Heterogeneous Computing to … · Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads Gina Yuan, Shoumik Palkar,](https://reader033.vdocuments.mx/reader033/viewer/2022051806/5ffceff9a71eb82da1449072/html5/thumbnails/50.jpg)
OAs enable heterogeneous GPU computing in existing libraries and workloads with little to no code modifications.
Conclusion
github.com/stanford-futuredata/[email protected]
With less developer effort, Bach + OAs can:§ Match handwritten GPU performance§ Scale to data sizes larger than GPU memory§ Beat CPU performance
50