Mark Silberstein - UT Austin 1
GPUfs: Integrating a file system with GPUs
Mark Silberstein (UT Austin/Technion)
Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)
Traditional System Architecture
Applications
OS
CPU
Modern System Architecture
Manycore processors
FPGA
Hybrid CPU-GPU
GPUs
CPU
Accelerated applications
OS
Software-hardware gap is widening
Manycore processors
FPGA
Hybrid CPU-GPU
GPUs
CPU
Accelerated applications
OS
Ad-hoc abstractions and management mechanisms
On-accelerator OS support closes the programmability gap
Manycore processors
FPGA
Hybrid CPU-GPU
GPUs
CPU
Accelerated applications
OS
On-accelerator OS support
Native accelerator applications
Coordination
● GPUfs: File I/O support for GPUs
● Motivation
● Goals
● Understanding the hardware
● Design
● Implementation
● Evaluation
Building systems with GPUs is hard. Why?
Data transfers
GPU invocation
Memory management
Goal of GPU programming frameworks
GPU
Parallel Algorithm
CPU
Headache for GPU programmers
Parallel Algorithm
GPU
Data transfers
Invocation
Memory management
CPU
Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC
GPU kernels are isolated
Parallel Algorithm
GPU
Data transfers
Invocation
Memory management
CPU
Example: accelerating photo collage
http://www.codeproject.com/Articles/36347/Face-Collage
while (Unhappy()) {
    Read_next_image_file();
    Decide_placement();
    Remove_outliers();
}
CPU Implementation
CPU CPU CPU
Application
while (Unhappy()) {
    Read_next_image_file();
    Decide_placement();
    Remove_outliers();
}
Offloading computations to GPU
CPU CPU CPU
Application
while (Unhappy()) {
    Read_next_image_file();
    Decide_placement();
    Remove_outliers();
}
Move to GPU
Offloading computations to GPU
GPU
CPU
Kernel start
Data transfer
Kernel termination
Co-processor programming model
Kernel start/stop overheads
CPU
GPU
copy to GPU
invoke
copy to CPU
Invocation latency
Synchronization
Cache flush
Hiding the overheads
CPU
GPU
copy to GPU
invoke
copy to CPU
copy to GPU
Manual data reuse management
Asynchronous invocation
Double buffering
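The double-buffering idea named on this slide can be sketched roughly as follows. This is a minimal, sequential illustration in plain C, not CUDA code and not taken from the talk: `fetch_chunk` and `process_chunk` are hypothetical stand-ins for the "copy to GPU" and "invoke" steps, and `CHUNK` is an arbitrary size. In a real program the fetch of the next chunk would be issued asynchronously so that it overlaps with processing of the current one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; arbitrary value for this sketch */

/* Hypothetical stand-in for "copy to GPU": stage up to CHUNK ints from src. */
static size_t fetch_chunk(const int *src, size_t total, size_t pos, int *buf) {
    size_t n = (total - pos < CHUNK) ? total - pos : CHUNK;
    if (n > 0) memcpy(buf, src + pos, n * sizeof *buf);
    return n;
}

/* Hypothetical stand-in for the kernel invocation: sums one staged chunk. */
static long process_chunk(const int *buf, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += buf[i];
    return s;
}

/* Double buffering: while buffer `cur` is processed, buffer `next` is filled. */
long sum_double_buffered(const int *src, size_t total) {
    int bufs[2][CHUNK];
    size_t len[2] = {0, 0};
    size_t pos = 0;
    int cur = 0;
    long sum = 0;

    assert(src != NULL);
    len[cur] = fetch_chunk(src, total, pos, bufs[cur]);  /* prime the first buffer */
    pos += len[cur];
    while (len[cur] > 0) {
        int next = 1 - cur;
        /* Real code would start this transfer asynchronously... */
        len[next] = fetch_chunk(src, total, pos, bufs[next]);
        pos += len[next];
        /* ...so that it overlaps with this processing step. */
        sum += process_chunk(bufs[cur], len[cur]);
        cur = next;  /* swap buffer roles */
    }
    return sum;
}
```

Even in this toy form, the application has to track two buffers, their lengths, and the swap order by hand, which is the kind of management code the slides describe.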
Implementation complexity
CPU
GPU
copy to GPU
invoke
copy to CPU
copy to GPU
Manual data reuse management
Asynchronous invocation
Double buffering
Management overhead
Why do we need to deal with low-level system details?
The reason is....
GPUs are peer-processors
They need I/O OS services
GPUfs: application view
CPUs, GPU1, GPU2, GPU3
open("shared_file")
mmap()
open("shared_file")
write()
Host File System
GPUfs
System-wide shared namespace
Persistent storage
POSIX (CPU)-like API
Accelerating collage app with GPUfs
CPU CPU CPU
GPUfs
open/read from GPU
GPU
No CPU management code
Accelerating collage app with GPUfs
CPU CPU CPU
GPUfs
GPU
GPUfs buffer cache
Overlapping computations and transfers
Read-ahead
Accelerating collage app with GPUfs
CPU CPU CPU
GPUfs
GPU
Data reuse
Random data access
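As a rough illustration of how a buffer cache gives data reuse under random access, here is a tiny direct-mapped page cache in plain C. It is a sketch under stated assumptions, not the GPUfs buffer cache: `PAGE_SIZE`, `NUM_SLOTS`, the direct-mapped placement, and all names are hypothetical, and a "miss" merely counts the host transfer that a real cache would perform.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size for this sketch */
#define NUM_SLOTS 8u     /* assumed capacity; direct-mapped for simplicity */

struct page_cache {
    long   tag[NUM_SLOTS];  /* page index held by each slot, -1 if empty */
    size_t hits;            /* accesses served from the cache */
    size_t misses;          /* accesses that would need a host transfer */
};

void cache_init(struct page_cache *c) {
    assert(c != NULL);
    for (size_t i = 0; i < NUM_SLOTS; i++) c->tag[i] = -1;
    c->hits = 0;
    c->misses = 0;
}

/* Touch the page covering byte `offset`; returns 1 on a hit, 0 on a miss. */
int cache_access(struct page_cache *c, size_t offset) {
    assert(c != NULL);
    long page = (long)(offset / PAGE_SIZE);
    size_t slot = (size_t)page % NUM_SLOTS;
    if (c->tag[slot] == page) {
        c->hits++;
        return 1;
    }
    c->tag[slot] = page;  /* a real cache would fetch the page from the host here */
    c->misses++;
    return 0;
}
```

Repeated accesses to the same page are served locally, so only the first touch of each page pays for a transfer, and the application code does not need to track which data is already resident.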
Challenge
GPU ≠ CPU
Massive parallelism
NVIDIA Fermi*: 23,000 active threads
AMD HD5870*: 31,000 active threads
* From M. Houston/A. Lefohn/K. Fatahalian, "A trip through the architecture of modern GPUs"
Parallelism is essential for performance in deeply multi-threaded wide-vector hardware
Heterogeneous memory
CPU <-> CPU memory: 10-32 GB/s
CPU <-> GPU: 6-16 GB/s
GPU <-> GPU memory: 288-360 GB/s (~x20)
GPUs inherently impose high bandwidth demands on memory
How to build an FS layer on this hardware?
GPUfs: principled redesign of the whole file system stack
● Relaxed FS API semantics for parallelism
● Relaxed FS consistency for heterogeneous memory
● GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ...
GPUfs high-level design
GPU application using GPUfs File API
Unchanged applications using OS File API
OS File System Interface
GPUfs hooks
GPUfs GPU File I/O library
GPUfs Distributed Buffer Cache (Page cache)
CPU Memory
GPU Memory
OS
CPU GPU
Disk
Host File System
Massive parallelism
Heterogeneous memory
Buffer cache semantics
Local or distributed file system data consistency?
GPUfs buffer cache
Weak data consistency model
● close(sync)-to-open semantics (AFS)
GPU1: write(1), fsync(), write(2)
GPU2: open(), read(1)
write(2): not visible to CPU
Remote-to-local memory performance ratio is similar to a distributed system
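The close(sync)-to-open timeline on this slide can be mimicked with a small host-only model in C. This is a sketch of the semantics, not GPUfs code: `host_file`, `local_view`, and the `view_*` functions are hypothetical names, and the sync step simply writes back the whole local copy (last writer wins).

```c
#include <assert.h>
#include <string.h>

#define FILE_SIZE 16

struct host_file { char data[FILE_SIZE]; };  /* authoritative copy on the host */

struct local_view {                          /* one per device, e.g. per GPU */
    struct host_file *host;
    char cache[FILE_SIZE];                   /* local copy taken at open time */
};

/* open: snapshot the host copy into the local cache. */
void view_open(struct local_view *v, struct host_file *h) {
    assert(v != NULL && h != NULL);
    v->host = h;
    memcpy(v->cache, h->data, FILE_SIZE);
}

/* write: updates only the local cache. */
void view_write(struct local_view *v, int off, char byte) {
    assert(v != NULL && off >= 0 && off < FILE_SIZE);
    v->cache[off] = byte;
}

/* read: served only from the local cache. */
char view_read(const struct local_view *v, int off) {
    assert(v != NULL && off >= 0 && off < FILE_SIZE);
    return v->cache[off];
}

/* sync: publish the local cache to the host copy. */
void view_sync(struct local_view *v) {
    assert(v != NULL && v->host != NULL);
    memcpy(v->host->data, v->cache, FILE_SIZE);
}
```

With two views standing in for GPU1 and GPU2, a value written and synced by the first becomes visible to the second only after it opens the file again, and a later unsynced write stays local, matching the write(1), fsync(), write(2) sequence above.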
On-GPU File I/O API
open/close -> gopen/gclose
read/write -> gread/gwrite
mmap/munmap -> gmmap/gmunmap
fsync/msync -> gfsync/gmsync
ftrunc -> gftrunc
In the paper
Changes in the semantics are crucial
Implementation bits
● Paging support
● Dynamic data structures and memory allocators
● Lock-free radix tree
● Inter-processor communications (IPC)
● Hybrid H/W-S/W barriers
● Consistency module in the OS kernel
In the paper
~1.5K GPU LOC, ~600 CPU LOC
Evaluation
All benchmarks are written as a GPU kernel: no CPU-side development
Matrix-vector product (inputs/outputs in files)
Vector 1x128K elements, page size = 2MB, GPU = Tesla C2075
[Chart: throughput (MB/s, axis 0 to 3500) vs. input matrix size (MB: 280, 560, 2800, 5600, 11200); series: CUDA pipelined, CUDA optimized, GPU file I/O]
Word frequency count in text
● Count frequency of modern English words in the works of Shakespeare, and in the Linux kernel source tree
Challenges
● Dynamic working set
● Small files
● Lots of file I/O (33,000 files, 1-5KB each)
● Unpredictable output size
English dictionary: 58,000 words
Results
| | 8 CPUs | GPU-vanilla | GPU-GPUfs |
| --- | --- | --- | --- |
| Linux source (33,000 files, 524MB) | 6h | 50m (7.2X) | 53m (6.8X) |
| Shakespeare (1 file, 6MB) | 292s | 40s (7.3X) | 40s (7.3X) |

Unbounded input/output size support
8% overhead
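The speedup factors in parentheses appear to be the 8-CPU time divided by the GPU time. A minimal C helper makes that arithmetic explicit; the function name `speedup` is hypothetical, and the inputs are the rounded times shown on the slide, so the results only approximate the reported factors.

```c
#include <assert.h>

/* Speedup of a run taking t_seconds relative to a baseline of base_seconds. */
double speedup(double base_seconds, double t_seconds) {
    assert(base_seconds > 0.0 && t_seconds > 0.0);
    return base_seconds / t_seconds;
}
```

For example, `speedup(6 * 3600, 50 * 60)` gives 7.2, `speedup(6 * 3600, 53 * 60)` gives about 6.8, and `speedup(292, 40)` gives 7.3.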
GPUfs
CPU GPU
GPUfs is the first system to provide native access to host OS services from GPU programs
Code is available for download at: https://sites.google.com/site/silbersteinmark/Home/gpufs
http://goo.gl/ofJ6J
Our life would have been easier with
● PCI atomics
● Preemptive background daemons
● GPU-CPU signaling support
● In-GPU exceptions
● GPU virtual memory API (host-based or device)
● Compiler optimizations for register-heavy libraries
● Seems to have been accomplished in 5.0
Sequential access to file: 3 versions
CUDA whole file transfer (CPU): Read file, Transfer to GPU
CUDA pipelined transfer (CPU): Read chunk, Transfer to GPU, Read chunk, Transfer to GPU, ...
GPU file I/O (GPU): gmmap()
Sequential read: throughput vs. page size
[Chart: throughput (MB/s, axis 0 to 4000) vs. page size (16K, 64K, 256K, 512K, 1M, 2M); series: GPU File I/O, CUDA whole file, CUDA pipeline]
Benefit: Decouple performance constraints from application logic
Yesterday: accelerators as co-processors
Tomorrow: accelerators as peers
On-accelerator OS support
Yesterday: accelerators as co-processors
Tomorrow: accelerators as peers
What about software?
Set GPUs free!
Parallel square root on GPU
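gpu_thread(thread_id i){
    float buffer;                                // one element per thread
    int fd = gopen(filename, O_GRDWR);           // every thread opens the same file
    size_t offset = sizeof(float) * i;           // byte offset of this thread's element
    gread(fd, sizeof(float), &buffer, offset);   // read the element from the file
    buffer = sqrt(buffer);
    gwrite(fd, sizeof(float), &buffer, offset);  // write the result back in place
    gclose(fd);
}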
The same code runs in each of the thousands of GPU threads
GPUfs impact on GPU programs
● Memory overhead
● Register pressure
● Very little CPU coding
● Makes exitless GPU kernels possible
Pay-as-you-go design
Preserve CPU semantics?
GPU threads are different from CPU threads
SIMD vector: Thread, Thread, Thread, Thread
SIMD vector: Thread, Thread, Thread, Thread
GPU kernel is a single data-parallel application
What does it mean to open/read/write/close/mmap a file in thousands of threads?
GPUfs semantics (see more discussion in the paper)
int fd = gopen("filename", O_GRDWR);
One file descriptor per file: open()/close() cached on a GPU
One call per SIMD vector: bulk-synchronous cooperative execution
SIMD vector: Thread, Thread, Thread, Thread
SIMD vector: Thread, Thread, Thread, Thread
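One way to picture "one file descriptor per file" with cached open()/close() is a reference-counted open-file table, sketched here in plain C. This is an assumption about how such caching could work, not the GPUfs implementation: `open_table`, `shared_open`, and `shared_close` are hypothetical names, the counters only model calls that would reach the host, and the code is single-threaded (a GPU version would need atomic updates).

```c
#include <assert.h>
#include <string.h>

#define MAX_FILES 8
#define MAX_NAME  64

struct open_entry { char name[MAX_NAME]; int refs; };

struct open_table {
    struct open_entry e[MAX_FILES];
    int host_opens;   /* opens that would actually reach the host */
    int host_closes;  /* closes that would actually reach the host */
};

void table_init(struct open_table *t) {
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/* Returns the descriptor shared by all callers opening `name`, or -1 on failure. */
int shared_open(struct open_table *t, const char *name) {
    int free_slot = -1;
    assert(t != NULL && name != NULL);
    if (strlen(name) >= MAX_NAME) return -1;
    for (int i = 0; i < MAX_FILES; i++) {
        if (t->e[i].refs > 0 && strcmp(t->e[i].name, name) == 0) {
            t->e[i].refs++;  /* already open: reuse the descriptor */
            return i;
        }
        if (t->e[i].refs == 0 && free_slot < 0) free_slot = i;
    }
    if (free_slot < 0) return -1;  /* table full */
    strcpy(t->e[free_slot].name, name);
    t->e[free_slot].refs = 1;
    t->host_opens++;  /* only the first open goes to the host */
    return free_slot;
}

/* Drops one reference; only the last close goes to the host. */
int shared_close(struct open_table *t, int fd) {
    assert(t != NULL);
    if (fd < 0 || fd >= MAX_FILES || t->e[fd].refs == 0) return -1;
    if (--t->e[fd].refs == 0) t->host_closes++;
    return 0;
}
```

Under this model, thousands of callers opening the same file share one descriptor and trigger a single host-side open, which is one plausible reading of the slide's semantics.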
GPU hardware characteristics
Parallelism
Heterogeneous memory
API semantics
int fd = gopen("filename", O_GRDWR);
CPU
int fd = gopen("filename", O_GRDWR);
This code runs in 100,000 GPU threads
int fd = gopen("filename", O_GRDWR);
CPU ≠ GPU
int fd = gopen("filename", O_GRDWR);