
Mark Silberstein - UT Austin 1

GPUfs: Integrating a file system with GPUs

Mark Silberstein (UT Austin/Technion)

Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

ASPLOS 2013

Mark Silberstein - UT Austin 2

Traditional System Architecture

Applications

OS

CPU

Mark Silberstein - UT Austin 3

Modern System Architecture

Manycore processors, FPGA, hybrid CPU-GPU, GPUs, CPU

Accelerated applications

OS

Mark Silberstein - UT Austin 4

Software-hardware gap is widening

Manycore processors, FPGA, hybrid CPU-GPU, GPUs, CPU

Accelerated applications

OS

Mark Silberstein - UT Austin 5

Software-hardware gap is widening

Manycore processors, FPGA, hybrid CPU-GPU, GPUs, CPU

Accelerated applications

OS: ad-hoc abstractions and management mechanisms

Mark Silberstein - UT Austin 6

On-accelerator OS support closes the programmability gap

Manycore processors, FPGA, hybrid CPU-GPU, GPUs, CPU

Accelerated applications

OS + on-accelerator OS support

Native accelerator applications

Coordination

Mark Silberstein - UT Austin 7

● GPUfs: File I/O support for GPUs
● Motivation
● Goals
● Understanding the hardware
● Design
● Implementation
● Evaluation

Mark Silberstein - UT Austin 8

Building systems with GPUs is hard. Why?

Mark Silberstein - UT Austin 9

Data transfers
GPU invocation

Memory management

Goal of GPU programming frameworks

GPU

Parallel Algorithm

CPU

Mark Silberstein - UT Austin 10

Headache for GPU programmers

Parallel Algorithm

GPU

Data transfers
Invocation

Memory management

CPU

Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC

Mark Silberstein - UT Austin 11

GPU kernels are isolated

Parallel Algorithm

GPU

Data transfers
Invocation

Memory management

CPU

Mark Silberstein - UT Austin 12

Example: accelerating photo collage

http://www.codeproject.com/Articles/36347/Face-Collage

While(Unhappy()) {
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}

Mark Silberstein - UT Austin 13

CPU Implementation

CPUs running the application

While(Unhappy()) {
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}

Mark Silberstein - UT Austin 14

Offloading computations to GPU

CPUs running the application

While(Unhappy()) {
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}

Move to GPU

Mark Silberstein - UT Austin 15

Offloading computations to GPU

GPU

CPU

Kernel start

Data transfer

Kernel termination

Co-processor programming model
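To make the co-processor model concrete: a minimal CUDA sketch, with a hypothetical process_images kernel standing in for the collage computation. The CPU allocates device memory, copies the input in, invokes the kernel, and copies the result back.

    // Hypothetical sketch of the co-processor programming model above.
    // process_images is a stand-in for the real per-pixel computation.
    #include <cuda_runtime.h>

    __global__ void process_images(const unsigned char* in, unsigned char* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                                 // placeholder work
    }

    void run_on_gpu(const unsigned char* host_in, unsigned char* host_out, int n) {
        unsigned char *d_in, *d_out;
        cudaMalloc((void**)&d_in, n);
        cudaMalloc((void**)&d_out, n);

        cudaMemcpy(d_in, host_in, n, cudaMemcpyHostToDevice);     // data transfer in
        process_images<<<(n + 255) / 256, 256>>>(d_in, d_out, n); // kernel invocation
        cudaMemcpy(host_out, d_out, n, cudaMemcpyDeviceToHost);   // kernel termination, transfer out

        cudaFree(d_in);
        cudaFree(d_out);
    }

Every one of these steps (transfer, invocation, memory management) is CPU-side code that exists only to feed the GPU.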

Mark Silberstein - UT Austin 16

Kernel start/stop overheads

CPU

GPU

copy to GPU, invoke, copy to CPU

Invocation latency

Synchronization

Cache flush
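One hypothetical way to see the invocation latency called out above (not part of the talk) is to time an empty kernel with CUDA events; the exact number depends on the GPU, the driver, and whether the context is already warm.

    // Rough measurement of per-launch overhead using an empty kernel.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void empty_kernel() {}

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        empty_kernel<<<1, 1>>>();       // warm-up launch: exclude one-time init costs
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        empty_kernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("empty kernel launch + completion: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }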

Mark Silberstein - UT Austin 17

Hiding the overheads

CPU

GPU

copy to GPU, invoke, copy to CPU

Manual data reuse management
Asynchronous invocation

Double buffering

copy to GPU
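A minimal sketch of the double-buffering pattern mentioned above, assuming the input splits into independent chunks and that pinned_in points to page-locked host memory (required for truly asynchronous copies); process_chunk is a hypothetical stand-in kernel.

    // Overlap host-to-device copies with kernel execution using two
    // streams and two device buffers (double buffering).
    #include <cuda_runtime.h>

    __global__ void process_chunk(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                     // placeholder work
    }

    void process_all(const float* pinned_in, int total, int chunk) {
        float* d_buf[2];
        cudaStream_t stream[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc((void**)&d_buf[b], chunk * sizeof(float));
            cudaStreamCreate(&stream[b]);
        }

        for (int off = 0, b = 0; off < total; off += chunk, b ^= 1) {
            int n = (total - off < chunk) ? (total - off) : chunk;
            // The copy for this chunk overlaps with the kernel still
            // running on the other stream for the previous chunk.
            cudaMemcpyAsync(d_buf[b], pinned_in + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream[b]);
            process_chunk<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], n);
        }

        for (int b = 0; b < 2; ++b) {
            cudaStreamSynchronize(stream[b]);
            cudaStreamDestroy(stream[b]);
            cudaFree(d_buf[b]);
        }
    }

All of this is exactly the kind of CPU-side plumbing the next slides count as management overhead.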

Mark Silberstein - UT Austin 18

Implementation complexity

CPU

GPU

copy to GPU, invoke, copy to CPU

Manual data reuse management
Asynchronous invocation

Double buffering

copy to GPU

Management overhead

Mark Silberstein - UT Austin 19

Implementation complexity

CPU

GPU

copy to GPU, invoke, copy to CPU

Manual data reuse management
Asynchronous invocation

Double buffering

copy to GPU

Why do we need to deal with low-level system details?

Management overhead

Mark Silberstein - UT Austin 20

The reason is....

GPUs are peer-processors

They need OS services for I/O

Mark Silberstein - UT Austin 21

GPUfs: application view

CPUs   GPU1   GPU2   GPU3

open(“shared_file”), mmap(), open(“shared_file”), write()

Host File System

GPUfs

Mark Silberstein - UT Austin 22

GPUfs: application view

CPUs   GPU1   GPU2   GPU3

open(“shared_file”), mmap(), open(“shared_file”), write()

Host File System

GPUfs

System-wide shared namespace

Persistent storage

POSIX (CPU)-like API

Mark Silberstein - UT Austin 23

Accelerating collage app with GPUfs

CPUs

GPUfs

open/read from GPU

GPU

No CPU management code
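A hedged sketch of what the collage loop could look like once it runs entirely inside a GPU kernel and reads image files through GPUfs. The gopen/gread/gclose names come from the API slide later in the talk; their exact signatures and the helper routines (Unhappy, Next_image_path, Decide_placement, Remove_outliers) are assumptions for illustration. In GPUfs the file calls are issued cooperatively by a whole threadblock rather than by a single thread.

    // Hypothetical GPU-side collage loop using a GPUfs-style file API.
    // The declarations below are assumed signatures, not the real library.
    __device__ int    gopen(const char* path, int flags);
    __device__ size_t gread(int fd, size_t offset, size_t size, unsigned char* buf);
    __device__ int    gclose(int fd);

    __device__ bool        Unhappy();
    __device__ const char* Next_image_path();
    __device__ void        Decide_placement(const unsigned char* img, size_t len);
    __device__ void        Remove_outliers();

    __global__ void collage_kernel(unsigned char* scratch, size_t scratch_len) {
        while (Unhappy()) {
            int fd = gopen(Next_image_path(), /*O_RDONLY*/ 0);
            size_t len = gread(fd, 0, scratch_len, scratch);  // pull the image through the GPU buffer cache
            gclose(fd);
            Decide_placement(scratch, len);
            Remove_outliers();
        }
        // No CPU-side transfer, invocation, or buffer-management code:
        // file data is staged through the GPUfs buffer cache on demand.
    }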

Mark Silberstein - UT Austin 24

CPUs

GPUfs buffer cache

GPU

GPUfs

Overlapping computations and transfers

Read-ahead

Accelerating collage app with GPUfs

Mark Silberstein - UT Austin 25

CPUs

GPUfs

GPU

Data reuse

Accelerating collage app with GPUfs

Random data access

Mark Silberstein - UT Austin 26

Challenge

GPU ≠ CPU

Mark Silberstein - UT Austin 27

Massive parallelism

NVIDIA Fermi* AMD HD5870*

From M. Houston/A. Lefohn/K. Fatahalian – A trip through the architecture of modern GPUs*

23,000 active threads

31,000 active threads

Parallelism is essential for performance in deeply multi-threaded wide-vector hardware

Mark Silberstein - UT Austin 28

Heterogeneous memory

CPU memory: 10-32 GB/s   |   CPU-GPU interconnect: 6-16 GB/s   |   GPU memory: 288-360 GB/s (~20x)

GPUs inherently impose high bandwidth demands on memory

Mark Silberstein - UT Austin 29

How to build an FS layer on this hardware?

Mark Silberstein - UT Austin 30

GPUfs: principled redesign of the whole file system stack

● Relaxed FS API semantics for parallelism

● Relaxed FS consistency for heterogeneous memory

● GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ….

Mark Silberstein - UT Austin 31

GPUfs high-level design

[Diagram: on the GPU, an application uses the GPUfs File API through the GPUfs GPU file I/O library, backed by a page cache in GPU memory; the GPUfs distributed buffer cache spans GPU and CPU memory. On the CPU, unchanged applications use the OS File API, and GPUfs hooks into the OS file system interface above the host file system and the disk.]

Massive parallelism, heterogeneous memory

Mark Silberstein - UT Austin 32

GPUfs high-level design

[Same diagram as the previous slide: GPUfs GPU file I/O library and GPU-memory page cache on the GPU, GPUfs hooks and the host file system on the CPU, connected by the GPUfs distributed buffer cache]

Mark Silberstein - UT Austin 33

Buffer cache semantics

Local or distributed file system data consistency?

Mark Silberstein - UT Austin 34

GPUfs buffer cache: weak data consistency model

● close(sync)-to-open semantics (AFS)

GPU1: write(1), fsync(), write(2)

GPU2: open(), read(1)

write(2) is not visible to the CPU until the next fsync/close

The remote-to-local memory performance ratio is similar to that of a distributed system
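A minimal sketch of this close(sync)-to-open behavior from application code, using the g* names from the next slide with assumed signatures.

    // Hypothetical illustration of close(sync)-to-open consistency.
    __device__ int    gopen(const char* path, int flags);
    __device__ size_t gwrite(int fd, size_t offset, size_t size, const unsigned char* buf);
    __device__ size_t gread(int fd, size_t offset, size_t size, unsigned char* buf);
    __device__ int    gfsync(int fd);
    __device__ int    gclose(int fd);

    // Runs on GPU1: produce data and publish it.
    __global__ void producer(const unsigned char* data, size_t len) {
        int fd = gopen("shared_file", /*O_WRONLY*/ 1);
        gwrite(fd, 0, len, data);     // write(1)
        gfsync(fd);                   // after this, a later open() observes write(1)
        gwrite(fd, len, len, data);   // write(2): cached on GPU1, not yet visible elsewhere
        gclose(fd);                   // close propagates write(2) as well
    }

    // Runs on GPU2 (or on the CPU via plain open/read): its open() happens
    // after GPU1's fsync(), so read(1) sees write(1).
    __global__ void consumer(unsigned char* out, size_t len) {
        int fd = gopen("shared_file", /*O_RDONLY*/ 0);
        gread(fd, 0, len, out);       // read(1)
        gclose(fd);
    }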

Mark Silberstein - UT Austin 35

On-GPU File I/O API

CPU API            On-GPU API
open / close       gopen / gclose
read / write       gread / gwrite
mmap / munmap      gmmap / gmunmap
fsync / msync      gfsync / gmsync
ftrunc             gftrunc

(Details in the paper)

Changes in the semantics are crucial
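For the mmap-style entries in the table, a hedged sketch of reading a file window in place through gmmap rather than copying it out with gread. The signatures mirror POSIX mmap but are assumptions, as are the surrounding kernel and the convention that every thread of the block participates in each call (the paper issues file calls collectively per threadblock).

    // Hypothetical in-place file access via gmmap/gmunmap.
    __device__ void*  gmmap(void* addr, size_t size, int prot, int flags, int fd, size_t offset);
    __device__ int    gmunmap(void* addr, size_t size);
    __device__ int    gopen(const char* path, int flags);
    __device__ int    gclose(int fd);

    // Sum `size` bytes of floats starting at `offset` of the file at `path`
    // (path must point to device-visible memory).
    __global__ void sum_file_window(const char* path, size_t offset, size_t size, float* result) {
        int fd = gopen(path, /*O_RDONLY*/ 0);
        // The mapping points into the GPU-resident buffer cache, so no
        // separate copy-out buffer is needed.
        float* data = (float*)gmmap(NULL, size, 0, 0, fd, offset);

        float partial = 0.0f;
        for (size_t i = threadIdx.x; i < size / sizeof(float); i += blockDim.x)
            partial += data[i];
        atomicAdd(result, partial);

        __syncthreads();              // all threads done reading the mapping
        gmunmap(data, size);
        gclose(fd);
    }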

Mark Silberstein - UT Austin 36

Implementation bits

● Paging support
● Dynamic data structures and memory allocators
● Lock-free radix tree
● Inter-processor communications (IPC), see the sketch below
● Hybrid H/W-S/W barriers
● Consistency module in the OS kernel

(Details in the paper)

~1.5K GPU LOC, ~600 CPU LOC
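The IPC item above is about the GPU requesting work from the CPU, for example asking it to issue host file I/O. Below is a minimal, hypothetical sketch of one common way to build such a channel: a request mailbox in pinned, zero-copy host memory that a GPU thread fills and the CPU polls. It illustrates the general technique, not GPUfs's actual protocol.

    // GPU-to-CPU request mailbox over pinned, mapped host memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    struct Request {
        volatile int state;   // 0 = empty, 1 = posted by GPU, 2 = completed by CPU
        volatile int opcode;  // hypothetical request code
        volatile int result;
    };

    __global__ void gpu_client(Request* req) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            req->opcode = 1;             // fill in the request
            __threadfence_system();      // make it visible to the CPU first
            req->state = 1;              // then post it
            while (req->state != 2) { }  // spin until the CPU completes it
            printf("GPU saw result %d\n", req->result);
        }
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);

        Request* req;
        cudaHostAlloc((void**)&req, sizeof(Request), cudaHostAllocMapped);
        req->state = 0; req->opcode = 0; req->result = 0;

        Request* dreq;
        cudaHostGetDevicePointer((void**)&dreq, req, 0);
        gpu_client<<<1, 1>>>(dreq);      // asynchronous launch: CPU keeps running

        while (req->state != 1) { }      // wait for the GPU's request
        req->result = 42;                // pretend we performed the host-side work
        __sync_synchronize();            // publish result before the completion flag
        req->state = 2;

        cudaDeviceSynchronize();
        cudaFreeHost(req);
        return 0;
    }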

Mark Silberstein - UT Austin 37

Evaluation

All benchmarks are written as a GPU kernel: no CPU-side development

Mark Silberstein - UT Austin 38

Matrix-vector product (inputs/outputs in files)
Vector: 1 x 128K elements, page size = 2 MB, GPU: Tesla C2075

[Plot: throughput (MB/s) vs. input matrix size (MB) for CUDA pipelined, CUDA optimized, and GPU file I/O]

Mark Silberstein - UT Austin 39

Word frequency count in text

● Count frequency of modern English words in the works of Shakespeare, and in the Linux kernel source tree

Challenges:
● Dynamic working set
● Small files
● Lots of file I/O (33,000 files, 1-5 KB each)
● Unpredictable output size

English dictionary: 58,000 words

Mark Silberstein - UT Austin 40

Results

                                      8 CPUs      GPU-vanilla     GPU-GPUfs
Linux source (33,000 files, 524 MB)   6h 50m      (7.2X)          53m (6.8X)
Shakespeare (1 file, 6 MB)            292s        40s (7.3X)      40s (7.3X)

Mark Silberstein - UT Austin 41

Results

                                      8 CPUs      GPU-vanilla     GPU-GPUfs
Linux source (33,000 files, 524 MB)   6h 50m      (7.2X)          53m (6.8X)
Shakespeare (1 file, 6 MB)            292s        40s (7.3X)      40s (7.3X)

Unbounded input/output size support

8% overhead

Mark Silberstein - UT Austin 42

GPUfs: CPU + GPU

GPUfs is the first system to provide native access to host OS services from GPU programs.

Code is available for download at:
https://sites.google.com/site/silbersteinmark/Home/gpufs
http://goo.gl/ofJ6J