programming the gpu with array-oriented syntax in python | gtc … · 2013. 3. 22. · programming...

24
Programming the GPU with Array- Oriented Syntax in Python GTC 2013 March 21, 2013 Travis E. Oliphant Thursday, March 21, 13

Upload: others

Post on 21-Mar-2021

26 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Programming the GPU with Array-Oriented Syntax in Python

GTC 2013March 21, 2013

Travis E. Oliphant

Thursday, March 21, 13

Page 2: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Inroduction

AfterBefore

⇢0 (2⇡f)2 Ui (a, f) = [Cijkl (a, f)Uk,l (a, f)],j

Thursday, March 21, 13

Page 3: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Big Picture

Empower domain experts with high-level tools that exploit modern hard-ware

Array Oriented Computing

?

Thursday, March 21, 13

Page 4: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Why Python

• Syntax is accessible (like Matlab, R, IDL, Mathematica, etc.)• Can still “build a system” with it• Powerful libraries for STEM (NumPy, SciPy, SymPy, ...)• Great visualization tools (Matplotlib, Chaco, Visit, Mayavi, ...)• Rapid development with interactive coding- Extends your short-term memory- Quick reference to functionality- Allows you to fail quickly

• Commercial support• Large community

Thursday, March 21, 13

Page 5: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Breaking the Speed Barrier (Numba!)

Numba is open source and freely available for producing sequential code. Can speed up “typed”

Python (i.e. Python using typed containers)

http://numba.pydata.org

rapid iteration and development+

fast code executionideal combination!=

Thursday, March 21, 13

Page 6: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

NumPy + Mamba = Numba

LLVM Library

Intel Nvidia AppleAMD

OpenCLISPC CUDA CLANGOpenMP

LLVM-PY

Python Function Machine Code

ARM

Thursday, March 21, 13

Page 7: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Compiler Overview

LLVM IR

x86

C++

ARM

PTX

C

Fortran

Python

Numba turns Python into a “compiled language”

PPC

Numba LLVM

Thursday, March 21, 13

Page 8: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Example

Numba

Thursday, March 21, 13

Page 9: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Image Processing

from numba import jit

@jit('void(f8[:,:],f8[:,:],f8[:,:])')def filter(image, filt, output): M, N = image.shape m, n = filt.shape for i in range(m//2, M-m//2): for j in range(n//2, N-n//2): result = 0.0 for k in range(m): for l in range(n): result += image[i+k-m//2,j+l-n//2]*filt[k, l] output[i,j] = result

~1500x speed-up on CPU

Thursday, March 21, 13

Page 10: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

~150x speed-up Real-time image processing in Python (50 fps Mandelbrot)

Thursday, March 21, 13

Page 11: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Let’s target GPUs

Python and NumPy compiled to Parallel Architectures (GPUs and multi-core machines)

• Fast Vectorize for the GPU• Create parallel-for loops (multiple

threads on CPU)• Parallel execution of ufuncs (on

CPU)• CUDA Python: write CUDA

directly in Python!fast development and fast execution!

Thursday, March 21, 13

Page 12: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Create parallel-for loops“prange” directive that spawns compiled tasks in threads (like Open-MP parallel-for pragma)

from numbapro import autojit, prange

@autojitdef parallel_sum2d(a): sum = 0.0 for i in prange(a.shape[0]): for j in range(a.shape[1]): sum += a[i,j]

Thursday, March 21, 13

Page 13: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Fast vectorize

NumPy’s ufuncs take “kernels” and apply the kernel element-by-element over entire arrays

Write kernels in Python!from numbapro import vectorizefrom math import sin

@vectorize([‘f8(f8)’, ‘f4(f4)’])def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

Thursday, March 21, 13

Page 14: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Ufuncs in parallel (multi-thread or GPU)from numbapro import vectorizefrom math import sin

@vectorize([‘f8(f8)’, ‘f4(f4)’], target=‘gpu’)def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

@vectorize([‘f8(f8)’, ‘f4(f4)’], target=‘parallel’)def sinc2(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

Thursday, March 21, 13

Page 15: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Example: Mandelbrot Vectorizedfrom numbapro import vectorize

sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

@vectorize([sig], target='gpu')def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters): pixel_size_x = (max_x - min_x) / width pixel_size_y = (max_y - min_y) / height

x = tid % width y = tid / width

real = min_x + x * pixel_size_x imag = min_y + y * pixel_size_y

c = complex(real, imag) z = 0.0j

for i in range(iters): z = z * z + c if (z.real * z.real + z.imag * z.imag) >= 4: return i return 255

Kind Time Speed-up

Python 263.6 1.0x

CPU 2.639 100x

GPU 0.1676 1573x

Tesla S2050

Thursday, March 21, 13

Page 16: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Introducing CUDA-Pythonfrom numbapro import cudafrom numba import autojit

@autojit(target=‘gpu’)def array_scale(src, dst, scale): tid = cuda.threadIdx.x blkid = cuda.blockIdx.x blkdim = cuda.blockDim.x

i = tid + blkid * blkdim

if i >= n: return

dst[i] = src[i] * scale

src = np.arange(N, dtype=np.float)dst = np.empty_like(src)

array_scale[grid, block](src, dst, 5.0)

CUDA Developmentusing Python syntax for optimial performance!

Thursday, March 21, 13

Page 17: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Example: Black-Scholes

@cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:], double, double))def black_scholes_cuda(callResult, putResult, S, X, T, R, V): # S = stockPrice # X = optionStrike # T = optionYears # R = Riskfree # V = Volatility i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x if i >= S.shape[0]: return sqrtT = math.sqrt(T[i]) d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT) d2 = d1 - V * sqrtT cndd1 = cnd_cuda(d1) cndd2 = cnd_cuda(d2)

expRT = math.exp((-1. * R) * T[i]) callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2) putResult[i] = (X[i] * expRT * (1.0 - cndd2) -

S[i] * (1.0 - cndd1))

@cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)def cnd_cuda(d): A1 = 0.31938153 A2 = -0.356563782 A3 = 1.781477937 A4 = -1.821255978 A5 = 1.330274429 RSQRT2PI = 0.39894228040143267793994605993438 K = 1.0 / (1.0 + 0.2316419 * math.fabs(d)) ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) * (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))))) if d > 0: ret_val = 1.0 - ret_val return ret_val

blockdim = 1024, 1griddim = int(math.ceil(float(OPT_N)/blockdim[0])), 1stream = cuda.stream()d_callResult = cuda.to_device(callResultNumbapro, stream)d_putResult = cuda.to_device(putResultNumbapro, stream)d_stockPrice = cuda.to_device(stockPrice, stream)d_optionStrike = cuda.to_device(optionStrike, stream)d_optionYears = cuda.to_device(optionYears, stream)for i in range(iterations): black_scholes_cuda[griddim, blockdim, stream]( d_callResult, d_putResult, d_stockPrice, d_optionStrike, d_optionYears, RISKFREE, VOLATILITY) d_callResult.to_host(stream) d_putResult.to_host(stream) stream.synchronize()

Thursday, March 21, 13

Page 18: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Black-Scholes: Results

core i7 GeForce GTX 560 Ti

About 9x faster on this GPU

About 70% CUDA-C speed

Thursday, March 21, 13

Page 19: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Example: Matrix [email protected](argtypes=[f4[:,:], f4[:,:], f4[:,:]])def cu_square_matrix_mul(A, B, C): sA = cuda.shared.array(shape=(tpb, tpb), dtype=f4) sB = cuda.shared.array(shape=(tpb, tpb), dtype=f4) tx = cuda.threadIdx.x ty = cuda.threadIdx.y bx = cuda.blockIdx.x by = cuda.blockIdx.y bw = cuda.blockDim.x bh = cuda.blockDim.y

x = tx + bx * bw y = ty + by * bh

acc = 0. for i in range(bpg): if x < n and y < n: sA[ty, tx] = A[y, tx + i * tpb] sB[ty, tx] = B[ty + i * tpb, x]

cuda.syncthreads()

if x < n and y < n: for j in range(tpb): acc += sA[ty, j] * sB[j, tx]

cuda.syncthreads()

if x < n and y < n: C[y, x] = acc

bpg = 50tpb = 32n = bpg * tpb

A = np.array(np.random.random((n, n)), dtype=np.float32)B = np.array(np.random.random((n, n)), dtype=np.float32)C = np.empty_like(A)

stream = cuda.stream()with stream.auto_synchronize(): dA = cuda.to_device(A, stream) dB = cuda.to_device(B, stream) dC = cuda.to_device(C, stream) cu_square_matrix_mul[(bpg, bpg), (tpb, tpb), stream](dA, dB, dC) dC.to_host(stream)

Thursday, March 21, 13

Page 20: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Old Performance Results

6x faster on the GPU.

GeForce GTX 560 Ticore i7

90% of CUDA-C result

Thursday, March 21, 13

Page 21: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Roadmap

•Adding cuBLAS, cuFFT, cuSparse support (cuRAND available)•Adding more CUDA support•Improving high-level API•Improving the output of the compiler to match code that CUDA-C is producing

Thursday, March 21, 13

Page 22: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

How to get it?

•Available in Anaconda Accelerate at www.continuum.io --- conda install accelerate

•Contact [email protected] for inquiries, site-licenses, training, and additional support.

•On wakari.io --- cloud based Python

Thursday, March 21, 13

Page 23: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

Introducing GPU taskqueue in Wakari

• Complete Python environment in the browser• Hosted on EC2 with fast access to Amazon S3• Private Cloud or internal server installation• Sandboxed Linux environment with SSH• Rich interactive Javascript plotting and Matplotlib• Python v2.6-3.3, NumPy, SciPy, Pandas• Share IPython Notebooks with others• One-click switching between versions of Python• Use multiple versions of libraries at the same time• Compilers & tools for installing advanced libraries• GPU and high-memory compute nodes

Free plans available!

With access to GPU task queue!

Thursday, March 21, 13

Page 24: Programming the GPU With Array-Oriented Syntax In Python | GTC … · 2013. 3. 22. · Programming the GPU with Array-Oriented Syntax in Python GTC 2013 March 21, 2013 ... NumPy +

TaskQueue --- Try your code on GPUs in the cloud

[~]taskqueue submit mandel/mandel_vectorize.py[~]taskqueue status[~]taskqueue fetch

•http://www.wakari.io•Get Free account•Accelerate available in cloud•CUDA 5•Upload or edit your files•Open Shell and submit jobs

Beta!Early release with only

command-line interface.

Thursday, March 21, 13