Communication-Minimizing 2D Convolution in GPU Registers

Posted on 24-Feb-2016



Communication-Minimizing 2D Convolution in GPU Registers

Forrest N. Iandola, David Sheffield, Michael Anderson, P. Mangpo Phothilimthana, Kurt Keutzer

University of California, Berkeley

forresti@eecs.berkeley.edu


Overview

• Convolution is a recurring computational pattern in a broad range of computer vision applications

• Memory communication is the bottleneck for convolution on modern GPUs

• How to minimize memory communication overhead in convolution
– Texture cache
– Loop blocking

• Up to 4.5x speedup over existing GPU implementations from NVIDIA, OpenCV, and others


Why focus on convolution?

• The Berkeley ParLab project identified 15 recurring computational patterns in computer vision


• Small filters (2x2 – 7x7)
• Feature extraction
• Sliding-window object detection

• If we want fast computer vision, we need fast convolution

[Figure: the 15 computer vision patterns, drawn from the CVPR 2007–2011 object recognition track]


What limits the performance of convolution?

• The roofline model [1] divides a program’s execution time into two parts:
– Computational cost (GFLOP/s)
– Communication cost (GB/s): memory traffic, I/O, etc.

• No program can outperform the hardware bound on computation or communication

[1] S. Williams, A. Waterman, D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.


What limits the performance of convolution?

[Figure: roofline model of computational performance, a fast-to-slow performance axis split into memory-bounded and computation-bounded regions]


What limits the performance of convolution?

• Convolution on NVIDIA GPUs:
– Communication between the GPU’s off-chip DRAM and on-chip caches is the bottleneck
– This doesn’t include communication between the CPU and GPU, though this can also be an issue
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.


Exploiting the GPU Memory Architecture

[Figure: NVIDIA GTX680 memory hierarchy. GPU global memory (DRAM) feeds the L2 cache and, per multiprocessor, the texture cache, L1 cache / shared memory, and registers serving the threads. Labeled bandwidths: 893 GB/s and 123 GB/s on-chip, 129 Gtexels/s through the texture cache, and 8 GB/s between CPU DRAM and the GPU.]

Optimization 1: Use the Texture Cache


Data Reuse with Loop Blocking

Typical implementation: no data reuse at the register level. Each thread reads 9 input pixels to produce 1 output pixel.


Data Reuse with Loop Blocking

Our approach: reuse data by doing more work per thread

Optimization 2: Block the image in registers

[Figure: the typical implementation reads 9 input pixels per 1 output pixel; the blocked version reads 16 input pixels for 4 output pixels, i.e. 4 inputs per output]

Comparison with Related Work

[Figure series: inverse roofline model on the NVIDIA GTX680 (Kepler), comparing existing implementations with ours using the texture cache and blocking; up to a 4.5x speedup]


Are we done?

• Are we done optimizing memory communication?
• I think so. We achieved the memory bandwidth bound for small filters.
• Future work: optimize computation some more!


Conclusions

• If we want fast computer vision, we need fast convolution.

• If we want fast convolution on GPUs, we need to optimize memory communication.

• Up to 4.5x faster than existing GPU languages and libraries

• Download our code! https://github.com/forresti/convolution
– Use/modify it for your language/library/application

