separable 2d convolution with polymorphic register files

Separable 2D Convolution with Polymorphic Register Files Cătălin Ciobanu Georgi Gaydadjiev Computer Engineering Laboratory Delft University of Technology The Netherlands and Department of Computer Science and Engineering Chalmers University of Technology Sweden

Upload: jalila

Post on 22-Feb-2016



TRANSCRIPT

Page 1: Separable 2D Convolution with Polymorphic Register Files

Separable 2D Convolution with Polymorphic Register Files

Cătălin Ciobanu, Georgi Gaydadjiev

Computer Engineering Laboratory, Delft University of Technology, The Netherlands

and

Department of Computer Science and Engineering, Chalmers University of Technology, Sweden

Page 2: Separable 2D Convolution with Polymorphic Register Files

SIMD register files evolution

• Earth Simulator 2 (ES2), 2009: NEC SX-9/E/1280M160 (ranked 145 in the Top 500, June 2012)
– Vector Unit: 72 registers, 256 elements each

• Intel Sandy Bridge, 2011: Advanced Vector Extensions (AVX)
– 16 registers, 256 bits each

• IBM Cell BE, 2005
– Cell SPU: 128 registers, 128 bits each
– Cell PPU Altivec: 32 registers, 128 bits each

Page 3: Separable 2D Convolution with Polymorphic Register Files

Choosing the parameters of the SIMD RF

• Design time: number of registers, their shapes and sizes
– Programmers are expected to optimize the code accordingly
– Next generation designs “may” break software compatibility

• Software is able to mask low-level architectural details
– In domains with efficiency constraints (e.g., HPC), hardware support is preferable

• Offering a single golden configuration is often impossible, as new workloads are certain to emerge

Page 4: Separable 2D Convolution with Polymorphic Register Files

Polymorphic Register File architecture

Example: matrix × row vector
vmul R9, R1, R2

Purpose:
• Adapt to data structures;
• Reduced number of opcodes, richer instruction semantics;
• Focus on functionality, not on complex data operations / transfers.

Advantages:
• Simplified vectorization, 1-to-1 mapping of registers and data;
• Changing the register number / sizes respects compatibility;
• Improved storage efficiency;
• Potential performance gains;
• Reduced binary code sizes.

A logical register is defined by its Base, Horizontal and Vertical Length, Data Type & Width.

[Figure: example PRF with a 14x8 storage size, holding logical registers R0 to R11 with available space for more registers, next to the RFORG (RF Organization) SPR table. The table carries the column labels R, D, BASE, SHAPE, HL, VL and DTYPE; the shapes that appear are RE, MD and SD, and the registers hold INT 64 data.]

Page 5: Separable 2D Convolution with Polymorphic Register Files

Convolution

Used for signal filtering
• digital signal processing
• image processing
• video processing
• …

Examples:
• Gaussian blur filters
– reduce the image noise and detail
• Sobel operator
– edge detection algorithms

Page 6: Separable 2D Convolution with Polymorphic Register Files

Convolution (continued)

• A “blending” between the input and the mask

• Each output is a weighted sum of its neighbors

• A mask defines the product coefficients
– used for all elements of the input array

• No data dependencies between outputs
– very suitable for SIMD implementations

Page 7: Separable 2D Convolution with Polymorphic Register Files

1D Convolution example

• Special case for border elements
– The mask is applied to elements outside the input
– Assumptions are required for these “halo” elements
– For example: consider all halo elements to be 0
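The border rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation from the talk: the function name `convolve_1d` and the sample data are assumptions, the mask is assumed to have odd length, and it is applied without flipping, which matches the weighted-sum description.

```python
def convolve_1d(signal, mask):
    """Weighted sum of each element's neighbours; halo elements count as 0."""
    radius = len(mask) // 2  # assumes an odd-length mask centred on the output
    output = []
    for i in range(len(signal)):
        acc = 0
        for k, weight in enumerate(mask):
            j = i + k - radius          # input index under mask element k
            if 0 <= j < len(signal):    # outside the input: halo, treated as 0
                acc += weight * signal[j]
        output.append(acc)
    return output


print(convolve_1d([1, 2, 3, 4, 5], [1, 2, 1]))  # [4, 8, 12, 16, 14]
```

Every output element depends only on the input, never on another output, which is why the loop over `i` can be spread across SIMD lanes.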

Page 8: Separable 2D Convolution with Polymorphic Register Files

2D Convolution

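$$
M = \begin{bmatrix} 4 & 6 & 8 \\ 9 & 11 & 13 \\ 14 & 16 & 18 \end{bmatrix}, \quad
I = \begin{bmatrix} \cdots & 15 & 17 & 19 \\ \cdots & 25 & 27 & 29 \\ \cdots & 35 & 37 & 39 \\ \cdots & \cdots & \cdots & \cdots \end{bmatrix}, \quad
O = \begin{bmatrix} \cdots & 1873 & 1204 \\ \cdots & 2997 & 1886 \\ \cdots & \cdots & \cdots \end{bmatrix}
$$

For the corner element of O:

$$
\begin{bmatrix} 17 & 19 \\ 27 & 29 \end{bmatrix} \circ \begin{bmatrix} 9 & 11 \\ 14 & 16 \end{bmatrix}
= \begin{bmatrix} 153 & 209 \\ 378 & 464 \end{bmatrix} \text{ (pointwise), which accumulates to } 1204
$$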
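The worked example can be checked with a short Python sketch. It assumes zero-valued halo elements and treats the visible 3 x 3 corner of I as the whole input, which is enough to reproduce the four visible outputs; `convolve_2d` is a hypothetical helper name, not something taken from the slides.

```python
def convolve_2d(image, mask):
    """2D weighted sum with an unflipped mask; halo elements count as 0."""
    rows, cols = len(image), len(image[0])
    r_rad, c_rad = len(mask) // 2, len(mask[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for u, mask_row in enumerate(mask):
                for v, weight in enumerate(mask_row):
                    y, x = i + u - r_rad, j + v - c_rad
                    if 0 <= y < rows and 0 <= x < cols:  # skip halo cells
                        acc += weight * image[y][x]
            out[i][j] = acc
    return out


M = [[4, 6, 8], [9, 11, 13], [14, 16, 18]]
I = [[15, 17, 19], [25, 27, 29], [35, 37, 39]]  # assumed: visible corner only
O = convolve_2d(I, M)
print(O[0][1], O[0][2])  # 1873 1204
print(O[1][1], O[1][2])  # 2997 1886
```

The corner value 1204 uses only four products because the remaining five mask positions fall on halo cells.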

Page 9: Separable 2D Convolution with Polymorphic Register Files

Separable 2D Convolution

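• Computed as two 1D convolutions
– row-wise 1D convolution followed by column-wise 1D convolution

• Fewer operations are required
• More suitable for blocked SIMD execution
– fewer data dependencies between blocks

$$
\begin{bmatrix} 1 & 0 & 1 \\ 2 & 0 & 2 \\ 1 & 0 & 1 \end{bmatrix}: \quad
\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \text{ followed by } \begin{bmatrix} 1 & 0 & 1 \end{bmatrix}
$$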
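A small sketch, assuming zero halo elements and hypothetical helper names, illustrates why the two 1D passes are equivalent to the full 2D mask and why they need fewer operations: a k x k mask costs k * k multiplications per output element, while the two 1D passes cost 2 * k.

```python
def conv1d(values, mask):
    """1D weighted sum with zero halo (mask not flipped)."""
    radius = len(mask) // 2
    n = len(values)
    return [sum(w * values[i + k - radius]
                for k, w in enumerate(mask)
                if 0 <= i + k - radius < n)
            for i in range(n)]


def conv2d(image, mask):
    """Full 2D weighted sum with zero halo, used as the reference."""
    rows, cols = len(image), len(image[0])
    rr, cr = len(mask) // 2, len(mask[0]) // 2
    return [[sum(mask[u][v] * image[i + u - rr][j + v - cr]
                 for u in range(len(mask)) for v in range(len(mask[0]))
                 if 0 <= i + u - rr < rows and 0 <= j + v - cr < cols)
             for j in range(cols)] for i in range(rows)]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def separable_conv2d(image, row_mask, col_mask):
    row_pass = [conv1d(row, row_mask) for row in image]          # row-wise
    col_pass = [conv1d(col, col_mask) for col in transpose(row_pass)]
    return transpose(col_pass)                                   # column-wise


row_mask, col_mask = [1, 0, 1], [1, 2, 1]
full_mask = [[c * r for r in row_mask] for c in col_mask]  # outer product
image = [[(3 * i + 5 * j) % 11 for j in range(6)] for i in range(4)]  # test data

assert separable_conv2d(image, row_mask, col_mask) == conv2d(image, full_mask)
k = len(row_mask)
print(full_mask)     # [[1, 0, 1], [2, 0, 2], [1, 0, 1]]
print(k * k, 2 * k)  # 9 6
```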

Page 10: Separable 2D Convolution with Polymorphic Register Files

Our Implementation

Separable 2D Convolution
• Execute two consecutive 1D convolutions
• Transpose the data while processing
• We only present the first 1D convolution step
• This step should be executed twice

Page 11: Separable 2D Convolution with Polymorphic Register Files

Conflict-free Transposition

• Column-wise convolution involves strided accesses
– may degrade performance due to bank conflicts

Solution:
• Vectorized transposition while processing data
– transpose the output of the 1st 1D convolution
– conflict-free using Polymorphic RFs
– avoids strided accesses for the 2nd 1D convolution
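The idea of running the same row-wise step twice can be sketched functionally in Python. This is only an illustration under assumptions: the helper names are hypothetical, halo elements are taken as 0, and plain lists do not model memory banks, so the conflict-free property of the PRF is not represented here.

```python
def conv1d(values, mask):
    """1D weighted sum with zero halo (mask not flipped)."""
    radius = len(mask) // 2
    return [sum(w * values[i + k - radius]
                for k, w in enumerate(mask)
                if 0 <= i + k - radius < len(values))
            for i in range(len(values))]


def conv_rows_transposed(image, mask):
    """Row-wise 1D convolution whose output is stored transposed."""
    rows, cols = len(image), len(image[0])
    out = [[0] * rows for _ in range(cols)]
    for i, row in enumerate(image):
        for j, value in enumerate(conv1d(row, mask)):
            out[j][i] = value   # element (i, j) is written to position (j, i)
    return out


def separable_conv2d(image, row_mask, col_mask):
    first = conv_rows_transposed(image, row_mask)   # cols x rows, transposed
    return conv_rows_transposed(first, col_mask)    # rows x cols again


image = [[1, 2, 3], [4, 5, 6]]
print(separable_conv2d(image, [1, 1, 1], [1, 1, 1]))
# [[12, 21, 16], [12, 21, 16]]
```

Because the first pass hands over transposed data, the second pass again reads whole rows, so the column-wise convolution never has to walk down a column.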

Page 12: Separable 2D Convolution with Polymorphic Register Files

Conflict-free Transposition

• Result effectively transposed
• Full LS bandwidth utilization
– only consecutive addresses
• R6 to R9
– loaded using 1D accesses
• R10 to R13
– stored using 1D accesses

[Figure: PRF layout showing registers R0 to R13 and the available space for more registers.]

Page 13: Separable 2D Convolution with Polymorphic Register Files

Vectorized Separable 2D Convolution

• We separate the algorithm into three parts
– first (left-most)
– main (middle)
– last (right-most)

• 2D vectorization
– Data is processed multiple rows at a time

• Our examples: blocks with 4 rows, 6 columns
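The three-part split can be illustrated with a blocked row-wise pass in Python. The sketch below is an assumption-laden stand-in, not the vector code from the talk: helper names are hypothetical, halo elements outside the input are 0, and a group of 4 rows is simply looped over instead of being processed by 2D vector instructions.

```python
def conv1d_reference(row, mask):
    """Unblocked 1D convolution with zero halo, used only for checking."""
    radius = len(mask) // 2
    return [sum(w * row[i + k - radius]
                for k, w in enumerate(mask)
                if 0 <= i + k - radius < len(row))
            for i in range(len(row))]


def conv1d_blocked(row, mask, block_cols=6):
    """Process one row block by block, fetching halo cells from neighbours."""
    radius = len(mask) // 2
    n = len(row)
    out = []
    for start in range(0, n, block_cols):
        end = min(start + block_cols, n)
        # First block: the left halo lies outside the input, so it is 0.
        left = [row[j] if j >= 0 else 0 for j in range(start - radius, start)]
        # Last block: the right halo lies outside the input, so it is 0.
        right = [row[j] if j < n else 0 for j in range(end, end + radius)]
        window = left + row[start:end] + right
        for i in range(end - start):
            out.append(sum(w * window[i + k] for k, w in enumerate(mask)))
    return out


def conv_rows_blocked(rows, mask, block_cols=6):
    """Stand-in for 2D vectorization: the same blocked pass on several rows."""
    return [conv1d_blocked(row, mask, block_cols) for row in rows]


rows = [[r * 10 + c for c in range(18)] for r in range(4)]  # 4 rows, 3 blocks
mask = [1, 2, 3, 2, 1]
assert conv_rows_blocked(rows, mask) == [conv1d_reference(r, mask) for r in rows]
```

With 18 columns and 6-column blocks, the first, second and third block correspond to the first, main and last parts: only the amount of real halo data on each side differs, not the arithmetic.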

Page 14: Separable 2D Convolution with Polymorphic Register Files

Three Separate Convolution Phases

[Figure: three PRF register layouts, labelled First, Main and Last, each with a different set of logical registers (R0 to R13, R0 to R15 and R0 to R17 appear across the layouts) and available space for more registers.]

Customize the PRF
– Runtime customization
– Only the logical registers are resized
– Instructions are not modified

Page 15: Separable 2D Convolution with Polymorphic Register Files

Register Assignments

• R1: input data
– Overlaps with R6-R9
• R2: the mask
• R3: convolution result
– Overlaps with R10-R17
• R0: left halo cells
• R4: halo + loaded data
• R5: right halo for next block

[Figure: PRF layout showing the registers listed above and the available space for more registers.]

Page 16: Separable 2D Convolution with Polymorphic Register Files

Throughput Comparison – NVIDIA C2050

NVIDIA Tesla C2050 GPU
• State-of-the-art Fermi architecture
• 448 SIMD lanes running at 1.15 GHz
– 14 Streaming Multiprocessors, 32 SIMD lanes each
• 3 GB off-chip GDDR5 @ 1.5 GHz
– 384-bit wide, 144 GB/s
• Power consumption of 247 Watts
• 64 KB L1 cache, 768 KB unified L2 cache

Page 17: Separable 2D Convolution with Polymorphic Register Files

Throughput Comparison – PRF

Polymorphic Register File (PRF)
• Same clock frequency as the C2050 GPU assumed
– Realistic based on our ASIC synthesis results
• Up to 256 SIMD lanes
• Private Local Store (LS), 11 cycles latency
• Multiple LS bandwidth scenarios
– from 16 bytes / cycle (the same as the Cell SPU) up to 256 bytes / cycle
• Blocked Separable 2D convolution implementation
– 32 x 32 elements block size
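For orientation, the LS bandwidth scenarios can be converted from bytes per cycle to GB/s. The sketch below assumes the 1.15 GHz lane clock quoted for the C2050 and 1 GB = 10^9 bytes; the function name is hypothetical. The Local Store and the GPU's off-chip GDDR5 (144 GB/s) sit at different levels of the memory hierarchy, so the figures are not a like-for-like comparison.

```python
CLOCK_HZ = 1.15e9  # C2050 lane clock, also assumed for the PRF


def ls_bandwidth_gb_per_s(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Local Store bandwidth in GB/s (1 GB = 1e9 bytes) at the given clock."""
    return bytes_per_cycle * clock_hz / 1e9


assert 14 * 32 == 448  # Streaming Multiprocessors x SIMD lanes each

for bytes_per_cycle in (16, 32, 64, 128, 256):
    print(f"{bytes_per_cycle:3d} B/cycle -> "
          f"{ls_bandwidth_gb_per_s(bytes_per_cycle):6.1f} GB/s")
```

Under these assumptions the scenarios span roughly 18.4 GB/s to 294.4 GB/s.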

Page 18: Separable 2D Convolution with Polymorphic Register Files

Constrained Local Store BW: 16 B / cycle

[Figure: two plots against Mask Size (3 x 3, 5 x 5, 9 x 9, 17 x 17, 33 x 33). The first shows throughput in Pixels / 1000 Cycles (axis ticks 4, 40, 400, 4000) for the NVIDIA C2050 and for PRFs with 1 to 256 lanes. The second shows the Relative Throughput Increase (%) (axis ticks 0 to 100) for 2 to 256 lanes.]

• Saturates for > 4 lanes
• Outperform GPU

Page 19: Separable 2D Convolution with Polymorphic Register Files

Local Store BW: 32B/cycle

[Figure: two plots against Mask Size (3 x 3, 5 x 5, 9 x 9, 17 x 17, 33 x 33). The first shows throughput in Pixels / 1000 Cycles (axis ticks 4, 40, 400, 4000) for the NVIDIA C2050 and for PRFs with 1 to 256 lanes. The second shows the Relative Throughput Increase (%) (axis ticks 0 to 100) for 2 to 256 lanes.]

• 8 lanes match GPU throughput

Page 20: Separable 2D Convolution with Polymorphic Register Files

Local Store BW: 64B / Cycle

[Figure: two plots against Mask Size (3 x 3, 5 x 5, 9 x 9, 17 x 17, 33 x 33). The first shows throughput in Pixels / 1000 Cycles (axis ticks 4, 40, 400, 4000) for the NVIDIA C2050 and for PRFs with 1 to 256 lanes. The second shows the Relative Throughput Increase (%) (axis ticks 0 to 100) for 2 to 256 lanes.]

• 2X than for 16B/Cycle
• 4 lanes match GPU throughput

Page 21: Separable 2D Convolution with Polymorphic Register Files

Local Store BW: 128 B / Cycle

[Figure: two plots against Mask Size (3 x 3, 5 x 5, 9 x 9, 17 x 17, 33 x 33). The first shows throughput in Pixels / 1000 Cycles (axis ticks 4, 40, 400, 4000) for the NVIDIA C2050 and for PRFs with 1 to 256 lanes. The second shows the Relative Throughput Increase (%) (axis ticks 0 to 100) for 2 to 256 lanes.]

• Improvement mostly for > 32 lanes

Page 22: Separable 2D Convolution with Polymorphic Register Files

Local Store BW: 256B / Cycle

[Figure: two plots against Mask Size (3 x 3, 5 x 5, 9 x 9, 17 x 17, 33 x 33). The first shows throughput in Pixels / 1000 Cycles (axis ticks 4, 40, 400, 4000) for the NVIDIA C2050 and for PRFs with 1 to 256 lanes. The second shows the Relative Throughput Increase (%) (axis ticks 0 to 100) for 2 to 256 lanes.]

Page 23: Separable 2D Convolution with Polymorphic Register Files

SIMD lanes and LS BW Efficiency Summary

SIMD lanes range providing at least 75% efficiency

LS BW (B / cycle) \ Mask Size | 3 × 3 | 5 × 5 | 9 × 9 | 17 × 17 | 33 × 33
16                            | 2     | 4     | 8     | 16      | 32
32                            | 4     | 8     | 8     | 16      | 32
64                            | 4     | 8     | 16    | 32      | 32
128, 256                      | 4     | 8     | 16    | 32      | 64

If the PRF is implemented in FPGA technology
• Dynamically adjust the number of vector lanes during runtime
• Switch off unused lanes to save power
• Customize LS BW for high performance or power savings

Page 24: Separable 2D Convolution with Polymorphic Register Files

Conclusions

• PRFs outperform the NVIDIA Tesla GPU for 2D Convolution with masks of 9 × 9 or larger
– even in bandwidth constrained systems

• Large mask sizes allow the efficient use of more PRF vector lanes

• For small mask sizes, LS bandwidth is the main bottleneck

• PRFs reduced the effort required to vectorize each Convolution execution phase
– Simplified to resizing the PRF registers on demand

Page 25: Separable 2D Convolution with Polymorphic Register Files

Thank you!

Questions?

25

Page 26: Separable 2D Convolution with Polymorphic Register Files

Unified assembly vector instructions

Unified opcodes: multiplication
• Matrix x Vector
vmul R3, R0, R2
• Vector x Vector (main diag.)
vmul R5, R1, R4
• Integer / Floating point, 8/16/32/64-bit

The micro-architecture will perform the compatibility checks and raise exceptions

[Figure: example PRF holding logical registers R0 to R5, with available space for more registers, next to the RFORG (RF Organization) SPR table with the column labels R, D, BASE, SHAPE, HL, VL and DTYPE. The shapes that appear are RE and MD, and the registers hold FLOAT 64 data.]
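The unified-opcode idea can be mimicked in software. The Python sketch below is purely illustrative and rests on several assumptions: registers are modelled as nested lists, a register with a single row or a single column is treated as a vector, Vector x Vector is taken to be a dot product, and the names `vmul`, `as_vector` and `IncompatibleRegisters` are hypothetical. It does not describe the actual PRF instruction semantics or exception mechanism.

```python
class IncompatibleRegisters(Exception):
    """Raised when the operand shapes do not allow the multiplication."""


def as_vector(register):
    """Return a 1D register (single row or single column) as a flat list."""
    if len(register) == 1:
        return list(register[0])
    if all(len(row) == 1 for row in register):
        return [row[0] for row in register]
    return None  # a genuine 2D register


def vmul(regs, dst, src_a, src_b):
    """One opcode; the operation is selected by the shapes of the operands."""
    a, b = regs[src_a], regs[src_b]
    vec_a, vec_b = as_vector(a), as_vector(b)
    if vec_b is None:
        raise IncompatibleRegisters("second operand must be a vector")
    if vec_a is None:                       # Matrix x Vector
        if len(a[0]) != len(vec_b):
            raise IncompatibleRegisters("matrix width != vector length")
        result = [[sum(x * y for x, y in zip(row, vec_b))] for row in a]
    else:                                   # Vector x Vector (dot product)
        if len(vec_a) != len(vec_b):
            raise IncompatibleRegisters("vector lengths differ")
        result = [[sum(x * y for x, y in zip(vec_a, vec_b))]]
    regs[dst] = result


regs = {
    "R0": [[1, 2], [3, 4]],   # 2 x 2 matrix
    "R1": [[1, 4]],           # its main diagonal, viewed as a vector
    "R2": [[1], [1]],
    "R4": [[2], [3]],
}
vmul(regs, "R3", "R0", "R2")
vmul(regs, "R5", "R1", "R4")
print(regs["R3"], regs["R5"])  # [[3], [7]] [[14]]
```

The same call handles both cases because the shape information lives with the registers rather than in the opcode, which is the point the slide makes about richer instruction semantics.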

Page 27: Separable 2D Convolution with Polymorphic Register Files

The bandwidth utilization problem

[Figure: two PRF access layouts involving registers R1, R3, R4, R10 and R11, with the labels “Poor memory bandwidth utilization”, “ReO scheme” and “Optimal”.]

Page 28: Separable 2D Convolution with Polymorphic Register Files

ASIC PRF implementation overview

• TSMC 90nm technology
• Synthesis tool: Synopsys Design Compiler Ultra F-2011.09-SP3
• Artisan memory compiler (1GHz, 256x64-bit dual-port SRAM as storage element)
• 64-bit data width
• Full crossbars as read and write shuffle blocks
• 2R/1W ports
• Four multi-lane configurations
– 8 / 16 / 32 / 64 lanes
• Three PRF sizes
– 32KB (64x64), 128KB (128x128), 512KB (256x256)
• Clock frequency: 500 MHz – 970 MHz
• Dynamic power: 300 mW – 8.7 W
• Leakage: 10 mW – 276 mW

• Customized configurations
– Up to 21% higher clock frequency
– Up to 39% combinational hardware area reduction
– Up to 10% reduction in total area
– Dynamic power reduced by up to 31%, leakage by nearly 24%

Page 29: Separable 2D Convolution with Polymorphic Register Files

Customized linear addressing functions

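The PRF contains N × M data elements, supporting up to p × q parallel vector lanes.

$$
A_{standard}(i, j) = \left\lfloor \frac{i}{p} \right\rfloor \cdot \frac{M}{q} + \left\lfloor \frac{j}{q} \right\rfloor
$$

$$
A_{customized}(i, j) = \left( \left\lfloor \frac{i}{p} \right\rfloor + c_i \right) \cdot \frac{M}{q} + \left\lfloor \frac{j}{q} \right\rfloor + c_j
$$

It is possible to determine the linear address by only examining the upper left corner of the block being accessed, for each memory module (k, l).

The c_i and c_j coefficients:
• depend on the MAFs and on the shape/size of the accesses, and are different for each of the selected schemes
• require the inverse MAF: m^{-1}(i, j, k, l)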

Page 30: Separable 2D Convolution with Polymorphic Register Files

Multi-view parallel access schemes

Scheme | Conflict-free parallel accesses | MAF row (m_row) | MAF column (m_col)
Rectangle Only (ReO) | p × q rectangle | i % p | j % q
Rectangle Row (ReRo); (p, q+1) and (p, q-1) co-prime | p × q rectangle, p · q row, p · q main and secondary diagonals | (i + ⌊j/q⌋) % p | j % q
Rectangle Column (ReCo); (p+1, q) and (p-1, q) co-prime | p × q rectangle, p · q column, p · q main and secondary diagonals | i % p | (⌊i/p⌋ + j) % q
Row Column (RoCo) | p · q row, p · q column, aligned (i % p = 0 or j % q = 0) p × q rectangle | (i + ⌊j/q⌋) % p | (⌊i/p⌋ + j) % q
Rect Transposed Rect (ReTr) | p × q and q × p rectangles | i % p | (i - i % p + j) % q, for p < q and q % p = 0

• Conflict-free parallel access for at least two rectangular shapes
• Relaxes the p × q rectangle limitation of the ReO scheme
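A brute-force check in Python can make the difference between the ReO and ReRo schemes concrete. The sketch assumes the module assignment functions m_row = i % p, m_col = j % q for ReO and m_row = (i + ⌊j/q⌋) % p, m_col = j % q for ReRo, together with the intra-module address ⌊i/p⌋ · (M/q) + ⌊j/q⌋; the helper names and the chosen sizes (N = 8, M = 16, p = 2, q = 4) are illustrative assumptions.

```python
def maf_reo(i, j, p, q):
    """Rectangle Only: module (row, column) holding element (i, j)."""
    return (i % p, j % q)


def maf_rero(i, j, p, q):
    """Rectangle Row: the module row is skewed by the block column j // q."""
    return ((i + j // q) % p, j % q)


def standard_address(i, j, p, q, M):
    """Intra-module address, assuming q divides M."""
    return (i // p) * (M // q) + j // q


def rectangle(i, j, p, q):
    return [(i + a, j + b) for a in range(p) for b in range(q)]


def row(i, j, p, q):
    return [(i, j + k) for k in range(p * q)]


def conflict_free(maf, cells, p, q):
    """True when every accessed element sits in a different memory module."""
    return len({maf(i, j, p, q) for i, j in cells}) == len(cells)


def all_conflict_free(maf, shape, N, M, p, q):
    """Check the access shape at every position where it fits in the array."""
    height, width = (p, q) if shape is rectangle else (1, p * q)
    return all(conflict_free(maf, shape(i, j, p, q), p, q)
               for i in range(N - height + 1)
               for j in range(M - width + 1))


def placement_is_unique(maf, N, M, p, q):
    """Every element must get its own (module, address) slot."""
    slots = {(maf(i, j, p, q), standard_address(i, j, p, q, M))
             for i in range(N) for j in range(M)}
    return len(slots) == N * M


N, M, p, q = 8, 16, 2, 4
print(all_conflict_free(maf_reo, rectangle, N, M, p, q))   # True
print(all_conflict_free(maf_reo, row, N, M, p, q))         # False
print(all_conflict_free(maf_rero, rectangle, N, M, p, q))  # True
print(all_conflict_free(maf_rero, row, N, M, p, q))        # True
```

With ReO, a row of p · q elements only ever touches q distinct modules, so the access has to be serialized; the skew in ReRo spreads the same row over all p · q modules.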

Page 31: Separable 2D Convolution with Polymorphic Register Files

Implementation diagram

[Figure: block diagram with the inputs i, j, Access Type, Address and Data; an AGU that computes i + α, j + β; the blocks A(i + α, j + β) and m(i + α, j + β); the Read / Write Address Shuffle and the Read / Write Data Shuffle; the memory modules M00 to M13; a Read Delay block; and, for the customized case, ci, cj (i, j, Access Type) & A(ci, cj).]

• Data is distributed among p x q memory modules
• The AGU computes the addresses of all involved elements
• The generated addresses are fed to the Module Assignment Function (MAF), which controls the read and write shuffles

Standard case: addresses need to be reordered according to the MAF before being sent to the memory modules

Customized case: eliminates the need to shuffle the read and write intra-module addresses. The shaded blocks are replaced by the ci, cj coefficients as well as the customized addressing function