Separable 2D Convolution with Polymorphic Register Files
Cătălin Ciobanu, Georgi Gaydadjiev
Computer Engineering Laboratory, Delft University of Technology, The Netherlands
and
Department of Computer Science and Engineering, Chalmers University of Technology, Sweden
SIMD register files evolution
• Earth Simulator 2 (ES2), 2009: NEC SX-9/E/1280M160 (ranked 145 in the Top 500, June 2012); Vector Unit: 72 registers of 256 elements each
• Intel Sandy Bridge, 2011: Advanced Vector Extensions (AVX), 16 registers of 256 bits each
• IBM Cell BE, 2005: Cell SPU, 128 registers of 128 bits each; Cell PPU AltiVec, 32 registers of 128 bits each
Choosing the parameters of the SIMD RF
• Design time: the number of registers and their shapes/sizes
– Programmers are expected to optimize the code accordingly
– Next-generation designs "may" break software compatibility
• Software is able to mask low-level architectural details
– In domains with efficiency constraints (e.g., HPC), hardware support is preferable
• Offering a single golden configuration is often impossible, as new workloads will surely emerge
Polymorphic Register File architecture
[Figure: an example PRF with a 14 x 8 element storage; logical registers R0 to R11 of different shapes and sizes are defined over it, and vmul R9, R1, R2 computes a matrix x row-vector product into R9]

Purpose:
• Adapt to the data structures;
• Reduced number of opcodes, richer instruction semantics;
• Focus on functionality, not on complex data operations / transfers.

Advantages:
• Simplified vectorization, one-to-one mapping of registers and data;
• Changing the register number / sizes preserves compatibility;
• Improved storage efficiency;
• Potential performance gains;
• Reduced binary code sizes.

The register file organization is held in the RFORG Special Purpose Register: each logical register is described by its Base, Horizontal and Vertical Length, Shape, Data Type and Width (64-bit integers in this example).
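The idea of logical registers defined at runtime over one shared 2D storage can be sketched in a few lines of Python. This is a toy model, not the actual design: the `define`/`read`/`write` API and the register names are invented for illustration.

```python
# Toy model of a Polymorphic Register File: one fixed 2D storage array,
# logical registers defined at runtime by a base position and a shape.
class PolymorphicRF:
    def __init__(self, rows, cols):
        self.store = [[0] * cols for _ in range(rows)]
        self.regs = {}  # name -> (base_row, base_col, vert_len, horiz_len)

    def define(self, name, base_row, base_col, vl, hl):
        """A register is just an RFORG-style entry: base + two lengths."""
        self.regs[name] = (base_row, base_col, vl, hl)

    def read(self, name):
        r, c, vl, hl = self.regs[name]
        return [row[c:c + hl] for row in self.store[r:r + vl]]

    def write(self, name, data):
        r, c, vl, hl = self.regs[name]
        for i in range(vl):
            for j in range(hl):
                self.store[r + i][c + j] = data[i][j]

prf = PolymorphicRF(14, 8)      # 14 x 8 storage, as in the example slide
prf.define("R1", 0, 0, 4, 4)    # a 4 x 4 matrix register
prf.define("R2", 0, 4, 4, 1)    # a 4-element column-vector register
# Resizing just rewrites the RFORG-style entry; the storage layout and
# the instruction encoding are untouched.
prf.define("R1", 0, 0, 2, 8)
```

Note that registers may overlap in storage, which the convolution implementation later exploits deliberately (e.g. R1 overlapping R6 to R9).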
Convolution
Used for signal filtering:
• digital signal processing
• image processing
• video processing
• ...
Examples:
• Gaussian blur filters
– reduce the image noise and detail
• Sobel operator
– edge detection algorithms
Convolution (continued)
• A "blending" between the input and the mask
• Each output element is a weighted sum of its neighbors
• A mask defines the product coefficients, used for all elements of the input array
• No data dependencies, so very suitable for SIMD implementations
1D Convolution example
• Special case for border elements
– The mask is applied to elements outside the input
– Assumptions are required for these "halo" elements
– For example: consider all halo elements to be 0
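The 1D case can be written directly as a point-wise weighted sum, treating out-of-range halo elements as 0 (a plain Python sketch; the mask is applied unflipped, as in the slides' worked examples, which coincides with true convolution for symmetric masks):

```python
def conv1d(signal, mask):
    """1D convolution: each output is a weighted sum of the element's
    neighborhood; out-of-range ("halo") elements are assumed to be 0."""
    r = len(mask) // 2              # mask radius
    out = []
    for i in range(len(signal)):
        acc = 0
        for k, w in enumerate(mask):
            j = i + k - r           # neighbor index in the input
            if 0 <= j < len(signal):    # halo elements contribute 0
                acc += w * signal[j]
        out.append(acc)
    return out

conv1d([1, 2, 3, 4], [1, 2, 1])     # border outputs use the zero halo
```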
2D Convolution
Worked example (values as recoverable from the slide):

Mask M:
 4  6  8
 9 11 13
14 16 18

Input I:
15 17 19
25 27 29
35 37 39

Output O (excerpt):
... 1873 1204
... 2997 1886

For the border element 1204, only the overlapping mask coefficients contribute: the point-wise products 9 · 17, 11 · 19, 14 · 27 and 16 · 29, i.e. 153, 209, 378 and 464, accumulate to 1204.
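The worked example can be checked with a direct (non-separable) implementation. This sketch accumulates point-wise products with a zero halo, as the slides do; the 3 x 3 values of M and I are the ones recoverable from the slide:

```python
def conv2d(inp, mask):
    """2D convolution as a point-wise weighted sum; halo elements are 0."""
    h, w = len(inp), len(inp[0])
    r = len(mask) // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for a, mrow in enumerate(mask):
                for b, coeff in enumerate(mrow):
                    ii, jj = i + a - r, j + b - r
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += coeff * inp[ii][jj]
            out[i][j] = acc
    return out

M = [[4, 6, 8], [9, 11, 13], [14, 16, 18]]
I = [[15, 17, 19], [25, 27, 29], [35, 37, 39]]
# Border element: only the overlapping coefficients contribute, e.g.
# 9*17 + 11*19 + 14*27 + 16*29 = 153 + 209 + 378 + 464 = 1204
out = conv2d(I, M)
```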
Separable 2D Convolution
• Computed as two 1D convolutions
– a row-wise 1D convolution followed by a column-wise 1D convolution
• Fewer operations are required
• More suitable for blocked SIMD execution
– fewer data dependencies between blocks
Example: the mask
1 0 1
2 0 2
1 0 1
is separable into the 1D masks [1 2 1] (column-wise) followed by [1 0 1] (row-wise).
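The separability of the example mask can be verified numerically: the 2D mask is the outer product of a column and a row vector, and row-wise followed by column-wise 1D filtering reproduces the full 2D result (sketch, assuming the same zero halo as before):

```python
def filt_rows(inp, mask):
    """Apply a 1D mask along each row, with a zero halo."""
    r = len(mask) // 2
    return [[sum(mask[k] * row[j + k - r]
                 for k in range(len(mask))
                 if 0 <= j + k - r < len(row))
             for j in range(len(row))] for row in inp]

def filt_cols(inp, mask):
    """Apply a 1D mask along each column via a double transpose."""
    t = [list(col) for col in zip(*inp)]
    return [list(col) for col in zip(*filt_rows(t, mask))]

col, row = [1, 2, 1], [1, 0, 1]
# Outer product reconstructs the 2D mask [[1,0,1],[2,0,2],[1,0,1]]
mask2d = [[c * r for r in row] for c in col]

I = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
two_pass = filt_cols(filt_rows(I, row), col)
```

A 3 x 3 mask costs 9 multiply-accumulates per output element; the two 1D passes cost 3 + 3, which is where the "fewer operations" claim comes from.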
Our Implementation
Separable 2D Convolution:
• Execute two consecutive 1D convolutions
• Transpose the data while processing
• We only present the first 1D convolution step; it should be executed twice
Conflict-free Transposition
• Column-wise convolution involves strided accesses
– may degrade performance due to bank conflicts
Solution:
• Vectorized transposition while processing the data
– transpose the output of the 1st 1D convolution
– conflict-free using Polymorphic RFs
– avoids strided accesses in the 2nd 1D convolution
Conflict-free Transposition
[Figure: PRF register layout used for the conflict-free transposition]
• Result is effectively transposed
• Full LS bandwidth utilization
– only consecutive addresses
• R6 to R9 are loaded using 1D accesses
• R10 to R13 are stored using 1D accesses
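The effect of transposing while processing can be sketched in Python: the row-wise pass writes its output transposed, so the column-wise pass can run the very same row-wise routine and touch only consecutive (unit-stride) addresses. This is an illustrative model; the real design does the transposition with conflict-free PRF block stores, not explicit loops.

```python
def conv1d_rows_transposed(inp, mask):
    """Row-wise 1D convolution whose result is stored transposed:
    out[j][i] holds the filtered value of inp[i][j], so the next pass
    can again iterate row-wise (no strided accesses). Zero halo."""
    h, w = len(inp), len(inp[0])
    r = len(mask) // 2
    out = [[0] * h for _ in range(w)]        # transposed shape
    for i in range(h):
        for j in range(w):
            acc = 0
            for k, coeff in enumerate(mask):
                jj = j + k - r
                if 0 <= jj < w:              # halo elements are 0
                    acc += coeff * inp[i][jj]
            out[j][i] = acc                  # transposed store
    return out

def separable2d(inp, row_mask, col_mask):
    """Separable 2D convolution = the same row-wise routine, twice."""
    tmp = conv1d_rows_transposed(inp, row_mask)
    # The two transposes cancel: the result is in the original layout.
    return conv1d_rows_transposed(tmp, col_mask)

I = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```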
Vectorized Separable 2D Convolution
• We separate the algorithm into three parts
– first (left-most)
– main (middle)
– last (right-most)
• 2D vectorization
– Data is processed multiple rows at a time
• Our examples: blocks with 4 rows and 6 columns
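The three-phase split can be illustrated on the row-wise pass: the first block has a zero left halo, main blocks take their left halo from the previous block, and the last block has a zero right halo. A Python sketch (the block width and the per-phase register assignments of the real design differ):

```python
def rowconv_blocked(row, mask, block):
    """Row-wise 1D convolution processed in column blocks of width
    `block`. The zero padding provides the halo of the first and last
    blocks; main blocks read their halo from neighboring elements."""
    r = len(mask) // 2
    padded = [0] * r + list(row) + [0] * r   # halo cells around the row
    out = []
    for start in range(0, len(row), block):  # first / main / last blocks
        for j in range(start, min(start + block, len(row))):
            # padded[j : j + len(mask)] is the neighborhood of row[j];
            # for a main block it straddles the previous block's data.
            out.append(sum(w * padded[j + k] for k, w in enumerate(mask)))
    return out
```

The blocked result is identical to the unblocked one for any block width; only the halo handling differs per phase, which is why resizing a few PRF registers suffices.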
Three Separate Convolution Phases
[Figure: PRF register layouts for the first, main and last phases; the main phase adds registers R14 and R15, the last phase adds R16 and R17]

Customize the PRF:
– Runtime customization
– Only the logical registers are resized
– The instructions are not modified
Register Assignments
[Figure: PRF register layout for one convolution block]
• R1: input data
– overlaps with R6 to R9
• R2: the mask
• R3: convolution result
– overlaps with R10 to R17
• R0: left halo cells
• R4: halo + loaded data
• R5: right halo for the next block
Throughput Comparison – NVIDIA C2050
NVIDIA Tesla C2050 GPU:
• State-of-the-art Fermi architecture
• 448 SIMD lanes running at 1.15 GHz
– 14 Streaming Multiprocessors with 32 SIMD lanes each
• 3 GB off-chip GDDR5 @ 1.5 GHz
– 384-bit wide, 144 GB/s
• Power consumption of 247 Watts
• 64 KB L1 cache, 768 KB unified L2 cache
Throughput Comparison – PRF
Polymorphic Register File (PRF):
• The same clock frequency as the C2050 GPU is assumed
– realistic, based on our ASIC synthesis results
• Up to 256 SIMD lanes
• Private Local Store (LS), 11 cycles latency
• Multiple LS bandwidth scenarios
– 16 bytes/cycle (the same as the Cell SPU) up to 256 bytes/cycle
• Blocked separable 2D convolution implementation
– 32 × 32 element block size
Constrained Local Store BW: 16 B / cycle
[Charts: absolute throughput (pixels / 1000 cycles; NVIDIA C2050 vs. PRFs with 1 to 256 lanes) and relative throughput increase (%) as a function of mask size, 3 × 3 to 33 × 33]

• Throughput saturates for more than 4 lanes
• The PRF outperforms the GPU
Local Store BW: 32B/cycle
[Charts: the same throughput and relative-increase plots at 32 B/cycle LS bandwidth]

• 8 lanes match the GPU throughput
Local Store BW: 64B / Cycle
[Charts: the same throughput and relative-increase plots at 64 B/cycle LS bandwidth]

• 2× the throughput of the 16 B/cycle case
• 4 lanes match the GPU throughput
Local Store BW: 128 B / Cycle
[Charts: the same throughput and relative-increase plots at 128 B/cycle LS bandwidth]

• Improvement mostly for more than 32 lanes
Local Store BW: 256B / Cycle
[Charts: the same throughput and relative-increase plots at 256 B/cycle LS bandwidth]
SIMD lanes and LS BW Efficiency Summary
SIMD lane counts providing at least 75% efficiency:

LS BW \ Mask Size | 3 × 3 | 5 × 5 | 9 × 9 | 17 × 17 | 33 × 33
16                |   2   |   4   |   8   |   16    |   32
32                |   4   |   8   |   8   |   16    |   32
64                |   4   |   8   |  16   |   32    |   32
128, 256          |   4   |   8   |  16   |   32    |   64

If the PRF is implemented in FPGA technology:
• Dynamically adjust the number of vector lanes at runtime
• Switch off unused lanes to save power
• Customize the LS bandwidth for high performance or power savings
Conclusions
• PRFs outperform the NVIDIA Tesla GPU for 2D convolution with masks of 9 × 9 or larger
– even in bandwidth-constrained systems
• Large mask sizes allow the efficient use of more PRF vector lanes
• For small mask sizes, the LS bandwidth is the main bottleneck
• PRFs reduce the effort required to vectorize each convolution execution phase
– simplified to resizing the PRF registers on demand
Thank you!
Questions?
Unified assembly vector instructions
Unified opcodes, e.g. multiplication:
• Matrix × vector: vmul R3, R0, R2
• Vector × vector (main diagonal): vmul R5, R1, R4
• Integer / floating point, 8/16/32/64-bit
The micro-architecture performs the compatibility checks and raises exceptions.
[Figure: PRF example for the unified multiplication opcodes; the RFORG entries describe registers R0 to R5, all FLOAT 64, with rectangular (RE) and main-diagonal (MD) shapes, R5 being a scalar]
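The compatibility checks the micro-architecture performs can be sketched in software: given each operand's RFORG shape and data type, a vmul is legal only when the shapes conform. This is purely illustrative; the shape rules below are plausible assumptions, and the real checks happen in hardware, raising an exception on mismatch.

```python
def check_vmul(dst, src1, src2):
    """Illustrative shape/type check for vmul. Each operand is a tuple
    (vertical_length, horizontal_length, dtype), as an RFORG entry
    would describe it. Raises on incompatible operands."""
    (dv, dh, dt), (av, ah, at), (bv, bh, bt) = dst, src1, src2
    if not (dt == at == bt):
        raise TypeError("operand data types differ")
    if av > 1 and bv == 1:                 # matrix x row vector
        if ah != bh:
            raise ValueError("matrix columns != vector length")
        if (dv, dh) not in {(av, 1), (1, av)}:
            raise ValueError("destination does not fit the result vector")
    elif (av, ah) == (bv, bh):             # element-wise vector x vector
        if (dv, dh) != (av, ah):
            raise ValueError("destination shape mismatch")
    else:
        raise ValueError("incompatible operand shapes")
    return True

# 4x4 matrix times a 4-element vector into a 4-element destination: legal.
check_vmul((4, 1, "FLOAT64"), (4, 4, "FLOAT64"), (1, 4, "FLOAT64"))
```

The payoff of such checks is the slide's point: one opcode serves every operand shape, and malformed combinations are caught at run time rather than encoded away.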
The bandwidth utilization problem
[Diagram: element-to-module mappings under the ReO scheme versus an optimal assignment]

The ReO scheme suffers from poor memory bandwidth utilization compared to the optimal case.
ASIC PRF implementation overview
• TSMC 90 nm technology
• Synthesis tool: Synopsys Design Compiler Ultra F-2011.09-SP3
• Artisan memory compiler (1 GHz, 256 × 64-bit dual-port SRAM as the storage element)
• 64-bit data width
• Full crossbars as the read and write shuffle blocks
• 2R/1W ports
• Four multi-lane configurations: 8 / 16 / 32 / 64 lanes
• Three PRF sizes: 32 KB (64 × 64), 128 KB (128 × 128), 512 KB (256 × 256)
• Clock frequency: 500 MHz to 970 MHz
• Dynamic power: 300 mW to 8.7 W; leakage: 10 mW to 276 mW
• Customized configurations:
– up to 21% higher clock frequency
– up to 39% combinational hardware area reduction
– up to 10% reduction in total area
– dynamic power reduced by up to 31%, leakage by nearly 24%
Customized linear addressing functions
A_standard(i, j) = ⌊i/p⌋ · (M/q) + ⌊j/q⌋

A_customized(i, j) = (⌊i/p⌋ + c_i) · (M/q) + ⌊j/q⌋ + c_j

It is possible to determine the linear address by examining only the upper-left corner of the block being accessed, for each memory module (k, l). The c_i and c_j coefficients depend on the MAFs and on the shape/size of the accesses, and differ for each of the selected schemes. The inverse MAF m⁻¹(i, j, k, l) is required.

The PRF contains N × M data elements and supports up to p × q parallel vector lanes.
Multi-view parallel access schemes
Each scheme defines a Module Assignment Function with a row component (m_row) and a column component (m_col); the conflict-free parallel accesses are:

Scheme                       | Conflict-free parallel accesses
ReO (Rectangle Only)         | p × q rectangle
ReRo (Rectangle Row)         | p × q rectangle; p · q row; main and secondary diagonals ((p, q + 1) and (p, q − 1) co-prime)
ReCo (Rectangle Column)      | p × q rectangle; p · q column; main and secondary diagonals ((p + 1, q) and (p − 1, q) co-prime)
RoCo (Row Column)            | p · q row; p · q column; aligned p × q rectangle (i % p = 0 or j % q = 0)
ReTr (Rect Transposed Rect)  | p × q and q × p rectangles (p % q = 0 or q % p = 0)

• Conflict-free parallel access for at least two rectangular shapes
• Relaxes the p × q rectangle limitation of the ReO scheme
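The kind of guarantee the table describes is easy to test in software. Below is a row-skewing module assignment in the spirit of the ReRo scheme; the exact MAF formulas did not survive in the transcript, so this particular formula is an assumption for illustration, not the paper's. The check confirms that a p × q rectangle, a p · q row segment, and a p · q main-diagonal segment each touch p · q distinct modules.

```python
# A row-skewing MAF in the spirit of ReRo (illustrative formula, not
# necessarily the paper's): element (i, j) maps to module
# ((i + j // q) % p, j % q).
def maf_rero(i, j, p, q):
    return ((i + j // q) % p, j % q)

def conflict_free(coords, p, q):
    """True iff all accessed elements land in distinct memory modules."""
    modules = {maf_rero(i, j, p, q) for i, j in coords}
    return len(modules) == len(coords)

p, q = 2, 4   # p is co-prime with both q - 1 and q + 1, as ReRo requires
rect = [(1 + a, 2 + b) for a in range(p) for b in range(q)]  # unaligned 2x4
row = [(3, 5 + t) for t in range(p * q)]                     # row segment
diag = [(t, t) for t in range(p * q)]                        # main diagonal
```

A naive unskewed mapping (i % p, j % q) would send a whole row segment into only q distinct modules; the skew term i + j // q is what spreads row accesses across all p module rows.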
Implementation diagram
[Diagram: p × q memory modules (M00 to M13), read/write address and data shuffles, the AGU computing i + α and j + β, and the MAF m(i + α, j + β)]

• Data is distributed among the p × q memory modules
• The AGU computes the addresses of all involved elements
• The generated addresses are fed to the Module Assignment Function (MAF), which controls the read and write shuffles

Standard case: the addresses need to be reordered according to the MAF before being sent to the memory modules.
Customized case: eliminates the need to shuffle the read and write intra-module addresses; the shuffle blocks are replaced by the c_i and c_j coefficients together with the customized addressing function.