Design Space Exploration for Field-Programmable Compressor Trees
Seyed Hosein Attarzadeh Niaki (1), Alessandro Cevrero (2), Philip Brisk (3), Chrysostomos Nicopoulos (3), Frank K. Gurkaynak (4), Yusuf Leblebici (2), Paolo Ienne (3)
(1) Royal Institute of Technology, School of Information and Communication Technology, Stockholm, Sweden
(2) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Engineering, Lausanne, Switzerland
(3) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland
(4) Swiss Federal Institute of Technology, Zurich, Microelectronics Design Center, Zurich, Switzerland
Project Overview
• Goal
  – Accelerate multi-input addition on FPGAs
    • H.264 motion estimation
    • 3G wireless base station channel cards
    • FIR filters
    • Exposed via systematic dataflow transformations [Verma et al., TCAD 2008]
  – Field Programmable Compressor Tree (FPCT)
    • [Cevrero et al., FPGA 2008]
    • More flexibility than DSP blocks
      – Can benefit from dataflow transformations
      – DSP blocks cannot
    • Better performance than LUT-based logic
Dataflow Transformations: Example
[Figure: ADPCM kernel computing vpdiff from step and delta. In the original dataflow graph, shift (>>), mask (&), compare (= 0), and select (SEL) operations are interleaved with a chain of additions (+). In the transformed graph, the selected terms feed a single compressor tree (∑) followed by one addition.]
Contribution
• Design Space Exploration
  – Tune the design of an FPCT to match the needs of a representative set of arithmetically intensive benchmarks
Outline
• Arithmetic Tutorial: Compressor Trees
• Field Programmable Compressor Tree
• Design Space Exploration
• Results
• Conclusion
Compressor Trees
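[Figure: three uses of a compressor tree. Multi-input adder; parallel multiplier, in which a Partial Product Generator (PPG) produces mn bits and the result has m+n bits; multiply-accumulate. In each case the compressor tree produces a sum (S) and a carry (C) output, which a final carry-propagate adder (CPA) adds.]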
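The idea of reducing many operands to a sum and a carry word before a single carry-propagate addition can be sketched in software. This is a minimal illustration built from bitwise 3:2 compressors on non-negative Python integers; the function names are hypothetical, and it does not model the FPCT hardware or the counters used in the slides.

```python
def compress_3_2(a, b, c):
    """Bitwise 3:2 compressor: returns (s, cy) such that a + b + c == s + cy."""
    s = a ^ b ^ c                               # per-column sum bit
    cy = ((a & b) | (a & c) | (b & c)) << 1     # per-column carry, one rank higher
    return s, cy


def compressor_tree(operands):
    """Reduce any number of non-negative integers to two (sum, carry) words."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        # Compress full groups of three operands into two words each.
        for i in range(0, len(ops) - 2, 3):
            s, cy = compress_3_2(ops[i], ops[i + 1], ops[i + 2])
            nxt += [s, cy]
        # Pass the one or two leftover operands through unchanged.
        nxt += ops[len(ops) - len(ops) % 3:]
        ops = nxt
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]


def multi_input_add(operands):
    """Compressor tree followed by one carry-propagate addition (the CPA)."""
    s, cy = compressor_tree(operands)
    return s + cy
```

Only the last step, `s + cy`, propagates carries across the full word width; every earlier level works on each bit column independently.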
Arithmetic on FPGAs
• DSP blocks
  – Fixed-bitwidth multiply/MAC
    • FPGA logic can be faster when there are bitwidth mismatches [Kuon and Rose, TCAD 2007]
  – Cannot bypass PPG
    • No multi-input addition
    • Cannot exploit dataflow transformations that expose large compressor trees [Verma et al., TCAD 2008]
• FPGA logic
  – 3-ary addition
    • LUTs + carry chains
    • Altera Stratix II-IV, Xilinx Virtex-5
  – Compressor tree synthesis
    • [Parandeh-Afshar et al., ASPDAC 2008, DATE 2008]
    • Faster than 3-ary adder trees
    • Does not use carry chains
Field Programmable Compressor Tree
• Programmable core integrated into an FPGA
  – Supports multi-input addition
    • Unlike DSP blocks
    • Can exploit dataflow transformations [Verma et al., TCAD 2008]
  – Programmable to match the input operands
    • More flexible than DSP block
  – Multiplication/MAC
    • FPGA logic generates partial products
Parallel Counters and Generalized Parallel Counters (GPCs)
• m:n counter
  – Counts the number of input bits set to 1
  – m input bits
  – n = ⌈log2(m+1)⌉ output bits
  – Example: 6:3 counter
    • 6 input bits of rank i
    • 3 output bits of rank i, i+1, i+2
• GPC
  – Input bits may have different ranks
  – Example: (2, 5, 6; 5) GPC
    • 6 input bits of rank i
    • 5 input bits of rank i+1
    • 2 input bits of rank i+2
    • 5 output bits of rank i, …, i+4
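The output widths quoted for the 6:3 counter and the (2, 5, 6; 5) GPC follow from the largest value each can produce. The sketch below checks that arithmetic; the function names are hypothetical, and a GPC shape is written as in the slide, highest rank first.

```python
def counter_output_bits(m):
    """Output bits of an m:n counter: n = ceil(log2(m + 1)).

    For m >= 1 this equals m.bit_length(), which avoids floating point.
    """
    return m.bit_length()


def gpc_max_value(shape):
    """Largest sum a GPC can produce.

    shape lists input counts from the highest rank to the lowest,
    e.g. (2, 5, 6) means 2 bits of rank i+2, 5 of rank i+1, 6 of rank i.
    """
    top = len(shape) - 1
    return sum(k << (top - j) for j, k in enumerate(shape))


def gpc_output_bits(shape):
    """Bits needed to represent every possible GPC result."""
    return gpc_max_value(shape).bit_length()
```

For (2, 5, 6) the largest sum is 2·4 + 5·2 + 6 = 24, which needs 5 output bits, matching the "; 5" in the GPC name.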
FPCT Motivation (1/2)
[Figure: a chain of 15:4, 4:3, and 3:2 counters followed by a CPA. The bit counts between stages are 15, 4, 3, and 2.]
FPCT Motivation (2/2)
[Figure: several 15:4, 4:3, 3:2 counter chains placed side by side, feeding a Carry Propagate Adder (CPA).]
Compressor Slice (CSlice) Architecture
[Figure: CSlice block diagram. Components: Input Configuration Circuit, GPC Configuration Circuit, 31:5 counter, 5:3 and 3:2 counter stages, CPA, and Register. Bus widths labeled 16 and 31.]
Independently drive each input bit to 0
The 31:5 counter can implement a variety of 16-input, 5-output GPCs
The CSlice can be configured to produce multiple output bits.
Drive all carry-in bits to zero to break the carry chain
Choose the carry-save outputs or the output of the final CPA.
Store the carry-save or CPA output to a bypassable register.
Depending on the configuration different carry-out bits are propagated to the next CSlice
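One way to picture how a single counter can act as different GPCs is to wire each rank-j input bit to 2^j counter inputs and drive the unused inputs to zero. That wiring rule is an assumption here, not something the slide states; it is consistent with shapes such as (12, 7; 5) and (1, 9, 9; 5) in the results charts, whose weighted input counts both come to 31. The function names are hypothetical.

```python
def counter(bits):
    """m:n counter: the number of input bits set to 1."""
    return sum(bits)


def gpc_on_counter(columns, m=31):
    """Evaluate a GPC using one m-input counter.

    columns lists the input bits per rank, highest rank first, in the same
    order as the GPC name. Assumed wiring: a rank-j bit drives 2**j counter
    inputs, and unused counter inputs are driven to 0.
    """
    top = len(columns) - 1
    wired = []
    for j, column in enumerate(columns):
        weight = 1 << (top - j)
        for bit in column:
            wired.extend([bit] * weight)
    if len(wired) > m:
        raise ValueError("GPC does not fit in the counter")
    wired.extend([0] * (m - len(wired)))   # unused inputs driven to zero
    return counter(wired)
```

With all inputs of a (2, 5, 6; 5) GPC set to 1, `gpc_on_counter([[1]*2, [1]*5, [1]*6])` uses 24 of the 31 counter inputs and returns 24.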
FPCT Results
[Figure: bar chart of delay (ns, axis 0 to 30) for 3-ary adder tree, GPC mapping, and FPCT implementations, with no transformations and transformed [Verma et al., TCAD 2008]. Multiplier-based benchmarks: g721, hpoly, m10x10, m20x20, videomixer. Multi-input addition benchmarks: adpcm, fir3, fir6, H.264 ME. Annotations: "Use DSP blocks for multiplication"; "No multipliers, but benefits from transformations".]
CSlice Design Space
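[Figure: CSlice block diagram, as on the Compressor Slice (CSlice) Architecture slide, annotated with the three design parameters below.]
1. GPC Configuration Circuit / Input Configuration Circuit (GPCCC/ICC): enumerated
2. First counter size (FCS): {15:4, 31:5}
3. Maximum output rank configuration (MORC): {1, 2, 3}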
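The three parameters span a design space that can be walked as a Cartesian product. The sketch below is only a stand-in: the slides do not say how GPCCC/ICC candidates are enumerated, so `gpc_shapes` simply lists every GPC shape whose weighted input count (2^rank per bit) exactly fills the first counter, and `max_rank=3` is an assumed limit. All names are hypothetical.

```python
from itertools import product

FIRST_COUNTER_SIZES = {"15:4": 15, "31:5": 31}   # FCS name -> counter inputs
MORC_VALUES = (1, 2, 3)


def gpc_shapes(m, max_rank):
    """Yield shapes (k_max_rank, ..., k_0) with sum(k_j * 2**j) == m."""
    def rec(rank, remaining):
        if rank == 0:
            yield (remaining,)          # rank-0 bits fill what is left
            return
        weight = 1 << rank
        for k in range(remaining // weight + 1):
            for rest in rec(rank - 1, remaining - k * weight):
                yield (k,) + rest
    yield from rec(max_rank, m)


def design_space(max_rank=3):
    """Yield (FCS name, MORC, GPC shape) for every combination."""
    for (fcs, m), morc in product(FIRST_COUNTER_SIZES.items(), MORC_VALUES):
        for shape in gpc_shapes(m, max_rank):
            yield fcs, morc, shape
```

Under this assumption the shape written (12, 7; 5) in the charts appears as `(0, 0, 12, 7)` for the 31:5 counter, with leading zeros for the unused higher ranks.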
GPC Configuration Circuit
[Figure: GPC Configuration Circuit in front of a 15:4 counter. Some inputs are fixed rank-0 inputs; for the others, a configuration bit selects whether the input is rank 0 or rank 1.]
Benchmark Circuits
• Always generate a sufficient number of CSlices for each benchmark
Benchmark   Description                    FCSs Mapped
mul5x5      5x5-bit multiplication         31:5, 15:4
mul18x18    18x18-bit multiplication       31:5
mul36x18    36x18-bit multiplication       31:5
add8x32     Add 8 32-bit integers          31:5, 15:4
add16x16    Add 16 16-bit integers         31:5
FIR         3-tap FIR filter               31:5
SAD         Sum-of-Absolute-Differences    31:5, 15:4
Mapping
[Flowchart: inputs are FCS, MORC, and the input bit pattern. Enumerate next GPCCC; generate CSlice HDL; map input bit pattern; synthesize mapped FPCT; record delay/area. Repeat until all GPCCCs are enumerated, then done.]
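The loop on the Mapping slide can be sketched as a simple driver. Here `evaluate` is a hypothetical callback standing in for the three tool steps (generate CSlice HDL, map the input bit pattern, synthesize the mapped FPCT); the slides do not name the tools, so nothing below calls a real synthesis flow.

```python
def explore(gpcccs, fcs, morc, bit_pattern, evaluate):
    """Evaluate every GPCCC candidate for one (FCS, MORC, bit pattern).

    evaluate(gpccc, fcs, morc, bit_pattern) must return (delay, area).
    Returns a list of dicts, one per candidate, in enumeration order.
    """
    results = []
    for gpccc in gpcccs:                       # "Enumerate next GPCCC"
        delay, area = evaluate(gpccc, fcs, morc, bit_pattern)
        results.append({"gpccc": gpccc, "fcs": fcs, "morc": morc,
                        "delay": delay, "area": area})
    return results                             # "All GPCCCs enumerated?" -> done


def best_by(results, key):
    """Pick the candidate with the smallest value of 'delay' or 'area'."""
    return min(results, key=lambda r: r[key])
```

A caller would supply its own `evaluate`; for a dry run it can return fixed numbers so the loop and the ranking logic can be checked without running synthesis.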
Delay Results (31:5)
[Figure: Average Delay (FCS = 31:5). Bar chart of delay (ns, axis 0 to 7) for 20 combinations of MORC and GPC configuration circuit, ordered from best to worst. GPC shapes on the x-axis include (12, 7; 5), (11, 9; 5), (1, 9, 9; 5), (2, 27; 5), (1, 29; 5), and (31; 5).]
Area Results (31:5)
[Figure: Average Area (FCS = 31:5). Bar chart of area (µm², axis 0 to 70000) for 20 combinations of MORC and GPC configuration circuit, ordered from best to worst. Annotations give the delay ranking of selected points: 4, 2, 3 and 16, 5, 7.]
Utilization
• Input Utilization (Uin)
  – Fraction of first counter inputs used
  – Unused inputs driven to zero
• Output Utilization (Uout)
  – Fraction of CSlice outputs used if MORC > 1
• I/O Utilization (U = Uin · Uout)
  – Acceptable due to correlation between Uin and Uout
• Prune the search space with utilization
  – Only synthesize FPCTs for which utilization is high
  – Reduce cost of searching entire space
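The utilization metric and the pruning step can be written down directly from the definitions on this slide. This is a minimal sketch: the function names, the candidate tuple layout, and the threshold are hypothetical, and the slides do not state what cutoff counts as "high".

```python
def utilization(used_inputs, counter_inputs, used_outputs, total_outputs):
    """Return (Uin, Uout, U) with U = Uin * Uout.

    Uin  = fraction of first-counter inputs used
    Uout = fraction of CSlice outputs used
    """
    u_in = used_inputs / counter_inputs
    u_out = used_outputs / total_outputs
    return u_in, u_out, u_in * u_out


def prune(candidates, threshold):
    """Keep only candidates whose I/O utilization reaches the threshold.

    candidates: iterable of
        (name, used_inputs, counter_inputs, used_outputs, total_outputs)
    """
    kept = []
    for name, used_in, total_in, used_out, total_out in candidates:
        _, _, u = utilization(used_in, total_in, used_out, total_out)
        if u >= threshold:
            kept.append(name)
    return kept
```

Only the names returned by `prune` would then go on to synthesis, which is where the cost saving comes from.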
Correlation Between Input/Output Utilization
[Figure: mul36x18 I/O Utilization. Uin and Uout (axis 0 to 1.2) plotted for each GPC configuration circuit (x-axis 1 to 51), with separate series for MORC = 1, 2, and 3.]
I/O Utilization Generally Finds the Best Data Points per Benchmark
[Figure: mul36x18 Design Space: Delay vs. Area (FCS = 31:5). Scatter plot of delay (ns, axis 0 to 14) against area (µm², axis 85000 to 125000), with series for MORC = 1, 2, and 3. Annotation: 4 points with maximum utilization for MORC = 2 and 3 respectively.]
Conclusion
• FPCT
  – Programmable compressor tree integrated into an FPGA for improved arithmetic performance
• FPCT design space exploration
  – Tune FPCT CSlice architecture to a set of benchmarks
  – Prune the design space with utilization
• Two Pareto-optimal design points found
  – 1. Best average delay, near the middle in terms of average area
  – 2. Six virtually indistinguishable points
    • 2nd to 7th best average delays, 1st to 6th best average area
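Pareto-optimal points over delay and area can be picked out with a direct dominance check. The sketch below assumes lower is better on both axes; the function name is hypothetical, and the numbers in the usage note are made up for illustration, not taken from the slides.

```python
def pareto_front(points):
    """Return the points not dominated on (delay, area).

    points: list of (name, delay, area). A point q dominates p when q is
    no worse on both metrics and strictly better on at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q[1] <= p[1] and q[2] <= p[2] and (q[1] < p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

For example, with the illustrative points `[("A", 1.0, 50), ("B", 2.0, 30), ("C", 2.5, 40), ("D", 3.0, 30)]`, B beats C and D on both metrics, so only A and B remain on the front.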
References
Cevrero, A., et al. Architectural improvements for field programmable counter arrays: enabling efficient synthesis of compressor trees on FPGAs. FPGA, February 2008, pp. 181-190.
Kuon, I., and Rose, J. Measuring the gap between FPGAs and ASICs. IEEE TCAD, February 2007, pp. 203-215.
Parandeh-Afshar, H., Brisk, P., and Ienne, P. Efficient synthesis of compressor trees on FPGAs. ASPDAC, January 2008, pp. 138-143.
Parandeh-Afshar, H., Brisk, P., and Ienne, P. Improving synthesis of compressor trees on FPGAs via integer linear programming. DATE, April 2008, pp. 1256-1261.
Verma, A. K., Brisk, P., and Ienne, P. Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits. IEEE TCAD, October 2008, pp. 1761-1774.