design space exploration for field-programmable compressor trees

Design Space Exploration for Field-Programmable Compressor Trees

Seyed Hosein Attarzadeh Niaki (1), Alessandro Cevrero (2), Philip Brisk (3), Chrysostomos Nicopoulos (3), Frank K. Gurkaynak (4), Yusuf Leblebici (2), Paolo Ienne (3)

(1) Royal Institute of Technology, School of Information and Communication Technology, Stockholm, Sweden

(2) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Engineering, Lausanne, Switzerland

(3) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland

(4) Swiss Federal Institute of Technology, Zurich, Microelectronics Design Center, Zurich, Switzerland

Project Overview

• Goal
  – Accelerate multi-input addition on FPGAs
    • H.264 motion estimation
    • 3G wireless base station channel cards
    • FIR filters
    • Exposed via systematic dataflow transformations [Verma et al., TCAD 2008]
  – Field Programmable Compressor Tree (FPCT)
    • [Cevrero et al., FPGA 2008]
    • More flexibility than DSP blocks
      – Can benefit from dataflow transformations
      – DSP blocks cannot
    • Better performance than LUT-based logic

Dataflow Transformations: Example

[Figure: ADPCM dataflow graph before and after transformation. Node labels: step, delta, vpdiff, >>, &, = 0, SEL, +. In the transformed graph the additions are collected into a compressor tree (∑) followed by a final adder (+).]

Contribution

• Design Space Exploration
  – Tune the design of an FPCT to match the needs of a representative set of arithmetically intensive benchmarks

Outline

• Arithmetic Tutorial: Compressor Trees

• Field Programmable Compressor Tree

• Design Space Exploration

• Results

• Conclusion


Compressor Trees

[Figure: three uses of a compressor tree. Panels: Multi-input Adder, Parallel Multiplier, Multiply-Accumulate. Labels: Partial Product Generator (PPG), mn bits, Compressor Tree with S and C outputs, final CPA (+), m+n bits.]

Arithmetic on FPGAs

• DSP blocks
  – Fixed-bitwidth multiply/MAC
    • FPGA logic can be faster when there are bitwidth mismatches [Kuon and Rose, TCAD 2007]
  – Cannot bypass PPG
    • No multi-input addition
    • Cannot exploit dataflow transformations that expose large compressor trees [Verma et al., TCAD 2008]

• FPGA logic
  – 3-ary addition
    • LUTs + carry chains
    • Altera Stratix II-IV, Xilinx Virtex-5
  – Compressor tree synthesis
    • [Parandeh-Afshar et al., ASPDAC 2008, DATE 2008]
    • Faster than 3-ary adder trees
    • Does not use carry chains

Field Programmable Compressor Tree

• Programmable core integrated into an FPGA
  – Supports multi-input addition
    • Unlike DSP blocks
    • Can exploit dataflow transformations [Verma et al., TCAD 2008]
  – Programmable to match the input operands
    • More flexible than DSP blocks
  – Multiplication/MAC
    • FPGA logic generates partial products

Parallel Counters and Generalized Parallel Counters (GPCs)

• m:n counter
  – Counts the number of input bits set to 1
  – m input bits
  – n = ⌈log2(m+1)⌉ output bits
  – Example: 6:3 counter, with 6 input bits of rank i and 3 output bits of rank i, i+1, i+2

• GPC
  – Input bits may have different ranks
  – Example: (2, 5, 6; 5) GPC, with 6 input bits of rank i, 5 input bits of rank i+1, 2 input bits of rank i+2, and 5 output bits of rank i, …, i+4
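The output widths quoted for the 6:3 counter and the (2, 5, 6; 5) GPC can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the shape tuple is assumed to list input counts from the highest rank down to rank i, as in the slide's notation.

```python
def counter_output_bits(m: int) -> int:
    """Output width n of an m:n counter, n = ceil(log2(m + 1)).

    For m >= 1 this equals m.bit_length(), which avoids floating point.
    """
    if m < 1:
        raise ValueError("a counter needs at least one input")
    return m.bit_length()


def gpc_max_sum(shape: tuple) -> int:
    """Largest value a GPC can produce.

    `shape` lists the number of input bits per rank, highest rank first,
    i.e. the (k_top, ..., k_0; n) notation without the trailing n.
    """
    top = len(shape) - 1
    return sum(k << (top - i) for i, k in enumerate(shape))


def gpc_output_bits(shape: tuple) -> int:
    """Output width needed to represent every possible GPC sum."""
    return max(1, gpc_max_sum(shape).bit_length())


print(counter_output_bits(6))      # 6:3 counter -> 3
print(gpc_max_sum((2, 5, 6)))      # 2*4 + 5*2 + 6*1 -> 24
print(gpc_output_bits((2, 5, 6)))  # (2, 5, 6; 5) GPC -> 5
```

The largest sum of the (2, 5, 6; 5) GPC is 24, which fits in 5 bits, consistent with the 5 output bits of rank i, …, i+4.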

FPCT Motivation (1/2)

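[Figure: chain of counters with wire widths labeled: 15 bits into a 15:4 counter, 4 bits into a 4:3 counter, 3 bits into a 3:2 counter, 2 bits into a CPA.]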

FPCT Motivation (2/2)

[Figure: array of 15:4, 4:3, and 3:2 counters arranged in columns, feeding a Carry Propagate Adder (CPA).]

Compressor Slice (CSlice) Architecture

[Figure: CSlice block diagram. Blocks: Input Configuration Circuit, GPC Configuration Circuit, 31:5 counter, 5:3 and 3:2 counters, CPA, Register. Wire widths labeled 16 and 31.]

• Independently drive each input bit to 0.
• The 31:5 counter can implement a variety of 16-input, 5-output GPCs.
• The CSlice can be configured to produce multiple output bits.
• Drive all carry-in bits to zero to break the carry chain.
• Choose the carry-save outputs or the output of the final CPA.
• Store the carry-save or CPA output to a bypassable register.
• Depending on the configuration, different carry-out bits are propagated to the next CSlice.
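One way to picture how a plain counter can act as a GPC is to connect each rank-j input bit to 2^j counter inputs and drive the unused counter inputs to 0. The sketch below models that idea in Python. It is an assumption made for illustration, not a description of the actual Input Configuration Circuit or GPC Configuration Circuit, and the function names are hypothetical.

```python
def wire_gpc_to_counter(shape, bits_by_rank, m=31):
    """Build the m-bit input vector of an m:n counter for one GPC evaluation.

    shape        -- input counts per rank, highest rank first, e.g. (2, 5, 6)
    bits_by_rank -- one list of 0/1 values per rank, in the same order
    Assumption: a rank-j bit is replicated onto 2**j counter inputs.
    """
    top = len(shape) - 1
    wires = []
    for i, (count, bits) in enumerate(zip(shape, bits_by_rank)):
        if len(bits) != count:
            raise ValueError("bit list does not match the GPC shape")
        rank = top - i
        for b in bits:
            wires.extend([b] * (1 << rank))
    if len(wires) > m:
        raise ValueError("GPC does not fit into the counter")
    wires.extend([0] * (m - len(wires)))  # unused inputs driven to 0
    return wires


def counter(wires):
    """Model of an m:n counter: the number of inputs set to 1."""
    return sum(wires)


# (2, 5, 6; 5) GPC evaluated on a 31:5 counter.
inputs = [[1, 1], [1, 0, 1, 0, 1], [1, 1, 1, 1, 1, 1]]
wires = wire_gpc_to_counter((2, 5, 6), inputs)
print(len(wires), counter(wires))  # 31 20
```

Here the two rank-(i+2) bits contribute 8, the three set rank-(i+1) bits contribute 6, and the six rank-i bits contribute 6, so the counter output is 20; the remaining 7 counter inputs stay at 0.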

FPCT Results

[Figure: bar chart of delay (ns, axis 0 to 30) per benchmark. Benchmarks: g721, hpoly, m10x10, m20x20, videomixer, adpcm, fir3, fir6, H.264 ME. Groups: Multiplier-based Benchmarks (callout: use DSP blocks for multiplication) and Multi-input Addition Benchmarks (callout: no multipliers, but benefits from transformations). Legend: No Transformations; Transformed [Verma et al., TCAD 2008]; 3-ary adder tree; GPC Mapping; FPCT.]

CSlice Design Space

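[Figure: CSlice block diagram (Input Configuration Circuit, 31:5, 5:3, and 3:2 counters, CPA, Register; wire widths 16 and 31) annotated with the three design parameters listed here.]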

1. GPC Configuration Circuit / Input Configuration Circuit (GPCCC/ICC) {enumerated}

2. First counter size (FCS) {15:4, 31:5}

3. Max. Output Rank Config. (MORC) {1, 2, 3}
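The three parameters span a design space that can be enumerated mechanically. The Python sketch below is a simplified model under two loudly stated assumptions that are not taken from the slides: a GPC shape is treated as valid when its rank-weighted inputs exactly fill the first counter, and shapes are combined freely with every MORC value. All names are hypothetical.

```python
from itertools import product

# First counter sizes (FCS) and their input counts, as listed above.
FIRST_COUNTER_SIZES = {"15:4": 15, "31:5": 31}
MORC_VALUES = (1, 2, 3)


def gpc_shapes(counter_inputs, max_rank):
    """Yield shapes (k_max_rank, ..., k_0), highest rank first, with
    sum(k_j * 2**j) == counter_inputs (modeling assumption)."""
    def rec(rank, remaining):
        if rank == 0:
            yield (remaining,)
            return
        weight = 1 << rank
        for k in range(remaining // weight + 1):
            for rest in rec(rank - 1, remaining - k * weight):
                yield (k,) + rest
    yield from rec(max_rank, counter_inputs)


def design_points(max_rank=1):
    """Yield (FCS label, MORC, GPC shape) triples."""
    for (fcs, inputs), morc in product(FIRST_COUNTER_SIZES.items(), MORC_VALUES):
        for shape in gpc_shapes(inputs, max_rank):
            yield fcs, morc, shape


print(sum(1 for _ in design_points(max_rank=1)))  # (8 + 16) shapes * 3 MORCs = 72
```

With inputs limited to ranks 0 and 1, a 15:4 counter admits 8 shapes and a 31:5 counter admits 16, so this simplified space has 72 points; allowing higher ranks grows it quickly, which is why pruning matters.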


[Figure: GPC Configuration Circuit in front of a 15:4 counter. Labels: Configuration Bit; Rank-0 inputs; Inputs can be configured as rank-0 or 1.]

Benchmark Circuits

• Always generate a sufficient number of CSlices for each benchmark

Benchmark   Description                    FCSs Mapped
mul5x5      5x5-bit multiplication         31:5, 15:4
mul18x18    18x18-bit multiplication       31:5
mul36x18    36x18-bit multiplication       31:5
add8x32     Add 8 32-bit integers          31:5, 15:4
add16x16    Add 16 16-bit integers         31:5
FIR         3-tap FIR filter               31:5
SAD         Sum-of-Absolute-Differences    31:5, 15:4

Mapping

[Flowchart: inputs are the FCS, MORC, and input bit pattern. Loop: enumerate next GPCCC, generate CSlice HDL, map input bit pattern, synthesize mapped FPCT, record delay/area. When all GPCCCs are enumerated, done.]
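The mapping flow can be read as a simple loop over GPCCCs. The Python sketch below shows that control structure only. The three callables stand in for the HDL generator, the mapper, and the synthesis tool; their names, signatures, and the stub bodies are hypothetical, and the stub returns placeholder numbers rather than measurements.

```python
from typing import List, NamedTuple, Tuple


class Result(NamedTuple):
    gpccc: Tuple[int, ...]
    delay_ns: float
    area_um2: float


def explore(gpcccs, fcs, morc, pattern,
            generate_hdl, map_pattern, synthesize) -> List[Result]:
    """Run one exploration pass for a fixed FCS, MORC, and input bit pattern."""
    results = []
    for gpccc in gpcccs:                        # enumerate next GPCCC
        hdl = generate_hdl(fcs, morc, gpccc)    # generate CSlice HDL
        mapped = map_pattern(hdl, pattern)      # map input bit pattern
        delay, area = synthesize(mapped)        # synthesize mapped FPCT
        results.append(Result(gpccc, delay, area))
    return results                              # all GPCCCs enumerated: done


# Stubs so the loop can be exercised without any tools (not real results).
def stub_generate_hdl(fcs, morc, gpccc):
    return f"cslice fcs={fcs} morc={morc} gpccc={gpccc}"


def stub_map_pattern(hdl, pattern):
    return (hdl, tuple(pattern))


def stub_synthesize(mapped):
    return (0.0, 0.0)  # placeholder delay/area


# Hypothetical input bit pattern: 8 bits in each of 32 columns (add8x32-like).
results = explore([(12, 7), (1, 29)], "31:5", 2, [8] * 32,
                  stub_generate_hdl, stub_map_pattern, stub_synthesize)
print([r.gpccc for r in results])
```

Passing the tool steps in as callables keeps the loop independent of any particular synthesis flow.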

Delay Results (31:5)

[Figure: Average Delay (FCS = 31:5). Bar chart of average delay in ns (axis 0 to 7), one bar per MORC and GPC Config. Circuit combination, with the best and worst ends marked.
x-axis labels, in order (GPCCC/ICC): (12,7;5), (11,9;5), (10,11;5), (13,5;5), (1,9,9;5), (12,7;5), (1,10,7;5), (2,7,9;5), (1,8,11;5), (2,6,11;5), (1,1,1,17;5), (2,27;5), (1,1,25;5), (2,0,23;5), (1,1,0,19;5), (1,0,1,21;5), (1,0,27;5), (1,0,0,23;5), (1,29;5), (31;5)
MORC row, in order: 0 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2]

Area Results (31:5)

[Figure: Average Area (FCS = 31:5). Bar chart of average area in µm² (axis 0 to 70000), one bar per MORC and GPC Config. Circuit combination, with the best and worst ends marked. Callouts give the delay ranking of selected bars: 4, 2, 3 and 16, 5, 7.
x-axis labels, in order (GPCCC/ICC): (13,5;5), (11,9;5), (10,11;5), (12,7;5), (1,9,9;5), (1,10,7;5), (12,7;5), (11,9;5), (13,5;5), (12,7;5), (1,29;5), (3,0,19;5), (2,27;5), (1,0,0,23;5), (1,0,27;5), (1,1,25;5), (2,0,23;5), (1,29;5), (1,0,27;5), (31;5)
MORC row, in order: 1 1 1 1 1 1 2 2 2 0 1 2 2 2 1 2 2 2 2 2]

Utilization

• Input Utilization (Uin)
  – Fraction of first counter inputs used
  – Unused inputs driven to zero

• Output Utilization (Uout)
  – Fraction of CSlice outputs used if MORC > 1

• I/O Utilization (U = Uin × Uout)
  – Acceptable as a single metric because Uin and Uout are correlated

• Prune the search space with utilization
  – Only synthesize FPCTs for which utilization is high
  – Reduce cost of searching entire space
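Utilization-based pruning can be sketched in a few lines of Python. The sketch assumes that Uin and Uout are averaged over all CSlices of a mapped FPCT and that the I/O utilization is their product; the function names, the per-CSlice input format, and the "keep the top k" rule are hypothetical choices for illustration.

```python
def io_utilization(used_inputs, inputs_per_cslice, used_outputs, outputs_per_cslice):
    """Return (Uin, Uout, U) for one mapped FPCT.

    used_inputs / used_outputs hold one entry per CSlice: how many first
    counter inputs and how many CSlice outputs the mapping actually uses.
    """
    n = len(used_inputs)
    if n == 0 or n != len(used_outputs):
        raise ValueError("need matching, non-empty per-CSlice lists")
    u_in = sum(used_inputs) / (n * inputs_per_cslice)
    u_out = sum(used_outputs) / (n * outputs_per_cslice)
    return u_in, u_out, u_in * u_out


def prune_by_utilization(candidates, keep):
    """Keep the `keep` configurations with the highest I/O utilization.

    candidates maps a configuration label to its U value.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [cfg for cfg, _ in ranked[:keep]]


# Two CSlices on a 31-input first counter; one of them is left empty.
print(io_utilization([31, 0], 31, [2, 2], 4))  # (0.5, 0.5, 0.25)
print(prune_by_utilization({"a": 0.5, "b": 0.9, "c": 0.7}, keep=2))  # ['b', 'c']
```

Only the configurations returned by the pruning step would then be handed to synthesis, which is where the search cost is saved.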


Correlation Between Input/Output Utilization

[Figure: mul36x18 I/O Utilization. Line chart of utilization (axis 0 to 1.2) against GPC Config. Circuit index (1 to 51), with one Uin series and one Uout series per MORC setting.]

I/O Utilization Generally Finds the Best Data Points per Benchmark

[Figure: mul36x18 Design Space: Delay vs. Area (FCS = 31:5). Scatter plot of delay in ns (axis 0 to 14) against area in µm² (axis 85000 to 125000), with series MORC = 1, MORC = 2, MORC = 3. Callout: 4 points with maximum utilization for MORC = 2 and 3 respectively.]

Conclusion

• FPCT
  – Programmable compressor tree integrated into an FPGA for improved arithmetic performance

• FPCT design space exploration
  – Tune FPCT CSlice architecture to a set of benchmarks
  – Prune the design space with utilization

• Two Pareto-optimal design points found
  1. Best average delay, near the middle in terms of average area
  2. Six virtually indistinguishable points
    – 2nd – 7th best average delays, 1st – 6th best average area
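Pareto-optimal points in a delay/area design space can be extracted with a straightforward dominance check. The Python sketch below uses made-up numbers purely to exercise the function; they are not results from this work, and the function name is hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated in (delay, area); lower is better.

    points is a list of (label, delay, area) tuples. A point is dominated
    when another point is no worse in both metrics and strictly better in
    at least one.
    """
    front = []
    for label, delay, area in points:
        dominated = any(
            d2 <= delay and a2 <= area and (d2 < delay or a2 < area)
            for _, d2, a2 in points
        )
        if not dominated:
            front.append((label, delay, area))
    return front


# Illustrative values only (not measured data).
candidates = [("A", 1.0, 5.0), ("B", 2.0, 3.0), ("C", 2.5, 4.0), ("D", 3.0, 1.0)]
print([label for label, _, _ in pareto_front(candidates)])  # ['A', 'B', 'D']
```

Point C is dropped because B is better in both delay and area; A, B, and D each win on at least one metric against every other point.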


References

Cevrero, A., et al. Architectural improvements for field programmable counter arrays: enabling efficient synthesis of compressor trees on FPGAs. FPGA, February 2008, pp. 181-190.

Kuon, I., and Rose, J. Measuring the gap between FPGAs and ASICs. IEEE TCAD, February 2007, pp. 203-215.

Parandeh-Afshar, H., Brisk, P., and Ienne, P. Efficient synthesis of compressor trees on FPGAs. ASPDAC, January 2008, pp. 138-143.

Parandeh-Afshar, H., Brisk, P., and Ienne, P. Improving synthesis of compressor trees on FPGAs via integer linear programming. DATE, April 2008, pp. 1256-1261.

Verma, A. K., Brisk, P., and Ienne, P. Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits. IEEE TCAD, October 2008, pp. 1761-1774.
