register allocation for0.7mm high-level synthesis of0.7mm ... · fpgas suitable platform for...

Embedded SystemsCh

air

for

Faculty of Computer Science Institute of Computer Engineering, Chair for Embedded Systems

REGISTER ALLOCATION FORHIGH-LEVEL SYNTHESIS OFHARDWARE ACCELERATORSTARGETING FPGASGerald Hempel, Christian Hochberger, Jan Hoyer, Thilo Pionteck

Darmstadt, 12 July 2013

Embedded SystemsCh

air

for

Outline

• Motivation• Design Flow• Register Allocation• Results & Discussion• Summary

Darmstadt, 12 July 2013 Register Allocation slide 2 of 14

Embedded SystemsCh

air

for

Motivation

• FPGAs suitable platform for application-specific designs• Predominant concept → standard processor IP-core in combination with

application-specific accelerators• Goal: Automatic mapping of high level application code into HW

– Synthesis result is a combination of HLS and vendor synthesistools (XST, Synopsis, Altera Synthesizer)

How much effort is required in HLS optimization?

• Evaluation of several register allocation strategies for FPGA basedaccelerators on Spartan 6 and Artix 7

• Underlying synthesis tool: Xilinx XST P.49d in ISE 14.4


Embedded SystemsCh

air

for

SpartanMC Soft-Core and Kernel Interface• SpartanMC processor soft-core for software execution

• Program and data stored in local BRAM• HW accelerators treated as peripherals• Peripheral stub used as wrapper for

accelerator• Peripheral stub provides access to

memory and peripheral bus– Parameters transfered via

peripheral bus during kernelstartup

– Direct BRAM accesses (triggeredby pointers and arrays) duringkernel execution

SpartanMC

ProcessorCore

Accelerator

Peripheral Stub

BRAM

Peripheral Interface

...


Embedded SystemsCh

air

for

Typical Workflow


Embedded SystemsCh

air

for

Kernel Extraction using GCC

• GIMPLE passes for analysis and synthesis

AnalysisTranscript

SourcesApplication

HDL

BinaryObjects

GCCSynthesis Runs

GCCAnalysis Runs

* *

Intermediate

GIMPLE TreeAnalysis &

Kernel Estimation

GIMPLE Tree

HDL GenerationPatch &

0011110110010111011001101010

0101011

• Analysis:– Finds most worthy loop– Omits functions calls

(inlined functions only)

• Synthesis:– Uses list scheduling– Performs high-level register

allocation


Embedded SystemsCh

air

for

Register Allocation (Modified Left-Edge)


Embedded SystemsCh

air

for

Register Allocation Strategies

• le simple: Assigns a register for each GIMPLE-variable


Embedded SystemsCh

air

for


• le full: Minimize the number of registers


Embedded SystemsCh

air

for


• le uid: Maps variables with identical GIMPLE-ID to one register


Embedded SystemsCh

air

for

Benchmarks• Typical algorithms for embedded systems:

– base64 encoder, bitreverse, FFT, grayscale filter, IIR filter, haarwavelet transformation, matrix multiplication

• Kernels were generated automatically for compute intensive parts

• Parameter sweep:– Area or speed optimization– Spartan 6 or Artix 7 device– Allocation strategy

(le full, le simple, le 2, le 3, le 4,le 5, le uid)

– 8 benchmarks• 224 test cases O

ptim

izat

ion

Targ

et

Devic

e

Allocation Strategy


Embedded SystemsCh

air

for

Artix 7 Resource Consumption

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec

le_full le_uid le_simple le_2 le_3 le_4 le_5

1000

800

600

400

200

LUTs

Used LUTs (optimized for area)


Embedded SystemsCh

air

for

Artix 7 Resource Consumption

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec


1000

800

600

400

200

LUTs

• Allocation strategies show little effecton resource consumption

• Best results for le simple and le uid

Used LUTs (optimized for area)


Embedded SystemsCh

air

for

Artix 7 Frequency Results

frequency

[M

Hz]

150

200

250

300

350

400

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec


Achievable clock frequency(optimized for area)


Embedded SystemsCh

air

for

Artix 7 Frequency Results

frequency

[M

Hz]

150

200

250

300

350

400

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec


significant influence

• Results for le 2, le 3, le 4, le 5 hard topredict (s. haar wavelet, bit reverseand base 64 encoder)

• FFT, grayscale filter, IIR filter andmatrix multiplication → negativeeffect for all allocation strategiesexcept le uid and le simple

Achievable clock frequency(optimized for area)


Embedded SystemsCh

air

for

What is the difference?Benchmarks #GIMPLE

variables#basicblocks

#branches #ALUoperations

#MULoperations

base64 encode 50 3 1 21 0bitreverse 62 22 11 61 0bitreverse (spec.) 62 6 3 61 0haar wavelet 14 3 1 7 0

FFT 51 6 2 22 4grayscale filter 29 3 1 17 2IIR filter 79 6 2 29 11matrix mult 59 3 1 25 8

• Multiply-accumulate operation are mapped to sequential DSP primitives• Multiplexer tree caused by variable swaps leads to performance

degradations


Embedded SystemsCh

air

for

Artix 7 normalized Frequency

Artix 7 (Speed) Artix 7 (Area) Spartan 6 (Speed) Spartan 6 (Area)FF

T

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

norm

aliz

ed

fre

qu

en

cy

0.4

0.5

0.6

0.7

0.8

0.9

1

le_uid le_simple le_2 le_4 le_5 le_fullle_3

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

• Achievable clock frequency optimized for area normalized to one

• Gap of about 30% between le simple, le uid and more complexalgorithms


Embedded SystemsCh

air

for

Artix 7 normalized Frequency


T

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

norm

aliz

ed

fre

qu

en

cy

0.4

0.5

0.6

0.7

0.8

0.9

1


FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

• Achievable clock frequency optimized for area normalized to one• Gap of about 30% between le simple, le uid and more complex

algorithmsDarmstadt, 12 July 2013 Register Allocation slide 13 of 14

Embedded SystemsCh

air

for

Conclusion

For HLS targeting FPGAs:

• Complex registers allocation strategies resulting in a reuse of a singleregister for many variables are not advisable

– Little effect on area consumption– May lead to performance degradations (up to 30%)

• Naive allocation strategy gives most freedom to synthesis tool andmostly better results


Embedded SystemsCh

air

for

Artix 7 normalized LUTs


T

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

norm

aliz

ed

usa

ge o

f LU

Ts

0.6

0.7

0.8

0.9

1


FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

• A simple allocation strategy inclines a small resource footprint.


Embedded SystemsCh

air

for

Artix 7 Frequency and LUTs (speed)

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

150

200

250

300

350

400

freq

uen

cy[M

Hz]

le full le uid le simple le 2 le 3 le 4 le 5

• Achievable clock frequencyoptimized for speed

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

200

400

600

800

1,000

LU

Ts


• Used LUTs optimized forspeed


Embedded SystemsCh

air

for

Spartan 6 Frequency and LUTs (area)

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

100

150

200

250

300

350

freq

uen

cy[M

Hz]


• Achievable clock frequencyoptimized for area

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

200

400

600

800

1,000

LU

Ts


• Used LUTs optimized forarea


Embedded SystemsCh

air

for

Spartan 6 Frequency and LUTs (speed)

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

100

150

200

250

300

350

freq

uen

cy[M

Hz]


• Achievable clock frequencyoptimized for speed

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

200

400

600

800

1,000

1,200

1,400

LU

Ts


• Used LUTs optimized forspeed


register allocation for0.7mm high-level synthesis of0.7mm ... · fpgas suitable platform for...

Documents