register allocation for0.7mm high-level synthesis of0.7mm ... · fpgas suitable platform for...

29
Embedded Systems Chair for Faculty of Computer Science Institute of Computer Engineering, Chair for Embedded Systems REGISTER ALLOCATION FOR HIGH-LEVEL SYNTHESIS OF HARDWARE ACCELERATORS TARGETING FPGAS Gerald Hempel, Christian Hochberger, Jan Hoyer, Thilo Pionteck Darmstadt, 12 July 2013

Upload: others

Post on 24-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Faculty of Computer Science Institute of Computer Engineering, Chair for Embedded Systems

REGISTER ALLOCATION FORHIGH-LEVEL SYNTHESIS OFHARDWARE ACCELERATORSTARGETING FPGASGerald Hempel, Christian Hochberger, Jan Hoyer, Thilo Pionteck

Darmstadt, 12 July 2013

Page 2: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Outline

• Motivation• Design Flow• Register Allocation• Results & Discussion• Summary

Darmstadt, 12 July 2013 Register Allocation slide 2 of 14

Page 3: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Motivation

• FPGAs suitable platform for application-specific designs• Predominant concept → standard processor IP-core in combination with

application-specific accelerators• Goal: Automatic mapping of high level application code into HW

– Synthesis result is a combination of HLS and vendor synthesistools (XST, Synopsis, Altera Synthesizer)

How much effort is required in HLS optimization?

• Evaluation of several register allocation strategies for FPGA basedaccelerators on Spartan 6 and Artix 7

• Underlying synthesis tool: Xilinx XST P.49d in ISE 14.4

Darmstadt, 12 July 2013 Register Allocation slide 3 of 14

Page 4: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Motivation

• FPGAs suitable platform for application-specific designs• Predominant concept → standard processor IP-core in combination with

application-specific accelerators• Goal: Automatic mapping of high level application code into HW

– Synthesis result is a combination of HLS and vendor synthesistools (XST, Synopsis, Altera Synthesizer)

How much effort is required in HLS optimization?

• Evaluation of several register allocation strategies for FPGA basedaccelerators on Spartan 6 and Artix 7

• Underlying synthesis tool: Xilinx XST P.49d in ISE 14.4

Darmstadt, 12 July 2013 Register Allocation slide 3 of 14

Page 5: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

SpartanMC Soft-Core and Kernel Interface• SpartanMC processor soft-core for software execution

• Program and data stored in local BRAM• HW accelerators treated as peripherals• Peripheral stub used as wrapper for

accelerator• Peripheral stub provides access to

memory and peripheral bus– Parameters transfered via

peripheral bus during kernelstartup

– Direct BRAM accesses (triggeredby pointers and arrays) duringkernel execution

SpartanMC

ProcessorCore

Accelerator

Peripheral Stub

BRAM

Peripheral Interface

...

Darmstadt, 12 July 2013 Register Allocation slide 4 of 14

Page 6: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Typical Workflow

Darmstadt, 12 July 2013 Register Allocation slide 5 of 14

Page 7: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Typical Workflow

Darmstadt, 12 July 2013 Register Allocation slide 5 of 14

Page 8: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Kernel Extraction using GCC

• GIMPLE passes for analysis and synthesis

AnalysisTranscript

SourcesApplication

HDL

BinaryObjects

GCCSynthesis Runs

GCCAnalysis Runs

* *

Intermediate

GIMPLE TreeAnalysis &

Kernel Estimation

GIMPLE Tree

HDL GenerationPatch &

0011110110010111011001101010

0101011

• Analysis:– Finds most worthy loop– Omits functions calls

(inlined functions only)

• Synthesis:– Uses list scheduling– Performs high-level register

allocation

Darmstadt, 12 July 2013 Register Allocation slide 6 of 14

Page 9: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Kernel Extraction using GCC

• GIMPLE passes for analysis and synthesis

AnalysisTranscript

SourcesApplication

HDL

BinaryObjects

GCCSynthesis Runs

GCCAnalysis Runs

* *

Intermediate

GIMPLE TreeAnalysis &

Kernel Estimation

GIMPLE Tree

HDL GenerationPatch &

0011110110010111011001101010

0101011

• Analysis:– Finds most worthy loop– Omits functions calls

(inlined functions only)

• Synthesis:– Uses list scheduling– Performs high-level register

allocation

Darmstadt, 12 July 2013 Register Allocation slide 6 of 14

Page 10: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Kernel Extraction using GCC

• GIMPLE passes for analysis and synthesis

AnalysisTranscript

SourcesApplication

HDL

BinaryObjects

GCCSynthesis Runs

GCCAnalysis Runs

* *

Intermediate

GIMPLE TreeAnalysis &

Kernel Estimation

GIMPLE Tree

HDL GenerationPatch &

0011110110010111011001101010

0101011

• Analysis:– Finds most worthy loop– Omits functions calls

(inlined functions only)

• Synthesis:– Uses list scheduling– Performs high-level register

allocation

Darmstadt, 12 July 2013 Register Allocation slide 6 of 14

Page 11: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Register Allocation (Modified Left-Edge)

Darmstadt, 12 July 2013 Register Allocation slide 7 of 14

Page 12: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Register Allocation (Modified Left-Edge)

Darmstadt, 12 July 2013 Register Allocation slide 7 of 14

Page 13: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Register Allocation Strategies

• le simple: Assigns a register for each GIMPLE-variable

Darmstadt, 12 July 2013 Register Allocation slide 8 of 14

Page 14: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Register Allocation Strategies

• le full: Minimize the number of registers

Darmstadt, 12 July 2013 Register Allocation slide 8 of 14

Page 15: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Register Allocation Strategies

• le uid: Maps variables with identical GIMPLE-ID to one register

Darmstadt, 12 July 2013 Register Allocation slide 8 of 14

Page 16: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Benchmarks• Typical algorithms for embedded systems:

– base64 encoder, bitreverse, FFT, grayscale filter, IIR filter, haarwavelet transformation, matrix multiplication

• Kernels were generated automatically for compute intensive parts

• Parameter sweep:– Area or speed optimization– Spartan 6 or Artix 7 device– Allocation strategy

(le full, le simple, le 2, le 3, le 4,le 5, le uid)

– 8 benchmarks• 224 test cases O

ptim

izat

ion

Targ

et

Devic

e

Allocation Strategy

Darmstadt, 12 July 2013 Register Allocation slide 9 of 14

Page 17: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 Resource Consumption

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec

le_full le_uid le_simple le_2 le_3 le_4 le_5

1000

800

600

400

200

LUTs

Used LUTs (optimized for area)

Darmstadt, 12 July 2013 Register Allocation slide 10 of 14

Page 18: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 Resource Consumption

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec

le_full le_uid le_simple le_2 le_3 le_4 le_5

1000

800

600

400

200

LUTs

• Allocation strategies show little effecton resource consumption

• Best results for le simple and le uid

Used LUTs (optimized for area)

Darmstadt, 12 July 2013 Register Allocation slide 10 of 14

Page 19: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 Frequency Results

frequency

[M

Hz]

150

200

250

300

350

400

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec

le_full le_uid le_simple le_2 le_3 le_4 le_5

Achievable clock frequency(optimized for area)

Darmstadt, 12 July 2013 Register Allocation slide 11 of 14

Page 20: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 Frequency Results

frequency

[M

Hz]

150

200

250

300

350

400

base

64bi

t rev

erse

haar

wav

elet

FFT

gray

scal

e

IIR fi

lter

mat

rix m

ult

bit r

ever

se s

pec

le_full le_uid le_simple le_2 le_3 le_4 le_5

significant influence

• Results for le 2, le 3, le 4, le 5 hard topredict (s. haar wavelet, bit reverseand base 64 encoder)

• FFT, grayscale filter, IIR filter andmatrix multiplication → negativeeffect for all allocation strategiesexcept le uid and le simple

Achievable clock frequency(optimized for area)

Darmstadt, 12 July 2013 Register Allocation slide 11 of 14

Page 21: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

What is the difference?Benchmarks #GIMPLE

variables#basicblocks

#branches #ALUoperations

#MULoperations

base64 encode 50 3 1 21 0bitreverse 62 22 11 61 0bitreverse (spec.) 62 6 3 61 0haar wavelet 14 3 1 7 0

FFT 51 6 2 22 4grayscale filter 29 3 1 17 2IIR filter 79 6 2 29 11matrix mult 59 3 1 25 8

• Multiply-accumulate operation are mapped to sequential DSP primitives• Multiplexer tree caused by variable swaps leads to performance

degradations

Darmstadt, 12 July 2013 Register Allocation slide 12 of 14

Page 22: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

What is the difference?Benchmarks #GIMPLE

variables#basicblocks

#branches #ALUoperations

#MULoperations

base64 encode 50 3 1 21 0bitreverse 62 22 11 61 0bitreverse (spec.) 62 6 3 61 0haar wavelet 14 3 1 7 0

FFT 51 6 2 22 4grayscale filter 29 3 1 17 2IIR filter 79 6 2 29 11matrix mult 59 3 1 25 8

• Multiply-accumulate operation are mapped to sequential DSP primitives• Multiplexer tree caused by variable swaps leads to performance

degradations

Darmstadt, 12 July 2013 Register Allocation slide 12 of 14

Page 23: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 normalized Frequency

Artix 7 (Speed) Artix 7 (Area) Spartan 6 (Speed) Spartan 6 (Area)FF

T

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

norm

aliz

ed

fre

qu

en

cy

0.4

0.5

0.6

0.7

0.8

0.9

1

le_uid le_simple le_2 le_4 le_5 le_fullle_3

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

• Achievable clock frequency optimized for area normalized to one

• Gap of about 30% between le simple, le uid and more complexalgorithms

Darmstadt, 12 July 2013 Register Allocation slide 13 of 14

Page 24: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 normalized Frequency

Artix 7 (Speed) Artix 7 (Area) Spartan 6 (Speed) Spartan 6 (Area)FF

T

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

norm

aliz

ed

fre

qu

en

cy

0.4

0.5

0.6

0.7

0.8

0.9

1

le_uid le_simple le_2 le_4 le_5 le_fullle_3

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

• Achievable clock frequency optimized for area normalized to one• Gap of about 30% between le simple, le uid and more complex

algorithmsDarmstadt, 12 July 2013 Register Allocation slide 13 of 14

Page 25: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Conclusion

For HLS targeting FPGAs:

• Complex registers allocation strategies resulting in a reuse of a singleregister for many variables are not advisable

– Little effect on area consumption– May lead to performance degradations (up to 30%)

• Naive allocation strategy gives most freedom to synthesis tool andmostly better results

Darmstadt, 12 July 2013 Register Allocation slide 14 of 14

Page 26: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 normalized LUTs

Artix 7 (Speed) Artix 7 (Area) Spartan 6 (Speed) Spartan 6 (Area)FF

T

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

norm

aliz

ed

usa

ge o

f LU

Ts

0.6

0.7

0.8

0.9

1

le_uid le_simple le_2 le_4 le_5 le_fullle_3

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

FFT

gray

scal

e fil

ter

mat

rix m

ult

IIR fi

lter

• A simple allocation strategy inclines a small resource footprint.

Darmstadt, 12 July 2013 Register Allocation slide 15 of 14

Page 27: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Artix 7 Frequency and LUTs (speed)

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

150

200

250

300

350

400

freq

uen

cy[M

Hz]

le full le uid le simple le 2 le 3 le 4 le 5

• Achievable clock frequencyoptimized for speed

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

200

400

600

800

1,000

LU

Ts

le full le uid le simple le 2 le 3 le 4 le 5

• Used LUTs optimized forspeed

Darmstadt, 12 July 2013 Register Allocation slide 16 of 14

Page 28: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Spartan 6 Frequency and LUTs (area)

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

100

150

200

250

300

350

freq

uen

cy[M

Hz]

le full le uid le simple le 2 le 3 le 4 le 5

• Achievable clock frequencyoptimized for area

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

200

400

600

800

1,000

LU

Ts

le full le uid le simple le 2 le 3 le 4 le 5

• Used LUTs optimized forarea

Darmstadt, 12 July 2013 Register Allocation slide 17 of 14

Page 29: Register Allocation for0.7mm High-Level Synthesis of0.7mm ... · FPGAs suitable platform for application-specific designs Predominant concept ! standard processor IP-core in combination

Embedded SystemsCh

air

for

Spartan 6 Frequency and LUTs (speed)

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

100

150

200

250

300

350

freq

uen

cy[M

Hz]

le full le uid le simple le 2 le 3 le 4 le 5

• Achievable clock frequencyoptimized for speed

base

64

bitre

vers

e

bitre

vers

esp

ec fft

gray

scale

haar

wavelet iir

mat

rixm

ult

200

400

600

800

1,000

1,200

1,400

LU

Ts

le full le uid le simple le 2 le 3 le 4 le 5

• Used LUTs optimized forspeed

Darmstadt, 12 July 2013 Register Allocation slide 18 of 14