stretching performance potentials beyond hardware constraints

INTEGRATED SELECTION, PARTITIONING AND PLACEMENT FRAMEWORK FOR RECONFIGURABLE ARCHITECTURES By: RANA MUHAMMAD BILAL

Upload: rana-mohammad-bilal

Post on 03-Jun-2018


Page 1: Stretching Performance Potentials Beyond Hardware Constraints

8/12/2019 Stretching Performance Potentials Beyond Hardware Constraints

http://slidepdf.com/reader/full/stretching-performance-potentials-beyond-hardware-constraints 1/30

INTEGRATED SELECTION, PARTITIONING AND PLACEMENT FRAMEWORK FOR RECONFIGURABLE ARCHITECTURES

By: RANA MUHAMMAD BILAL


PRESENTATION CONTENTS

INTRODUCTION

PROBLEM MODEL

PRIOR ART + CONTRIBUTIONS OF WORK

DESCRIPTION OF WORK OVERVIEW

GENETIC SELECTION ALGORITHM

RECURSIVE BACKTRACKING PARTITION ALGORITHM

GREEDY PARTITIONING ALGORITHM

PRIORITY PLACEMENT ALGORITHM

RESULTS


INTRODUCTION

Market demands on performance, design turnaround and size

Research interest in reconfigurable computing

90% of time spent in 10% of code [90/10 rule]

Port selected compute-intensive code blocks to hardware [Hot Areas]

[Figure: design to be implemented, split between General Purpose Processor (Software) and ASIC/FPGA (Hardware)]


INTRODUCTION

Port selected code to hardware (save size + time)… Can we reduce size further?

Dynamic Reconfiguration: reuse hardware over time! (+ size reduced, - reconfiguration cost)

Partial Dynamic Reconfiguration: tailor-cut hardware reuse along space (only reconfigure when feasible + needed)


INTRODUCTION

Partial Reconfiguration: communication network reconfiguration overhead

Tiled Partially Reconfigurable Systems [16]

Intelligent choice of bin sizes to compensate for reduced flexibility (Contribution: first algorithm for partitioning)

[Figure: two layouts, each showing a GPP next to the reconfigurable fabric]

[16] Markus Koester, Wayne Luk, Jens Hagemeyer, Mario Porrmann and Ulrich Rückert, "Design Optimizations for Tiled Partially Reconfigurable Systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, June 2011


Problem Model

Formulate Design Problem

• Identify Hot Areas, generate CIS Table

• Extract Loop Trace

• Available area on fabric

Tasks to do

• Selection (Choose Implementation Variants)

• Partitioning (Partition Reconfigurable Area in Bins)

• Placement (Assign Bins for the execution of Implementation Variants)

[Figure: GPP next to the reconfigurable fabric]

Circuit instantiated on a Tile can be Coprocessor/Custom Instruction (Contribution: Framework applicable to both models)

(Contribution: Integrated Solution for Selection/Partitioning/Placement)

• Reconfiguration and Execution can Overlap

• One Reconfiguration at a time


Prior Research Work

'Selection' Algorithms [9][10][19]

• Either specific to 'Coprocessor' model or 'Custom Instruction' model

• No consideration for multiple implementation alternatives

• Partial reconfiguration not supported; joint optimization with placement and partitioning not considered

'Placement' Algorithms [14][15]

• Communication overheads for partial reconfiguration neglected

• Multi-sized tiles/bins not supported

• Joint optimization with 'Selection' and 'Partitioning' not considered

No prior work on 'Partitioning' algorithm

Miaoqing Huang, Vikram K. Narayana, Mohamed Bakhouya, Jaafar Gaber, Tarek El-Ghazawi, "Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants," IEEE Transactions on Computers, Aug 2011

Honglei Han, Wenju Liu, Wu Jigang and Guiyuan Jiang, "Efficient Algorithm for Hardware/Software Partitioning and Scheduling on MPSoC," Journal of Computers, vol. 8, no. 1, January 2013


Designing on a TPR Architecture

for (i = 0; i < 1; i++)
{
    first_term = i * i;
    for (j = 0; j < 2; j++)
    {
        second_term = j * j;
        answer[i][j] = first_term - second_term;
    }
}

Hot Area 1

Implementation alternative 1:
-Execute on GPP
-Area requirement on fabric = 0 logic blocks
-Execution time = 4 clock cycles

Implementation alternative 2:
-Implement custom hardware to 'square'
-Area requirement on fabric = 1 logic block
-Execution time = 2 clock cycles

Hot Area 2

Implementation alternative 1:
-Execute on GPP
-Area requirement on fabric = 0 logic blocks
-Execution time = 7 clock cycles

Implementation alternative 2:
-Implement custom hardware to 'square' and subtract in GPP
-Area requirement on fabric = 2 logic blocks
-Execution time = 4 clock cycles

Implementation alternative 3:
-Implement custom hardware to 'square' and subtract
-Area requirement on fabric = 3 logic blocks
-Execution time = 3 clock cycles

Loop trace/Execution sequence: 1 2 2

CIS Table: entries A_ij and T_ij, indexed by i and j

Chromosome: 2 3 3 (each digit is one gene)
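As a rough illustration of how a chromosome maps onto the CIS data above, the sketch below sums the execution times and areas of the selected alternatives over the loop trace 1 2 2. It deliberately ignores reconfiguration overhead and the area limit of the fabric, so the time it returns is only a lower bound on the real fitness. The names `CIS_TIME`, `CIS_AREA`, `LOOP_TRACE`, `lower_bound_time` and `total_area` are illustrative and do not come from the slides.

```python
# CIS data for the two hot areas above: table[hot_area][alternative].
CIS_TIME = {
    1: {1: 4, 2: 2},          # clock cycles
    2: {1: 7, 2: 4, 3: 3},
}
CIS_AREA = {
    1: {1: 0, 2: 1},          # logic blocks
    2: {1: 0, 2: 2, 3: 3},
}
LOOP_TRACE = [1, 2, 2]        # hot area executed at each step


def lower_bound_time(chromosome):
    """Sum of execution times, ignoring reconfiguration and area limits."""
    return sum(CIS_TIME[area][gene] for area, gene in zip(LOOP_TRACE, chromosome))


def total_area(chromosome):
    """Area needed if every selected variant had its own space on the fabric."""
    return sum(CIS_AREA[area][gene] for area, gene in zip(LOOP_TRACE, chromosome))


if __name__ == "__main__":
    for chromosome in ([1, 1, 1], [2, 3, 3]):
        print(chromosome, lower_bound_time(chromosome), total_area(chromosome))
```

For the all-software chromosome 1 1 1 this gives 4 + 7 + 7 = 18 clock cycles and no fabric area.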


Overview of Framework

[Figure: framework flow diagram; one recoverable label reads "Generate new population"]


Step 1: Genetic Selection

P: Population Limit

Fitness: Execution Time (goal of the genetic optimizer is to minimize fitness)


Step 2: Partitioning 1 (Recursive Backtracking)

[Figure: backtracking tree for List = 3,2,1 and Goal = 3]

Backtracking (index, current)
    If index > length of list
        return
    for i from index to length of list
        If current + list[i] = goal
            candidate_solution ← add list[i]
            solutions ← add candidate_solution
            candidate_solution ← [ ]
            return
        If current + list[i] < goal
            candidate_solution ← add list[i]
            Backtracking (index + 1, current + list[i])

goal = Available Area

list = descending-order sorted list of all area requirements specified by the loop trace and the chromosome under consideration

index = entry number of 'list' under processing

current = cumulative sum of 'list' entries traversed in a particular thread.

stored in
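The backtracking idea can be sketched in Python as below. This is a minimal variant written under stated assumptions, not a transcription of the slide pseudocode: it treats each list entry as usable at most once, drops zero-area (software) entries, reports each distinct combination once, and keeps searching after the first hit. The function name `find_partitions` is hypothetical.

```python
def find_partitions(area_requirements, goal):
    """Return every combination of bin sizes that sums exactly to `goal`.

    `area_requirements` plays the role of 'list' and `goal` the available area.
    """
    # Descending order, as on the slide; zero-area entries need no bin.
    items = sorted((a for a in area_requirements if a > 0), reverse=True)
    solutions = []
    candidate = []

    def backtrack(index, current):
        for i in range(index, len(items)):
            # Skip equal values at the same depth to avoid duplicate solutions.
            if i > index and items[i] == items[i - 1]:
                continue
            total = current + items[i]
            if total > goal:
                continue              # this entry does not fit, try a smaller one
            candidate.append(items[i])
            if total == goal:
                solutions.append(list(candidate))
            else:
                backtrack(i + 1, total)
            candidate.pop()           # undo the choice before trying the next entry

    backtrack(0, 0)
    return solutions


if __name__ == "__main__":
    print(find_partitions([3, 2, 1], 3))
```

For the list 3,2,1 and goal 3 from the slide figure, the sketch yields the two partitionings [3] and [2, 1].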


Step 2: Partitioning 2 (Greedy Partitioning)


Step 3: Placement

Get Area model from partitioning algorithm
Repeat: until at end of chromosome
    Select next gene
    If: Corresponding implementation variant already placed
        Use same bin placement
    Else:
        Loop: through all empty bins
            Place in smallest bin satisfying area requirement
        If: not placed until this step
            Loop: through all filled bins
                Determine future_reuse_index* of placed implementation variant
            Place in smallest bin with smallest future_reuse_index satisfying area requirement

(*future_reuse_index: number of times same task type reoccurs with same implementation variant selection)
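The placement steps above can be sketched as follows. Several details are assumptions made only for this illustration: a variant is identified by the pair (task type, alternative), zero-area variants run on the GPP and get no bin, ties between filled bins are broken by future_reuse_index first and bin size second, and a variant that fits in no bin raises an error. The function name `place` and its parameters are hypothetical.

```python
def place(loop_trace, chromosome, area, bin_sizes):
    """Return the bin index used by each gene (None means it runs on the GPP).

    `area` maps (task_type, alternative) to the area it needs on the fabric.
    """
    tasks = list(zip(loop_trace, chromosome))
    bins = [None] * len(bin_sizes)          # variant currently held by each bin
    placement = []

    for pos, variant in enumerate(tasks):
        need = area[variant]
        if need == 0:                       # software alternative, no bin needed
            placement.append(None)
            continue
        if variant in bins:                 # already placed: use same bin
            placement.append(bins.index(variant))
            continue

        fits = [b for b, size in enumerate(bin_sizes) if size >= need]
        empty = [b for b in fits if bins[b] is None]
        if empty:
            chosen = min(empty, key=lambda b: bin_sizes[b])
        else:
            if not fits:
                raise ValueError(f"no bin is large enough for {variant}")

            def future_reuse_index(b):
                # How often the variant held by bin b recurs later in the trace.
                return tasks[pos + 1:].count(bins[b])

            chosen = min(fits, key=lambda b: (future_reuse_index(b), bin_sizes[b]))

        bins[chosen] = variant              # (re)configure the chosen bin
        placement.append(chosen)

    return placement


if __name__ == "__main__":
    unit_area = {(1, 2): 1, (2, 2): 1, (3, 2): 1}
    print(place([1, 2, 3, 1], [2, 2, 2, 2], unit_area, [1, 1]))
```

In the usage example, the third task evicts the variant in bin 1 because the variant in bin 0 is needed again at the end of the trace.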


Step 3: Placement


Example: Place_n_Partition


Example: Evolution

Chromosome   Fitness Value (Execution Time)
223          14 clock cycles
131          14 clock cycles
111          18 clock cycles
123          11 clock cycles

Next population

Chromosome   Fitness Value (Execution Time)
213          13 clock cycles
133          10 clock cycles

Elite: 1 chromosome with best fitness is passed unchanged to next generation as “Elite”

Crossover: choose two chromosomes at random for crossover; exchange genes around a randomly chosen position

1 3 1 and 1 2 3 give 1 2 1 and 1 3 3

Mutation: choose one chromosome at random for mutation; randomly choose a gene and assign a random value (within bounds) to it

2 2 3 becomes 2 1 3

Possible selection solutions for this simple problem: 2 x 3 x 3 = 18.

Combining with Partition & Placement: 72 points.

We explored only 8 points.

Execution time reduced from 18 clock cycles to 10 clock cycles.
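The three operators from this example (elite, crossover, mutation) can be sketched as below. The slides do not state how many offspring each operator contributes per generation, so the composition of the new population in `next_generation` is an assumption, as are all function names. The fitness values in `SLIDE_FITNESS` are the ones reported for the initial population and are used only to pick the elite.

```python
import random

BOUNDS = (2, 3, 3)   # number of implementation alternatives available per gene


def crossover(a, b, point):
    """Exchange genes of two chromosomes around position `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, pos, value):
    """Return a copy of `chromosome` with gene `pos` set to `value`."""
    return chromosome[:pos] + (value,) + chromosome[pos + 1:]


def next_generation(population, fitness, rng):
    """One generation: keep the elite, add two crossover children and one mutant."""
    elite = min(population, key=fitness)            # lowest execution time wins
    a, b = rng.sample(population, 2)
    child_a, child_b = crossover(a, b, rng.randrange(1, len(a)))
    victim = rng.choice(population)
    pos = rng.randrange(len(victim))
    mutant = mutate(victim, pos, rng.randint(1, BOUNDS[pos]))
    return [elite, child_a, child_b, mutant]


# Fitness values (clock cycles) reported on the slide for the first population.
SLIDE_FITNESS = {(2, 2, 3): 14, (1, 3, 1): 14, (1, 1, 1): 18, (1, 2, 3): 11}

if __name__ == "__main__":
    print(crossover((1, 3, 1), (1, 2, 3), 2))
    print(mutate((2, 2, 3), 1, 1))
    print(next_generation(list(SLIDE_FITNESS), SLIDE_FITNESS.get, random.Random(0)))
```

With the crossover point after the second gene, 1 3 1 and 1 2 3 produce 1 3 3 and 1 2 1, and mutating the middle gene of 2 2 3 to 1 produces 2 1 3, as in the example.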


Results

Percent difference of PU between DP and GA

α = 2      α = 3      α = 4         α = 5         α = 6
10.43841   16.32047   12.27786753   15.42699725   17.46641
4.428044   17.07921   26.28062361   8.516886931   15.74194
4.74934    13.38983   11.43867925   10.92150171   9.143519
12.88344   12.38318   11.30434783   9.322033898   19.22111
6.220096   6.487696   7.267144319   11.42533937   14.86291

[Chart: percent difference of PU between DP and GA for α = 2 to α = 6]


Test Case: Locality Sensitive Hashing


Locality Sensitive Hashing: Area Req

Virtex-6 has 8 Registers and 4 LUTs per Slice

Slices used = Max(Reg/8, LUT/4)
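The slice estimate above is a one-line computation. The sketch below rounds each quotient up to a whole slice, which is an assumption; the slide only gives Max(Reg/8, LUT/4). The function name `slices_used` is illustrative.

```python
import math

REGISTERS_PER_SLICE = 8   # Virtex-6, as stated above
LUTS_PER_SLICE = 4


def slices_used(registers, luts):
    """Estimate the slices a circuit occupies from its register and LUT counts."""
    return max(
        math.ceil(registers / REGISTERS_PER_SLICE),
        math.ceil(luts / LUTS_PER_SLICE),
    )


if __name__ == "__main__":
    # A circuit with 16 registers and 40 LUTs is LUT-bound.
    print(slices_used(16, 40))
```

Whichever resource runs out first determines the slice count, so a LUT-heavy circuit is sized by its LUTs even if most registers in those slices stay unused.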


Locality Sensitive Hashing: Loop Trace/CIS

1111112333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

3333333333333333333333333333333333333333331111112333333333333333333333333333333333333333333333333333333333333333333

3333333333333333333333333333333333333333333333333333333333333333333333333333333333331111112333333333333333333333333

3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

3333333333341111112333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

3333333333333333333333333333333333333333333333333333331111112333333333333333333333333333333333333333333333333333333

3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333331111112333333333333

3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

3333333333333333333333341111112333333333333333333333333333333333333333333333333333333333333333333333333333333333333

33333333333333333333333333333333333333333333333333333333333333333311111123333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333331111112

3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

333333333333333333333333333333333334

Loop Trace (Original): the digit sequence above

Substitutions

A = 111111
B = 2
C = 150 times 3
D = 4

Loop Trace

ABCABCABCDABCABCABCDABCABCABCD

CIS: Area Requirement (Slices)

0   117   236   833
0   136   Inf   Inf
0   58    233   925
0   161   Inf   Inf

CIS: Execution Time (Cycles)

1536    960     672     480
12      6       0       0
57600   38400   16800   12600
3       1       0       0
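One way to read the two CIS tables is sketched below. It assumes that the four rows correspond to the tasks A, B, C and D, that the four columns are implementation alternatives 1 to 4 (alternative 1 being software), and that Inf marks an alternative that does not exist. Under that reading the totals reproduce the software time, best time and best-CIS area quoted on the LSH: Solution slide, which supports the interpretation but does not prove it. All names in the code are illustrative.

```python
INF = float("inf")
TRACE = "ABCABCABCDABCABCABCDABCABCABCD"   # substituted loop trace

AREA = {   # slices, per task and alternative
    "A": [0, 117, 236, 833],
    "B": [0, 136, INF, INF],
    "C": [0, 58, 233, 925],
    "D": [0, 161, INF, INF],
}
TIME = {   # cycles, per task and alternative
    "A": [1536, 960, 672, 480],
    "B": [12, 6, 0, 0],
    "C": [57600, 38400, 16800, 12600],
    "D": [3, 1, 0, 0],
}


def fastest_alternative(task):
    """Index of the fastest alternative that actually exists (finite area)."""
    available = [k for k, a in enumerate(AREA[task]) if a != INF]
    return min(available, key=lambda k: TIME[task][k])


def software_time():
    return sum(TIME[t][0] for t in TRACE)


def best_time():
    return sum(TIME[t][fastest_alternative(t)] for t in TRACE)


def best_area():
    # Area if every trace entry kept its own copy of its fastest alternative.
    return sum(AREA[t][fastest_alternative(t)] for t in TRACE)


if __name__ == "__main__":
    print(software_time(), best_time(), best_area())
```

The large gap between the software total and the best total comes almost entirely from task C, which dominates the trace.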


LSH: Solution

1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1

1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3

1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4

1 1 4 4 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1

0 0 0 14148 0 15553 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 925 15073 0 16478 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1536 1548 15073 15553 16478 29078 30614 30626 43226 43229 44765 44777 57377 58913 58925 71525 73061 73073 85673 85676 87212 87224 99824 101360 101372 113972 115508 115520 128120

1536 1548 14148 15553 15565 29078 30614 30626 43226 43229 44765 44777 57377 58913 58925 71525 73061 73073 85673 85676 87212 87224 99824 101360 101372 113972 115508 115520 128120 128123

Partitioning: 1 bin of Size 925 Slices

Loop Trace

Selection

Placement

Reconfiguration Map

Execution Map

Software Execution Time = 532341 cycles

Best Time (without Reconfig, Best CIS) = 117777 cycles

Area Required for Best CIS = 17529 Slices

Achieved Time = 128123 cycles

Area used = 1028 Slices

Only 1.94 percent (of the software execution time) slower than the best possible execution time

Using 17 times less area!

Reconfig. Overhead: 1838 Cycles
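The headline figures on this slide can be checked with a few lines of arithmetic. The normalization is inferred rather than stated: the 1.94 percent figure is reproduced when the gap between the achieved and best times is divided by the software execution time, so the sketch uses that. The function name `summarize` and the dictionary keys are illustrative.

```python
def summarize(software, best, best_area, achieved, area_used):
    """Derive the comparison figures from the raw cycle and slice counts."""
    return {
        # Gap to the best time, as a percentage of the software time (inferred).
        "gap_percent": 100 * (achieved - best) / software,
        # How many times smaller the used area is than the best-CIS area.
        "area_ratio": best_area / area_used,
        # Speedup of the achieved solution over pure software execution.
        "speedup": software / achieved,
    }


if __name__ == "__main__":
    figures = summarize(
        software=532341, best=117777, best_area=17529,
        achieved=128123, area_used=1028,
    )
    for name, value in figures.items():
        print(f"{name}: {value:.2f}")
```

Dividing the same gap by the best time instead would give a noticeably larger percentage, which is why the choice of denominator matters when comparing against other reports.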


LSH: Solution

1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1

1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3

1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4

1 1 4 4 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1

Partitioning: 1 bin of Size 925 Slices

Loop Trace

Selection

Placement

Reconfiguration Map

Execution Map

Software Execution Time = 532341 cycles

Best Time (without Reconfig, Best CIS) = 117777 cycles

Area Required for Best CIS = 17529 Slices

Achieved Time = 127341 cycles

Area used = 925 Slices

Only 1.79 percent (of the software execution time) slower than the best possible execution time

Using 19 times less area!

Reconfig. Overhead: 0 Cycles

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 925 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1536 1548 14148 15684 15696 28296 29832 29844 42444 42447 43983 43995 56595 58131 58143 70743 72279 72291 84891 84894 86430 86442 99042 100578 100590 113190 114726 114738 127338

1536 1548 14148 15684 15696 28296 29832 29844 42444 42447 43983 43995 56595 58131 58143 70743 72279 72291 84891 84894 86430 86442 99042 100578 100590 113190 114726 114738 127338 127341


LSH: Results (Execution Time [Cycles])

Slices   Area Ratio   Software   Dynamic Programming   Genetic Algorithm
8765     α = 2        532341     125613                118392
5843     α = 3        532341     136320                119229
4382     α = 4        532341     131315                119341
3506     α = 5        532341     139991                119369
2922     α = 6        532341     145911                119776

[Chart: execution time in cycles (axis from 115000 to 150000) for α = 2 to α = 6, Dynamic Programming versus Genetic Algorithm]


QUESTIONS?  


Push to Previous Configuration, if feasible

 A physical_area

area

Selected CIS takes up entire available area in current Configuration

Area available in Current Configuration, but not feasible for CIS of previous tasks

[1] Dynamic Reconfig of CFU by T. Mitra, 2009c

Dynamic Programming (Ref)


Line 3: Flooring takes you to the end of the last complete configuration. If it is empty, there is no need to explore more configurations.

Line 7: Similar logic to Line 3: if all loops are done and there is an empty configuration, then end.

[1]Dynamic Reconfig of CFU by T. Mitra, 2009c

Dynamic Programming (Ref)


M M B M M B M M B M M M B M M B M M B M M M B M M B M M B M

1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4

1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1

Partitioning: 1 bin of Size 925 Slices

Loop Trace

Selection

Placement


[Figure: configurable fabric and virtual area under Dynamic Reconfiguration and Partial Dynamic Reconfiguration; labels: Configurable Fabric, Virtual Area, Reconfiguration Overhead]


[Figure: two layouts, each showing a GPP next to the reconfigurable fabric, one per system type]

Partial Reconfigurable System

• Any desired chunk of fabric may be reconfigured

• Communication network can’t be static

Tiled Reconfigurable System

• Pre-defined reconfigurable regions

• Communication network can be static