design space exploration for a coarse grain accelerator farhad mehdipour, hamid noori, morteza saheb...

34
Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour , Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University, Fukuoka, Japan *Amirkabir University of Technology

Upload: evan-hill

Post on 18-Jan-2016

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Design Space Exploration for a Coarse Grain Accelerator

Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami

Kyushu University, Fukuoka, Japan

*Amirkabir University of Technology

Page 2: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

2

OUTLINE

Introduction Problem Definition and Basic ConceptsHybrid DSE Approach for Designing RACCase study: Designing RAC for an Extensible

Processor Conclusion

Page 3: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

3

OUTLINE

Introduction Problem Definition and Basic ConceptsHybrid DSE Approach for Designing RACCase study: Designing RAC for an Extensible

Processor Conclusion

Page 4: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

4

Designing Embedded Systems

Embedded Microprocessors Application-Specific Integrated Circuits (ASICs) Application-Specific Instruction set Processors (ASIPs) Extensible Processors

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2LD/ST: Load / Store

CFU: Custom Functional Unit

Page 5: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

5

Extensible Processors

Goals Improving the performance and energy efficiency Maintaining compatibility and flexibility

Using a hardware is augmented to the base processor for accelerating frequently executed portions of applications

Accelerator implementations custom hardware (such as ASIP or Extensible Processors) reconfigurable fine/coarse grain hw

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2LD/ST: Load / Store

CFU: Custom Functional Unit

Page 6: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

6

Custom Instructions

Instruction set customization hardware/software partitioning (Identifying critical segments in applications)

Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU)

Critical segments Most frequently executed portions of the applications

1: SUBU R3, R0, R32: ADDU R10, R0, R03: SRA R8, R10, 0x34: SLT R2, R3, R85: BNE R0,400488, R2

ADDU

SRA

SLT

SUBU

BNE

R3R0 R0R0

R10

R8

R2

R2

R30x3

400488

2

3

4

5

1A Custom Instruction

A CI can be represented as a DFG

Page 7: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

7

Reconfigurable Processors

Adding and generating custom instructions after fabrication

Using a reconfigurable functional unit (RFU) instead of custom functional unit

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2RFU Config Mem

CFU: Custom Functional Unit

RFU: Reconfigurable Functional Unit

Page 8: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

8

Baseline Processor

ALU

Register File

RAC...

...

...

ConfigurationMemory

How a Reconfigurable Processor Works

GPP: General Purpose Processor

RAC: Reconfigurable Accelerator

400680 subiu $25,$25,1400688 lbu $13,0($7)400690 lbu $2,0($4)400698 sll $2,$2,0x184006a0 sra $14,$2,0x184006a8 addiu $4,$4,14006b0 srl $8,$2,0x1c4006b8 sll $2,$8,0x24006c0 addu $2,$2,$254006c8 lw $2,0($2)4006d0 xori $13,$13,14006d8 addu $10,$10,$2400680 subiu $25,$25,1400698 sll $2,$2,0x184006a0 sra $14,$2,0x18400688 lbu $13,0($7)4006e0 bgez $10,4006f0 ...

Hot Basic Block

Reconfigurable Processor

Page 9: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

9

OUTLINE

Introduction Problem Definition and Basic ConceptsHybrid DSE Approach for Designing RACCase study: Designing RAC for an Extensible

Processor Conclusion

Page 10: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

10

Designing a Reconfigurable Accelerator (RAC)

Multitude of design parameters e.g. number of functional units, input/output ports, type of

functional units and etc.

Several design parameters high complexity of the RAC design the requirements for a methodological approach

A major challenge finding the right balance between the different quality

requirements (e.g. speedup, area, energy consumption)

Page 11: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

11

Traditional Design Process

Describing a reference model Verifying the model functional correctness Obtaining a rough estimates of performance Manual or semi-automatic generation of several

alternative designs Choosing the most suitable design based on various

performance metrics

Page 12: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

12

Design Space Exploration

Design Space Exploration (DSE) the process of analyzing several “functionally equivalent”

implementation alternatives to identify an optimal solution Can become too computationally expensive

Example: a design with four tasks on an architecture with three

processing modules and each have four possible configurations results in 500 design alternatives

Our Approach: Hybrid (Analytical & Quantitative) approach which drastically reduces the design time & effort

Page 13: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

13

Assumptions

RAC a matrix of FUs the width/height equal to w/h basically has a combinational logic

FUs in RAC are fully connected Except the lack of connections

from lower to upper rowsRAC Input Ports

RAC Output Ports

22FU

1

1FU

21FU

1

2FU Row1

FU FU FU FU

Col1

Row5

...

...

h

w

Basic Elements Functional Units (logic

resources) Multiplexers (routing

resources)

Page 14: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

14

Assumptions

CIs (DFGs) are mapped onto the RAC and executed at runtime

1: SUBU R3, R0, R32: ADDU R10, R0, R03: SRA R8, R10, 0x34: SLT R2, R3, R85: BNE R0,400488, R2

ADDU

SRA

SLT

SUBU

BNE

R3R0 R0R0

R10

R8

R2

R2

R30x3

400488

2

3

4

5

1A Custom Instruction

Data Flow Graph

2

3

4

1

RAC Initial Architecture

5

Page 15: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

15

OUTLINE

Introduction Problem Definition and Basic ConceptsHybrid DSE Approach for Designing RACCase study: Designing a RAC for an Extensible

Processor Conclusion

Page 16: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

16

The Problem

Determining a RAC   specification while optimizing Speedup & Area

Design parameters the RAC’s width and height the number of FUs, the number of input and output ports

Speedup

CCyclesOnRATotalClock

ocessorseCyclesOnBaTotalClockSpeedup

Pr

Page 17: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

17

Design Methodology- Tool Chain

Our Approach: A Hybrid (Analytical & Quantitative) DSE Approach

CI ExtractorCIs

(DFGs)

CI (DFG)Analyzer

Required Informationfor DSE

Design SpaceExploration

Accelerator FinalSpecification

Adjusting RACArchitecure

Specification ofAccelerator

Application

Page 18: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

18

Problem Formulation

the fraction of all DFGs with the width of i and height of j (DFG(i,j))

ij

W

i

H

j

ij CPUCCCPUCC

1 1

)()(

jiFUCPUCC ij

ij )()(

2

4

6

1 3

5

2

4

5

1 3

%743 : percentage of execution time concerns to all DFGs with the

width of 4 and height of 3 is 7%.

533 63

3

Average number of instructions in all DFG(i,j)s

The number of CCs required for executing DFG(i,j) on the base processor

Page 19: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

19

Problem Formulation

ij

W

i

wh

H

j

ij

wh RACCCRACCC

1 1

)()(

When one or both dimensions of a DFGs are greater than RAC’s dimensions:

Temporal Partitioning: Divides a DFG into time exclusive smaller DFGs

Reconfiguration overhead time for loading subsequent partitions of a DFG from the configuration memory onto the RAC

the number of Clock Cycles for executing DFG(i,j) on a RAC (w,h)

Page 20: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

20

Problem Formulation

hjwi ,)3hjwi ,)2

)()( wh

wh

ij RACDRACCC hjwi ,)1

hjwi ,)4

RACh

w

RAC

DFG

RAC

h

w

RAC

DFGh

w

RAC

DFG

Page 21: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

21

Delay of RAC

wkMUXDFUDRACDh

i

ki

h

i

ki

wh ,...,1,)()()(

1

11

)( kjMUXD = delay of mux(

kjs2 to 1)

kjmk

js 2log )1()1( wwjmkjRAC Input Ports

RAC Output Ports

22FU

1

1FU

21FU

1

2FU Row1

FU FU FU FU

Col1

Row5

...

...

h

wRAC’s Delay Path

Delay of FU(k,i) Delay of MUX(k,i)

Delay of RAC(w,h)

Page 22: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

22

Optimization Problem

Objective: Maximize( )

ij

W

i

wh

H

j

ij

ij

W

i

H

j

ij

wh

wh

RACCC

CPUCC

RACCC

CPUCCRACS

1 1

1 1

)(

)(

)(

)()(

)( whRACS

Page 23: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

23

Area of RAC

= delay of mux(kjs2 to 1)

w

i

h

j

ij

w

i

h

j

ij

wh MUXAFUARACA

1

1

11 1

)(*2)()(

)( ijMUXA

RAC Input Ports

RAC Output Ports

2

2FU

1

1FU

2

1FU

1

2FU Row1

FU FU FU FU

Col1

Row5

...

...

h

w

Area is a secondary optimization parameter

Page 24: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

24

OUTLINE

Introduction Problem Definition and Basic ConceptsHybrid DSE Approach for Designing RACCase study: Designing a RAC for an Extensible

Processor Conclusion

Page 25: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

25

Designing RAC for a Reconfigurable Processor

AMBER: a reconfigurable processor targeted for embedded systems Main components

a base processor (general RISC processor) Sequencer and a coarse-grained reconfigurable functional unit (RFU)

Register File

ID/EXE Reg

RFU

ConfigurationMemory

ALUs

MUX SequencerSequencer

Table

EXE/MEM Reg

GPP Augmented HW

AMBER’s Architecture

Page 26: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

26

Quantitative Approach

22 applications of Mibench were attempted Applications were executed on Simplescalar and profiled Hot segments and DFGs were extracted from the applications No limitation in the RFU’s initial architecture DFGs were mapped on the initial RFU architecture

Specification of the designed architecture for AMBER’s RFU (Quantitative Approach)

Page 27: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

27

Hybrid DSE Approach

FU and various size multiplexers Synthesized using Hitachi 0.18um Measuring delay and area

DFGs are analyzed to extract required information quantitatively

Reconfiguration penalty time: 1 clock cycle

The base processor clock frequency: 166MHz

Page 28: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

28

Speedup Evaluation

14

710

1316

191 4 7

10

13

16

19

0

0.5

1

1.5

2

2.5

Speedup

Width

Height

0-0.5 0.5-1 1-1.5 1.5-2 2-2.5

Increasing RAC’s width more parallelism

In the widths larger than 6:• negative effect of growing the number of muxes and their sizes• no more speedup achievable

Increasing RAC’s Height • longer delay• speedup declines and• area increases

Page 29: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

29

Effect of the Base Processor Clock Frequency

0

0.5

1

1.5

2

2.5

0 200 400 600 800 1000

Base Processor Clock Freq. (MHz)

Max

. Sp

eed

up

(6,4

)

450(5

,3)

(6,4

)(5

,4)

(5,4

)

(5,4

)

(5,2

)

(6,1

)

• Increasing clock frequency Reduction in the maximum achievable speedup

• no more speedup in the clock frequencies more than 450MHz

Page 30: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

30

Effect of the Reconfiguration Overhead Time

By increasing the reconfiguration overhead time

• The maximum achievable speedup degrades

• Height of RAC grows and longer DFGs are mappable

0

0.5

1

1.5

2

2.5

3

3.5

0 2 4 6 8 10 12

Reconfiguration Penalty (Clock Cycles)

Sp

eed

up

(6,3

)

(6,4

)

(6,4

)

(6,4

)

(6,5

)

(6,5

)

(6,6

)

Page 31: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

31

Comparison

Design Method

Design Time & Effort

Basic Design Parameters

Flexibility*

Clark et al. Quantitative

&

Synthesis

High Mapping rate Low

Ours (previous) Quantitative High Mapping rate Low

Yehia et al. Synthesis &

Simulation

Very High No. of operations,

inputs/outputsLow

Ours

(current)

Hybrid Low Speedup & Area

High

Page 32: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

32

OUTLINE

Introduction Problem Definition and Basic ConceptsHybrid DSE Approach for Designing RACCase study: Designing a RAC for an Extensible

Processor Conclusion

Page 33: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

33

CONCLUSION

Hybrid DSE approach

Uses realistic data from the attempted applications

Substantially reduces design time and designer efforts

Can be used for shrinking a large design space

Easily extendable to apply new design parameters

More suitable where the new applications are added

Page 34: Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Thanks for your attention!