Asia and South Pacific Design Automation Conference, Taipei, Taiwan R.O.C., January 21, 2010

A High-Level Synthesis Flow for Custom Instruction Set Extensions for Application-Specific Processors. Asia and South Pacific Design Automation Conference, Taipei, Taiwan R.O.C., January 21, 2010. Nagaraju Pothineni (Google, India), Philip Brisk (UC Riverside), Paolo Ienne (EPFL), Anshul Kumar (IIT Delhi), Kolin Paul (IIT Delhi).



Page 1: Asia and South Pacific Design Automation Conference  Taipei, Taiwan R.O.C.January 21, 2010

A High-Level Synthesis Flow for Custom Instruction Set Extensions for Application-Specific Processors

Asia and South Pacific Design Automation Conference

Taipei, Taiwan R.O.C. January 21, 2010

Nagaraju Pothineni, Google, India

Philip Brisk, UC Riverside

Paolo Ienne, EPFL

Anshul Kumar, IIT Delhi

Kolin Paul, IIT Delhi

Page 2

Extensible Processors

[Figure: processor pipeline (Fetch, Decode, Execute, Memory, Write-back) with I$, RF, and D$, extended with an ISE (Instruction Set Extensions) unit. A compiler turns applications into assembly code with ISEs.]

Page 3

Compilation Flow

[Figure: flow diagram. Blocks and artifacts shown:]

• Inputs: application source code; architecture description
• Source-to-source compiler
– High-level program optimizations
– ISE identification
– Rewrite source code with ISEs
• Intermediate artifacts
– Optimized source code rewritten with ISE calls
– Behavioral description of n ISEs (G1 … Gn)
• Software side: target-specific compiler, assembler, linker
– Machine code with ISE calls
• Hardware side: ISE synthesizer, processor generator
– Behavioral HDL description of n ISEs
– Structural HDL description of the processor and n ISEs

Page 4

ISE Synthesis Flow

• Inputs: ISEs G1 … Gn; RF I/O ports (R read, W write); clock period constraint
• Steps:
– I/O-constrained scheduling
– Reschedule to reduce area
– Decompose each ISE into 1-cycle ops
– Resource allocation and binding
• Check: if the clock period is below the constraint, done; otherwise iterate

Page 5

I/O-constrained Scheduling

[Figure: an ISE with operations A to F, before and after scheduling against the RF read ports (RD1, RD2) and write port (Wr1).]

• I/O ports are a resource constraint
– Resource-constrained scheduling is NP-complete
– Optimal algorithm [Pozzi and Ienne, CASES ’05]
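To make the port constraint concrete, here is a minimal sketch of a greedy list scheduler that delays an operation whenever the register file has no free read or write port in the candidate cycle. It is not the optimal algorithm of [Pozzi and Ienne, CASES ’05]; the function name, the one-cycle-per-operation assumption, and the example graph are all hypothetical.

```python
from collections import defaultdict

def io_constrained_schedule(ops, deps, reads, writes, read_ports, write_ports):
    """Greedy list scheduling under register-file (RF) port limits.

    ops    : operation names in topological order
    deps   : {op: [predecessor ops]}
    reads  : {op: number of RF reads issued by op}
    writes : {op: number of RF writes issued by op}
    Returns {op: cycle}. Assumption: every op takes exactly one cycle.
    """
    used_r = defaultdict(int)  # read ports already claimed in each cycle
    used_w = defaultdict(int)  # write ports already claimed in each cycle
    cycle_of = {}
    for op in ops:
        r, w = reads.get(op, 0), writes.get(op, 0)
        if r > read_ports or w > write_ports:
            raise ValueError(f"{op} needs more ports than the RF provides")
        # Earliest cycle that respects data dependencies.
        c = max((cycle_of[p] + 1 for p in deps.get(op, [])), default=0)
        # Delay the op until enough ports are free in its cycle.
        while used_r[c] + r > read_ports or used_w[c] + w > write_ports:
            c += 1
        cycle_of[op] = c
        used_r[c] += r
        used_w[c] += w
    return cycle_of

# Hypothetical ISE: A, B, C each read one register; F writes the result.
deps = {"D": ["A", "B"], "E": ["C"], "F": ["D", "E"]}
reads = {"A": 1, "B": 1, "C": 1}
writes = {"F": 1}
schedule = io_constrained_schedule("ABCDEF", deps, reads, writes,
                                   read_ports=2, write_ports=1)
print(schedule)
```

With only two read ports, the third input read (C) slips to the next cycle, which lengthens the schedule by one cycle compared with an unconstrained RF.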


Page 7

Reschedule to Reduce Area

• Goal:
– Minimize area
• Constraints:
– Do not increase latency or clock period
– I/O constraints
• Implementation:
– Simulated annealing (details in the paper)

[Figure: example in which rescheduling reduces the datapath from 2 adders and 1 multiplier to 1 adder and 1 multiplier.]
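The rescheduling idea can be sketched with a small simulated-annealing loop. This is an illustration only: the paper's actual cost function and move set are not given in the slides, the I/O constraints are omitted, and every name below (`area`, `valid`, `reschedule`, the unit areas) is a hypothetical choice. Area is approximated as the peak number of units of each kind needed in any one cycle.

```python
import math
import random

def area(schedule, kind, unit_area):
    """Units of each kind needed = peak number of ops of that kind in one cycle."""
    per_cycle = {}
    for op, c in schedule.items():
        key = (kind[op], c)
        per_cycle[key] = per_cycle.get(key, 0) + 1
    need = {}
    for (k, _), n in per_cycle.items():
        need[k] = max(need.get(k, 0), n)
    return sum(unit_area[k] * n for k, n in need.items())

def valid(schedule, deps, latency):
    """Latency must not grow and every op must follow its predecessors."""
    in_range = all(0 <= c < latency for c in schedule.values())
    ordered = all(schedule[p] < schedule[op]
                  for op, preds in deps.items() for p in preds)
    return in_range and ordered

def reschedule(schedule, deps, kind, unit_area, latency,
               steps=2000, t0=10.0, seed=0):
    rng = random.Random(seed)
    cur, cur_a = dict(schedule), area(schedule, kind, unit_area)
    best, best_a = dict(cur), cur_a
    ops = sorted(cur)
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9      # linear cooling
        cand = dict(cur)
        cand[rng.choice(ops)] = rng.randrange(latency)  # move one op
        if not valid(cand, deps, latency):
            continue
        a = area(cand, kind, unit_area)
        # Always accept improvements; accept worse moves with Boltzmann odds.
        if a <= cur_a or rng.random() < math.exp((cur_a - a) / t):
            cur, cur_a = cand, a
            if a < best_a:
                best, best_a = dict(cand), a
    return best, best_a

# Hypothetical example mirroring the slide: 2 adders + 1 multiplier at first.
KIND = {"a1": "add", "a2": "add", "m": "mul"}
UNIT_AREA = {"add": 1.0, "mul": 4.0}
DEPS = {"m": ["a1"]}
start = {"a1": 0, "a2": 0, "m": 1}
best, best_area = reschedule(start, DEPS, KIND, UNIT_AREA, latency=2)
print(area(start, KIND, UNIT_AREA), "->", best_area)
```

Moving the independent addition `a2` to the second cycle lets both additions use one adder without changing the two-cycle latency.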


Page 9

Decompose each ISE into Single-Cycle Operations

[Figure: an example with operations A to E in three panels: ISE, After Scheduling, 1-cycle Ops.]

Page 10

Decomposition Facilitates Resource Sharing within an ISE

[Figure: an ISE with operations A to E, decomposed into 1-cycle ops whose operations can share resources.]


Page 12

Resource Allocation and Binding

• Example: two 1-cycle ISEs
• Graph theory
– Maximum-cost weighted common isomorphic subgraph problem: NP-complete
– Weighted minimum-cost common supergraph (WMCS) problem: NP-complete
• VLSI
– 2-input operation (multiplexers required): ?
• Which solution is better?
– Minimum cost common supergraph: requires a multiplexer
– Higher cost common supergraph: no multiplexers needed
– Depends on the cost of the multiplexers compared to the merged operations!
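The trade-off on this slide can be written as a simple inequality. As a sketch, assume that merging lets two operations share one functional unit of area $A_{\text{op}}$ at the price of $k$ added multiplexers of area $A_{\text{mux}}(2)$ each. The symbols $A_{\text{op}}$ and $k$ are introduced here for illustration only; $A_{\text{mux}}(2)$ follows the notation of the compatibility-graph example.

```latex
\[
\underbrace{2\,A_{\text{op}}}_{\text{no sharing}}
\;>\;
\underbrace{A_{\text{op}} + k\,A_{\text{mux}}(2)}_{\text{shared unit plus multiplexers}}
\quad\Longleftrightarrow\quad
A_{\text{op}} \;>\; k\,A_{\text{mux}}(2)
\]
```

Under these assumptions, sharing pays off only when the shared operator is larger than the multiplexers it requires, so for small operators the higher cost supergraph without multiplexers may be the cheaper circuit.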

Page 13

Datapath Merging (DPM)

• Old problem formulation
– Based on WMCS problem (graph theory)
  • NP-complete [Bunke et al., ICIP ’02]
– Share as many operations/interconnects as possible
  • [Moreano et al., TCAD ’05; de Souza et al., JEA ’05]
– Optimize port assignment as a post-processing step
  • NP-complete [Pangrle, TCAD ’91]
  • Heuristic [Chen and Cong, ASPDAC ’04]

• Contribution: New problem formulation
– Accounts for multiplexer cost and port assignment

Page 14

New DPM Algorithms

• ILP Formulation
– See the paper for details

• Reduction to Max-Clique Problem
– Extends [Moreano et al., TCAD ’05]
  • Solve Max-Clique problem optimally using “Cliquer”
– Identify isomorphic subgraphs up-front
  • Merge isomorphic subgraphs rather than vertices/edges
  • (Details in the paper)

Page 15

Example

[Figure: new ISE fragment to merge (vertices v1, v2, v3; edges e1, e2) next to the partial merged datapath (vertices v4, v5, v6; edges e3, e4), with candidate new resources r1, r2, r3.]

Vertex mappings:
– Map v1 onto v4
– Allocate a new resource r1; map v1 onto r1

Edge mappings:
– e1 could map onto: e3, (v4, r3), (r1, v5), (r1, r3)
– Must be compatible with vertex mappings!

Page 16

Compatibility

– Vertex/vertex compatibility
– Vertex/edge compatibility

[Figure: a compatible (Yes) and an incompatible (No) example for each of the two cases.]

Page 17

Why Edge Mappings?

– Allocate an edge in the merged datapath
– May require a multiplexer

Page 18

Port Assignment

– Deterministic for non-commutative operators
– NP-complete for every commutative operator
  • [Pangrle, TCAD ’91]

[Figure: a commutative operator with left (L) and right (R) input ports under two port assignments, one labeled "We want this!" and the other "No!".]
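A small sketch may help show why port assignment matters for a shared commutative operator. Each merged operation supplies a pair of operand sources, and swapping a pair is legal because the operator is commutative. The code below tries every orientation and counts distinct sources per port as a stand-in for multiplexer size. The exhaustive search, the cost measure, and the name `best_port_assignment` are assumptions for illustration, not the method of the paper.

```python
from itertools import product

def best_port_assignment(operand_pairs):
    """Exhaustively pick operand orientations for one shared commutative unit.

    operand_pairs: one (source_a, source_b) pair per merged operation.
    Cost = distinct sources on the left port + distinct sources on the right
    port, which approximates the total number of multiplexer inputs.
    Returns (cost, flips), where flips[i] is True if pair i is swapped.
    Exponential in the number of pairs, so suitable for tiny examples only.
    """
    best = None
    for flips in product((False, True), repeat=len(operand_pairs)):
        left, right = set(), set()
        for (a, b), flip in zip(operand_pairs, flips):
            l, r = (b, a) if flip else (a, b)
            left.add(l)
            right.add(r)
        cost = len(left) + len(right)
        if best is None or cost < best[0]:
            best = (cost, flips)
    return best

# Two additions computing x + y and y + x share one adder.
cost, flips = best_port_assignment([("x", "y"), ("y", "x")])
print(cost, flips)
```

Swapping the operands of the second addition leaves a single source on each port, so no multiplexer is needed; the unswapped assignment would place two sources on each port.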

Page 19

Edge Mappings = Port Assignment

[Figure: the new ISE fragment (v1, v2, v3; e1, e2) and the partial merged datapath (v4, v5, v6; e3, e4), with a commutative operator.]

– Mapping e1: (v4, v6, L)
– Mapping e1: (v4, v6, R)

Page 20

Compatibility Graph

• Vertices correspond to mappings
– Vertex mappings
  • Weight is 0 for vertex → vertex
  • Weight is resource cost for vertex → new resource
– Edge mappings, including port assignment
  • Weight is 0 if edge exists in merged datapath
  • Weight is estimated cost of increasing mux size by +1 otherwise

• Place edges between compatible mappings
– Each max-clique corresponds to a complete binding solution
– Goal is to find max-clique of minimum weight
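The construction above can be sketched on a toy instance. The talk solves the clique problem with “Cliquer”; the sketch below substitutes a brute-force search that only works for a handful of mappings. The tuple encoding of mappings, the compatibility rules, and the numeric areas are simplifying assumptions made here, not the paper's definitions.

```python
from itertools import combinations

A_OP, A_MUX2 = 10, 3                 # hypothetical areas
EDGE_ENDS = {"e1": ("v1", "v3")}     # e1 runs from v1 to v3 in the new fragment

# Mapping -> weight. ("V", src, target) or ("E", src_edge, (from, to, port)).
WEIGHTS = {
    ("V", "v1", "v4"): 0,            # reuse existing vertex
    ("V", "v1", "r1"): A_OP,         # allocate new resource
    ("V", "v3", "v6"): 0,
    ("V", "v3", "r3"): A_OP,
    ("E", "e1", ("v4", "v6", "L")): 0,        # edge already in merged datapath
    ("E", "e1", ("v4", "v6", "R")): A_MUX2,   # new input on an existing port
    ("E", "e1", ("r1", "v6", "L")): A_MUX2,
    ("E", "e1", ("v4", "r3", "L")): 0,        # first edge into a new resource
    ("E", "e1", ("r1", "r3", "L")): 0,
}

def compatible(m1, m2):
    if m1[1] == m2[1]:
        return False                 # one source element, two mappings
    kinds = (m1[0], m2[0])
    if kinds == ("V", "V"):
        return m1[2] != m2[2]        # two ops of one ISE cannot share a unit
    if kinds == ("E", "E"):
        return m1[2][1:] != m2[2][1:]  # two edges cannot feed the same port
    v, e = (m1, m2) if m1[0] == "V" else (m2, m1)
    src, dst = EDGE_ENDS[e[1]]
    t_src, t_dst, _port = e[2]
    if v[1] == src:
        return v[2] == t_src         # edge must start where its source maps
    if v[1] == dst:
        return v[2] == t_dst         # edge must end where its sink maps
    return True

def min_weight_max_clique(weights, is_compatible):
    """Brute force: the largest cliques first, then the lightest among them."""
    nodes = list(weights)
    for size in range(len(nodes), 0, -1):
        cliques = [c for c in combinations(nodes, size)
                   if all(is_compatible(a, b) for a, b in combinations(c, 2))]
        if cliques:
            return min(cliques, key=lambda c: sum(weights[m] for m in c))
    return ()

best = min_weight_max_clique(WEIGHTS, compatible)
print(best, sum(WEIGHTS[m] for m in best))
```

In this instance the lightest maximum clique maps v1 onto v4, v3 onto v6, and e1 onto the existing edge at the left port, for a total added cost of zero.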

Page 21

Compatibility Graph

[Figure: compatibility graph for merging the new ISE fragment (v1, v2, v3; e1, e2) into the partial merged datapath (v4, v5, v6; e3, e4) with new resources r1, r2, r3.]

Nodes (mappings) and weights:
– fV(v1) = v4: w = 0
– fV(v1) = r1: w = A( )
– fV(v2) = r2: w = A( )
– fV(v3) = v6: w = 0
– fV(v3) = r3: w = A( )
– fE(e1) = (v4, v6, L): w = 0
– fE(e1) = (v4, v6, R), (r1, v6, L), (r1, v6, R): w = Amux(2)
– fE(e1) = (v4, r3, L), (v4, r3, R), (r1, r3, L), (r1, r3, R): w = 0
– fE(e2) = (r2, v6, L), (r2, v6, R): w = Amux(2)
– fE(e2) = (r2, r3, L), (r2, r3, R): w = 0

Two highlighted solutions:
– v1 merged with v4, v2 on r2, v3 on r3; edges e1, e2, e3, e4 kept separate
– v1 merged with v4, v2 on r2, v3 merged with v6; e1 merged with e3; e2 and e4 kept separate

Page 22

Experimental Setup

• Internally developed research compiler
– 1-cycle ISEs [Atasu et al., DAC ’03]
– RF has 2 read ports, 1 write port
– Standard cell design flow, 0.18 µm technology node

• Five DPM algorithms
– Baseline: no resource sharing
– ILP (Optimal) [This paper]
– Our heuristic* [This paper]
– Moreano’s heuristic* [Moreano et al., TCAD ’05]
– Brisk’s heuristic [Brisk et al., DAC ’04]

* Max-cliques found by “Cliquer”

Page 23

1-cycle ISE Area Savings

Chart callouts:
– Moreano’s heuristic is sometimes competitive!
– Brisk’s heuristic performed as well as ours for one benchmark!
– Moreano’s heuristic is not always competitive!
– Brisk’s heuristic was NOT competitive for three benchmarks!
– Brisk’s heuristic outperformed Moreano’s for three benchmarks!

Page 24

Critical Path Delay Increase

[Figure: critical path delay increase (%) for our heuristic, y-axis from 0 to 2, over the benchmarks adpcm jpeg gsm encode des sha aes.]

Page 25

Runtimes

• Baseline: 0

• Optimal (ILP): 3-8 hours

• Our heuristic: 2-10 minutes

• Moreano’s heuristic: ~1 minute

• Brisk’s heuristic: < 5-10 seconds

Page 26

Experimental Setup

• Internally developed research compiler
– Multi-cycle ISEs [Pozzi and Ienne, CASES ’05]
– RF has 5 read ports, 2 write ports
– Standard cell design flow, 0.18 µm technology node

• Four versions of our flow (clock constraint, ISE type)
– Single-cycle ISE (none, 1-cycle)
– Full flow (200 MHz, multi-cycle)
– No rescheduling (200 MHz, multi-cycle)
– Baseline flow (200 MHz, multi-cycle); resource sharing and binding step disabled

Page 27

Single vs. Multi-cycle ISEs

[Figure: area savings (%), 0% to 100%, with the baseline flow (200 MHz) at 0%, for ADPCM, AES, DES, GSM, H263, IDCT, JPEG, SHA. Series: single-cycle ISE (no clock period constraint); full flow (200 MHz). Callouts: cost of extra registers; resource sharing across multiple cycles of the same ISE.]

Page 28

Impact of Rescheduling

[Figure: normalized area, 0.00 to 1.40, for ADPCM, AES, DES, GSM, H.263, IDCT, JPEG, SHA. Series: single-cycle ISE (no clock period constraint); full flow (200 MHz); no rescheduling (200 MHz).]

Page 29

Conclusion

• HLS Flow for ISEs
– RF I/O constraints
  • Min-latency scheduling is NP-complete
– Requires two scheduling steps
  • Rescheduling is important for area reduction

• Resource allocation and binding
– Modeled as a datapath merging problem
– New problem formulation
  • Multiplexer cost
  • Port assignment