NC STATE UNIVERSITY

FabScalar

Center for Efficient, Scalable and Reliable Computing (CESR)
Department of Electrical and Computer Engineering
North Carolina State University

Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Sandeep S. Navada, Hashem H. Najaf-abadi, Eric Rotenberg



Eric Rotenberg © 2009, WARP’09, 6/20/09

High-Performance Superscalar Processor

Generic pipeline configuration
↑ Good performance on wide range of applications
↓ Not highest-performing for any given application
↓ Power inefficient


Application-Specific Superscalar Processor

[Figure: App. X on a generic superscalar processor versus App. X on an application-specific superscalar processor.]


Propagation Delay

2-way to 4-way:
– Increase sizes of ILP-extracting units to expose and exploit more ILP
– Hide increase in propagation delays with deeper pipelining
– Except: worsened propagation delays are not hidden for inter-instruction dependences

[Figure: 2-way superscalar versus 4-way superscalar. Labels: dependencies, independencies; propagation delay (ns); Execution Time of App. 1 and App. 2 on the 2-way and 4-way designs.]
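The trade-off on this slide can be illustrated with a toy execution-time model. This is only a sketch under stated assumptions: the function name, the split into "independent" and "dependent" instructions, and every number below are hypothetical and are not taken from the talk. It assumes deeper pipelining keeps the clock period unchanged when going from 2-way to 4-way, while the delay seen between dependent instructions grows.

```python
def execution_time_ns(n_independent, n_dependent, width, clock_ns, dep_delay_ns):
    """Toy model (hypothetical): independent work scales with width,
    dependent work is serialized and pays the un-hidden propagation delay."""
    parallel_time = (n_independent / width) * clock_ns
    serial_time = n_dependent * dep_delay_ns
    return parallel_time + serial_time


# Hypothetical designs: same clock period, but the 4-way design has a longer
# delay between dependent instructions because its structures are larger.
TWO_WAY = dict(width=2, clock_ns=0.5, dep_delay_ns=0.5)
FOUR_WAY = dict(width=4, clock_ns=0.5, dep_delay_ns=0.8)

# Hypothetical applications: (independent instructions, dependent instructions).
APP_1 = (900, 100)   # mostly independent work
APP_2 = (200, 800)   # dominated by dependence chains

for name, (indep, dep) in (("App. 1", APP_1), ("App. 2", APP_2)):
    t2 = execution_time_ns(indep, dep, **TWO_WAY)
    t4 = execution_time_ns(indep, dep, **FOUR_WAY)
    print(f"{name}: 2-way = {t2} ns, 4-way = {t4} ns")
```

With these made-up numbers the application with mostly independent work runs faster on the 4-way design, while the dependence-bound application runs faster on the 2-way design, which is the qualitative point of the slide.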


Heterogeneous Multi-core

[Figure: cores labeled App. 1, App. 2, ..., App. N]

Customize each core to an application, class of application, or class of application behavior.


Challenge

Customization captures interplay between program, microarchitecture, and technology
• Need real superscalar designs … and need many of them

Need to try out many real superscalar designs.

Need tool for automatically composing physical designs of arbitrary superscalar processors.


Target both R & D

Research: High-fidelity designs improve discovery

Development: Designs should be product strength


Canonical Superscalar Processor

Different superscalar processors have the same canonical pipeline stages:
Fetch | Decode | Rename | Dispatch | Issue | Reg. Read | Execute | Load/Store Unit | Writeback | Retire

Their canonical stages differ in terms of:
• Complexity
  – Width, i.e., number of superscalar “ways”
  – Sizes of stage-specific structures
• Sub-pipelining
  – How deeply pipelined a canonical stage is


FabScalar

1) Define composable interfaces of canonical pipeline stages, so that they can be stitched together to compose an overall superscalar processor.

2) Pre-design multiple versions of each canonical pipeline stage that differ in their width and stage-specific structure sizes (complexity) and depth (sub-pipelining).

3) Develop a high-level superscalar synthesis tool that can automatically compose an arbitrary superscalar processor based on processor-level and stage-level constraints (frequency, power, and area), and output multiple representations (Verilog, cycle-accurate C++, netlist, and physical design) of the processor.
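The three steps above can be sketched as a small stage library plus a selection routine. This is a minimal illustration, not the FabScalar tool: `StageVersion`, `LIBRARY`, `compose`, and all delay numbers are hypothetical, only three of the canonical stages are modeled, and the only constraint considered is a clock period (power and area are ignored). Requiring every stage to have the same width stands in for the "composable interface" of step 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageVersion:
    stage: str       # canonical stage name
    width: int       # number of superscalar ways
    depth: int       # number of sub-pipeline stages
    delay_ns: float  # hypothetical delay of the slowest sub-stage


# Hypothetical pre-designed versions of three canonical stages (step 2).
LIBRARY = {
    "fetch": [
        StageVersion("fetch", 2, 1, 0.9),
        StageVersion("fetch", 2, 2, 0.5),
        StageVersion("fetch", 4, 2, 0.7),
        StageVersion("fetch", 4, 3, 0.5),
    ],
    "decode": [
        StageVersion("decode", 2, 1, 0.4),
        StageVersion("decode", 4, 1, 0.5),
    ],
    "rename": [
        StageVersion("rename", 2, 1, 0.5),
        StageVersion("rename", 4, 1, 0.8),
        StageVersion("rename", 4, 2, 0.45),
    ],
}


def compose(library, width, max_clock_ns):
    """Step 3, simplified: for each stage, pick the shallowest version of the
    requested width whose sub-stage delay fits within the clock period."""
    core = {}
    for stage, versions in library.items():
        candidates = [
            v for v in versions
            if v.width == width and v.delay_ns <= max_clock_ns
        ]
        if not candidates:
            raise ValueError(f"no {width}-wide {stage} version meets {max_clock_ns} ns")
        core[stage] = min(candidates, key=lambda v: v.depth)
    return core


if __name__ == "__main__":
    for stage, version in compose(LIBRARY, width=4, max_clock_ns=0.5).items():
        print(stage, version)
```

Choosing the shallowest version that meets timing is one plausible policy among several; a real flow would also weigh power, area, and the IPC cost of extra sub-pipeline stages.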


SSL and Composability

[Figure: fetch stage versions (scalar, 1 to 3 stages; 2-way superscalar, 1 to 3 stages) composed with decode and rename stages.]


Status

Designed synthesizable Verilog for a baseline superscalar processor
• Starting point for populating SSL with pipeline stage designs

Stage | Description
Fetch | 4-wide, 512-entry BTB, 128-entry bimodal branch predictor, 8-entry RAS, 16-instruction fetch buffer
Decode | 4-wide, ISA = PISA (MIPS-like)
Rename | 4-wide, 32-entry rename map table with 8 read and 4 write ports, 4 shadow map tables (checkpoints)
Dispatch | 4-wide
Issue | 4-wide issue, 32-entry issue queue
Register Read | 4-wide, 128-entry physical register file with 8 read ports and 4 write ports
Execute | 1 simple ALU, 1 complex ALU, 1 branch ALU, 1 AGEN + 1 port to load-store unit
Load-Store Unit | 16-entry load queue, 16-entry store queue
Writeback | 4-wide
Retire | 4-wide, 128-entry active list with 4 read and 4 write ports, arch. map table with 4 read and 4 write ports

(Niket)


Status (cont.)

(Niket)

[Figures: propagation delay plots, delay in ns on the vertical axis]
• (RegRead Delay) vs (Register File Size): register file sizes 32 to 512; issue widths 2, 4, 6
• (Select Logic Delay) vs (Issue Queue Size): issue queue sizes 32 to 128; issue widths 2, 4, 6, 8
• (Wakeup Delay) vs (Issue Queue Size): issue queue sizes 16 to 128; issue widths 2, 4, 6, 8
• (Fetch-2 Delay) vs (Fetch Width): fetch widths 1 to 8
• (Rename Delay) vs (Rename Width): rename widths 1 to 8


Status (cont.)

Developed cycle-accurate C++ simulator and Verilog/C++ co-simulation environment
• Cycle-accurate at pipeline stage level

(Salil)

[Diagram: per-stage C++ and Verilog models, compared (==) against each other and against a functional simulator.]
(a) Tightly integrated C++ & Verilog. (b) Standalone C++. (c) Standalone Verilog.

Figure 1. Flexible simulation options.

Benchmark | gap | gcc | gzip | twolf | vortex | vpr
IPC | 0.45 | 0.45 | 0.54 | 0.44 | 0.52 | 0.48
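The "==" checks in Figure 1 amount to stepping two models of the same stage in lockstep and comparing their outputs every cycle. The sketch below shows that idea in its simplest form. It is not the FabScalar co-simulation environment: `cosimulate`, `reference_stage`, and `buggy_stage` are hypothetical names, and the "stage" is a stateless field extractor rather than a real C++ or Verilog pipeline stage.

```python
def cosimulate(model_a, model_b, stimuli):
    """Drive two models with the same per-cycle stimulus and return the
    first cycle at which their outputs differ, or None if they never do."""
    for cycle, stimulus in enumerate(stimuli):
        if model_a(stimulus) != model_b(stimulus):
            return cycle
    return None


def reference_stage(word):
    """Hypothetical stage model: split a 32-bit word into two fields."""
    return (word >> 26, (word >> 21) & 0x1F)


def buggy_stage(word):
    """Same stage with a deliberately injected bug: the mask is one bit short."""
    return (word >> 26, (word >> 21) & 0x0F)


if __name__ == "__main__":
    stimuli = [0x00000000, 0x03E00000]
    print(cosimulate(reference_stage, reference_stage, stimuli))
    print(cosimulate(reference_stage, buggy_stage, stimuli))
```

Comparing at stage boundaries, rather than only at retirement, localizes a mismatch to the stage and cycle where the two models first diverge.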


Status (cont.)

Developed register file compiler
• Superscalar processor has many specialized and highly-ported RAM-based structures

(Tanmay)

[Figure: 16R8W bitcell layout]


Status (cont.)

Begun sub-pipelining key stages: fetch and issue
• Block-ahead pipelining [Seznec et al.]

[Figure: fetch of blocks A, B, C, D under three schemes]
• Unpipelined Fetch: throughput = 1
• Pipelined Fetch (no block-ahead): throughput = 1
• Pipelined Fetch (with block-ahead): throughput = 2

(Jayneel)
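The three throughput figures on this slide can be reproduced with a back-of-the-envelope model. This is a sketch under stated assumptions, not the authors' analysis: `relative_fetch_throughput` is a hypothetical helper, it assumes that splitting fetch into `depth` sub-stages divides the cycle time by `depth` (no latch overhead), and it assumes that without block-ahead a new fetch block can only start once the previous block has left the last sub-stage.

```python
def relative_fetch_throughput(depth: int, block_ahead: bool) -> float:
    """Fetch blocks per unit time, relative to unpipelined fetch (= 1.0)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Idealized: cycle time shrinks in proportion to the pipeline depth.
    cycle_time = 1.0 / depth
    # Without block-ahead, the next block address is known only after the
    # previous block finishes, so a block starts every `depth` cycles.
    cycles_between_blocks = 1 if (depth == 1 or block_ahead) else depth
    return 1.0 / (cycles_between_blocks * cycle_time)


if __name__ == "__main__":
    print("unpipelined:            ", relative_fetch_throughput(1, False))
    print("pipelined, no block-ahead:", relative_fetch_throughput(2, False))
    print("pipelined, block-ahead:  ", relative_fetch_throughput(2, True))
```

Under these assumptions a two-stage fetch unit gains nothing from pipelining alone, and doubles its throughput only when block-ahead prediction lets a new block start every cycle.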


Example Applications

Superscalar customization, fast design-space exploration

(Sandeep)

Rows: benchmark. Columns: core.

Benchmark | bzip | crafty/gap/vpr | parser/perl/twolf | gcc | gzip | vortex | mcf
bzip | 8.62 | 7.39 | 6.58 | 6.7 | 7.39 | 8.42 | 3.58
crafty | 2.92 | 3.17 | 3.17 | 2.8 | 3.14 | 3 | 2.45
gap | 3.78 | 4.02 | 3.59 | 3.6 | 4 | 3.96 | 2.76
vpr | 2.41 | 2.63 | 2.61 | 2.37 | 2.6 | 2.46 | 2.18
parser | 3.17 | 3.33 | 3.43 | 3.2 | 3.33 | 3.16 | 2.73
perl | 2.4 | 2.59 | 2.64 | 2.51 | 2.58 | 2.43 | 2.33
twolf | 1.9 | 2.1 | 2.11 | 1.95 | 2.07 | 1.93 | 1.9
gcc | 3.09 | 3.2 | 3.19 | 3.25 | 3.16 | 3.16 | 2.68
gzip | 3.2 | 3.39 | 3.26 | 3.04 | 3.39 | 3.28 | 2.45
vortex | 4.76 | 4.97 | 4.23 | 4.42 | 4.95 | 4.97 | 2.95
mcf | 1.41 | 1.5 | 1.66 | 1.5 | 1.5 | 1.42 | 1.68

Core configurations. Widths: f/d, i/c. Depths: f, d, i, rr, ex, m1, m2.

benchmark/core | clock | RF | IQ | LQ/SQ | f/d | i/c | I$ | D$ | L2$ | f | d | i | rr | ex | m1 | m2
bzip | 0.35 | 512 | 64 | 48 | 6 | 6 | 64 | 64 | 1024 | 7 | 4 | 5 | 5 | 2 | 6 | 3
crafty/gap/vpr | 0.5 | 256 | 64 | 48 | 6 | 4 | 128 | 64 | 2048 | 5 | 3 | 3 | 2 | 1 | 4 | 2
parser/perl/twolf | 0.5 | 256 | 64 | 24 | 6 | 4 | 128 | 64 | 2048 | 5 | 3 | 3 | 2 | 1 | 3 | 2
gcc | 0.5 | 256 | 32 | 48 | 6 | 6 | 128 | 64 | 2048 | 5 | 3 | 3 | 3 | 1 | 4 | 2
gzip | 0.5 | 256 | 64 | 48 | 8 | 4 | 128 | 64 | 2048 | 5 | 4 | 3 | 2 | 1 | 4 | 2
vortex | 0.55 | 512 | 64 | 64 | 8 | 6 | 128 | 128 | 1024 | 5 | 3 | 3 | 3 | 1 | 4 | 2
mcf | 0.6 | 32 | 16 | 16 | 4 | 4 | 128 | 128 | 2048 | 4 | 2 | 2 | 1 | 1 | 2 | 2
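The benchmark-by-core table lends itself to a quick cross-check. The sketch below is illustrative only: the scores are the values as read from the slide's table, the slide does not name the metric (the code assumes higher is better, since each benchmark scores highest on its own customized core), and the helper names `own_core`, `best_core`, and `slowdown_on` are hypothetical.

```python
CORES = ["bzip", "crafty/gap/vpr", "parser/perl/twolf", "gcc", "gzip", "vortex", "mcf"]

# Scores per benchmark, one entry per core in CORES order (read from the slide).
SCORES = {
    "bzip":   [8.62, 7.39, 6.58, 6.70, 7.39, 8.42, 3.58],
    "crafty": [2.92, 3.17, 3.17, 2.80, 3.14, 3.00, 2.45],
    "gap":    [3.78, 4.02, 3.59, 3.60, 4.00, 3.96, 2.76],
    "vpr":    [2.41, 2.63, 2.61, 2.37, 2.60, 2.46, 2.18],
    "parser": [3.17, 3.33, 3.43, 3.20, 3.33, 3.16, 2.73],
    "perl":   [2.40, 2.59, 2.64, 2.51, 2.58, 2.43, 2.33],
    "twolf":  [1.90, 2.10, 2.11, 1.95, 2.07, 1.93, 1.90],
    "gcc":    [3.09, 3.20, 3.19, 3.25, 3.16, 3.16, 2.68],
    "gzip":   [3.20, 3.39, 3.26, 3.04, 3.39, 3.28, 2.45],
    "vortex": [4.76, 4.97, 4.23, 4.42, 4.95, 4.97, 2.95],
    "mcf":    [1.41, 1.50, 1.66, 1.50, 1.50, 1.42, 1.68],
}


def own_core(benchmark):
    """The core whose name lists this benchmark (its customized core)."""
    return next(core for core in CORES if benchmark in core.split("/"))


def best_core(benchmark):
    """Core with the highest score for this benchmark (first one on ties)."""
    row = SCORES[benchmark]
    return CORES[row.index(max(row))]


def slowdown_on(benchmark, core):
    """Ratio of the best score to the score on the given core (>= 1.0)."""
    row = SCORES[benchmark]
    return max(row) / row[CORES.index(core)]


if __name__ == "__main__":
    for bench in SCORES:
        print(f"{bench:7s} best on {best_core(bench):18s} "
              f"slowdown on mcf core: {slowdown_on(bench, 'mcf'):.2f}x")
```

Every benchmark reaches its maximum score on its own customized core (with a few ties), which is the behavior the customization argument predicts.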


Example Applications (cont.)

Configure parallel processor for parallel workload at hand.
• Tiled Het. Multi-cores
• Core-Selectability in Chip Multiprocessors

(Hashem)


Example Applications (cont.)

Revisit microarchitecture techniques
• Techniques discarded for limited applicability may be valuable in workload-customized cores


Example Applications (cont.)

Conventional methodology flawed
• Arbitrarily pick a baseline (perhaps rules-of-thumb)
• Add gadget to baseline
• Speedup: (baseline+gadget) / (baseline)
• Influence of gadget depends on choice of baseline
• Example: Value prediction more important with undersized IQ

OK methodology
• Baseline = custom core for each benchmark
• Add gadget to this baseline, per benchmark
• Speedup: (baseline+gadget) / (baseline)

Better methodology
• Baseline = custom core for each benchmark
• Recustomize core with gadget in place (new global optimum)
• Speedup: (recustomized core) / (customized core)
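The three methodologies differ only in which pair of design points enters the speedup ratio, which a toy design space makes concrete. Everything in this sketch is hypothetical: the `PERF` table, its numbers, and the function names are invented for illustration, the design space has a single parameter (issue queue size), and only one benchmark is modeled.

```python
# Hypothetical performance of one benchmark: PERF[gadget_present][iq_size].
PERF = {
    False: {16: 1.0, 32: 1.8, 64: 2.0},
    True:  {16: 1.5, 32: 2.2, 64: 2.1},
}


def customize(gadget):
    """Return the IQ size that maximizes performance with or without the gadget."""
    return max(PERF[gadget], key=PERF[gadget].get)


def conventional_speedup(baseline_iq):
    """Arbitrary baseline: the result depends on which baseline was picked."""
    return PERF[True][baseline_iq] / PERF[False][baseline_iq]


def ok_speedup():
    """Baseline is the customized core; the gadget is added without re-tuning."""
    iq = customize(False)
    return PERF[True][iq] / PERF[False][iq]


def better_speedup():
    """Both cores are customized: re-optimize with the gadget in place."""
    return PERF[True][customize(True)] / PERF[False][customize(False)]


if __name__ == "__main__":
    print("conventional, undersized IQ:", conventional_speedup(16))
    print("OK methodology:             ", ok_speedup())
    print("better methodology:         ", better_speedup())
```

With these made-up numbers the gadget looks like a 1.5x win on the undersized baseline, 1.05x on the customized core, and 1.1x once the core is recustomized around it, mirroring the slide's point that the measured benefit depends on the baseline.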


Summary

Customizing superscalar cores has value in application-specific designs and heterogeneous multi-core chips

Customization captures interplay among program, microarchitecture, and technology

FabScalar enables the composition of arbitrary superscalar processors, inclusive of technology

Enabled by canonical view of superscalar pipeline, and a lot of “pre-fab” by students who aren’t paid enough (accepting donations)

http://www.tinker.ncsu.edu/ericro/research/fabscalar.htm

Supported by NSF and IBM.