NC STATE UNIVERSITY
FabScalar
Center for Efficient, Scalable and Reliable Computing (CESR)
Department of Electrical and Computer Engineering
North Carolina State University
Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Sandeep S. Navada, Hashem H. Najaf-abadi, Eric Rotenberg
Eric Rotenberg © 2009 WARP’09 6/20/09
High-Performance Superscalar Processor
Generic pipeline configuration
↑ Good performance on wide range of applications
↓ Not highest-performing for any given application
↓ Power inefficient
Application-Specific Superscalar Processor
[Figure: App. X on a generic superscalar processor vs. App. X on an application-specific superscalar processor]
Propagation Delay
2-way superscalar to 4-way superscalar:
– Increase sizes of ILP-extracting units to expose and exploit more ILP
– Hide increase in propagation delays with deeper pipelining
– Except: worsened propagation delays not hidden for inter-instruction dependences
[Figure: execution time of App. 1 and App. 2 on 2-way and 4-way designs, with propagation delay (ns); regions labeled "dependencies" and "independencies"]
Heterogeneous Multi-core
[Figure: cores for App. 1, App. 2, ..., App. N]
Customize each core to an application, class of application, or class of application behavior.
Challenge
Customization captures interplay between program, microarchitecture, and technology
Need real superscalar designs … and need many of them
Need to try out many real superscalar designs.
Need tool for automatically composing physical designs of arbitrary superscalar processors.
Target both R & D
Research: High-fidelity designs improve discovery
Development: Designs should be product strength
Canonical Superscalar Processor
Different superscalar processors have same canonical pipeline stages:
Fetch, Decode, Rename, Dispatch, Issue, Reg. Read, Execute, Load/Store Unit, Writeback, Retire
Their canonical stages differ in terms of:
• Complexity
  – Width, i.e., number of superscalar “ways”
  – Sizes of stage-specific structures
• Sub-pipelining
  – How deeply pipelined a canonical stage is
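One way to picture the canonical view is as a fixed list of stage names, each carrying its own width, depth, and structure sizes. The sketch below is only an illustration under that assumption: the names StageConfig, make_processor, total_depth, and the "btb_entries" key are hypothetical and are not taken from FabScalar.

```python
from dataclasses import dataclass, field

# The ten canonical stages named on the slide, in pipeline order.
CANONICAL_STAGES = ("fetch", "decode", "rename", "dispatch", "issue",
                    "regread", "execute", "lsu", "writeback", "retire")

@dataclass
class StageConfig:
    """Hypothetical per-stage knobs: complexity (width, sizes) and sub-pipelining (depth)."""
    width: int = 1                              # number of superscalar "ways"
    depth: int = 1                              # number of sub-pipeline stages
    sizes: dict = field(default_factory=dict)   # stage-specific structure sizes

def make_processor(overrides=None):
    """Build a processor description: every canonical stage is present exactly once."""
    overrides = overrides or {}
    unknown = set(overrides) - set(CANONICAL_STAGES)
    if unknown:
        raise ValueError(f"not canonical stages: {sorted(unknown)}")
    return {name: overrides.get(name, StageConfig()) for name in CANONICAL_STAGES}

def total_depth(processor):
    """Total pipeline depth is the sum of the per-stage sub-pipeline depths."""
    return sum(cfg.depth for cfg in processor.values())
```

Two processors built this way always share the same stage list and differ only in the per-stage values, which is the point the slide makes.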
FabScalar
1) Define composable interfaces of canonical pipeline stages, so that they can be stitched together to compose an overall superscalar processor.
2) Pre-design multiple versions of each canonical pipeline stage that differ in their width and stage-specific structure sizes (complexity) and depth (sub-pipelining).
3) Develop a high-level superscalar synthesis tool that can automatically compose an arbitrary superscalar processor based on processor-level and stage-level constraints (frequency, power, and area), and output multiple representations (Verilog, cycle-accurate C++, netlist, and physical design) of the processor.
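The composition step can be sketched as a selection over a library of pre-designed stage versions. This is a minimal illustration, not FabScalar's actual algorithm: the selection policy (shallowest version that meets the clock), the assumption that a stage's delay divides evenly across its sub-stages, and every delay value in HYPOTHETICAL_LIBRARY are invented for the example. Power and area constraints are left out.

```python
from typing import NamedTuple

class StageVersion(NamedTuple):
    stage: str        # canonical stage name
    width: int        # superscalar ways
    depth: int        # sub-pipeline stages
    delay_ns: float   # total logic delay of the stage (hypothetical numbers)

# Hypothetical library entries, for illustration only.
HYPOTHETICAL_LIBRARY = [
    StageVersion("fetch", 4, 1, 0.9),
    StageVersion("fetch", 4, 2, 1.0),
    StageVersion("rename", 4, 1, 0.45),
    StageVersion("rename", 4, 2, 0.5),
]

def compose(library, stage_order, width, clock_ns):
    """Pick, per stage, the shallowest version of the given width that fits the clock.

    Assumes (idealized) that a stage's delay splits evenly across its sub-stages.
    """
    chosen = []
    for stage in stage_order:
        candidates = [v for v in library
                      if v.stage == stage and v.width == width
                      and v.delay_ns / v.depth <= clock_ns]
        if not candidates:
            raise ValueError(f"no {width}-wide {stage} version meets {clock_ns} ns")
        chosen.append(min(candidates, key=lambda v: v.depth))
    return chosen
```

For example, compose(HYPOTHETICAL_LIBRARY, ["fetch", "rename"], 4, 0.5) selects the 2-deep fetch and the 1-deep rename from the made-up library.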
SSL and Composability
[Figure: SSL entries for the fetch, decode, and rename stages, e.g., scalar with 1 to 3 stages and 2-way superscalar with 1 to 3 stages]
Status
Niket
Designed synthesizable Verilog for a baseline superscalar processor
• Starting point for populating SSL with pipeline stage designs

Stage            Description
Fetch            4-wide, 512-entry BTB, 128-entry bimodal branch predictor, 8-entry RAS, 16-instruction fetch buffer
Decode           4-wide, ISA = PISA (MIPS-like)
Rename           4-wide, 32-entry rename map table with 8 read and 4 write ports, 4 shadow map tables (checkpoints)
Dispatch         4-wide
Issue            4-wide issue, 32-entry issue queue
Register Read    4-wide, 128-entry physical register file with 8 read ports and 4 write ports
Execute          1 simple ALU, 1 complex ALU, 1 branch ALU, 1 AGEN + 1 port to load-store unit
Load-Store Unit  16-entry load queue, 16-entry store queue
Writeback        4-wide
Retire           4-wide, 128-entry active list with 4 read and 4 write ports, arch. map table with 4 read and 4 write ports
Status (cont.)
Niket
Status (cont.)
Niket
[Figure: RegRead delay (ns) vs. register file size (32 to 512), for issue widths 2, 4, and 6]
[Figure: Select logic delay (ns) vs. issue queue size (32 to 128), for issue widths 2, 4, 6, and 8]
[Figure: Wakeup delay (ns) vs. issue queue size (16 to 128), for issue widths 2, 4, 6, and 8]
[Figure: Fetch-2 delay (ns) vs. fetch width (1 to 8)]
[Figure: Rename delay (ns) vs. rename width (1 to 8)]
Status (cont.)
Salil
Developed cycle-accurate C++ simulator and Verilog/C++ co-simulation environment
• Cycle-accurate at pipeline stage level
[Figure: pipeline-stage blocks in C++ and/or Verilog, with "==" comparison points and a functional simulator]
(a) Tightly integrated C++ & verilog. (b) Standalone C++. (c) Standalone verilog.
Figure 1. Flexible simulation options.
Benchmark  gap   gcc   gzip  twolf  vortex  vpr
IPC        0.45  0.45  0.54  0.44   0.52    0.48
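The "==" checks in Figure 1 suggest comparing the output of one model against another as simulation proceeds. A minimal sketch of such a check is shown here, assuming each model exposes its retired instructions as a sequence of comparable records. The record layout (pc, destination register, value) and the function name first_mismatch are hypothetical and are not taken from the FabScalar environment.

```python
from itertools import zip_longest

def first_mismatch(reference, model):
    """Return the index of the first retired-instruction record that differs.

    Returns None if both streams are identical, including in length.
    A stream that ends early counts as a mismatch at that position.
    """
    missing = object()  # sentinel that never equals a real record
    pairs = zip_longest(reference, model, fillvalue=missing)
    for index, (expected, actual) in enumerate(pairs):
        if expected != actual:
            return index
    return None

# Hypothetical records: (pc, destination register, value written).
functional = [(0x400, 1, 5), (0x404, 2, 7), (0x408, 3, 12)]
timing_ok  = [(0x400, 1, 5), (0x404, 2, 7), (0x408, 3, 12)]
timing_bad = [(0x400, 1, 5), (0x404, 2, 9), (0x408, 3, 12)]
```

Reporting the first point of divergence, rather than a pass/fail flag at the end of a run, keeps the failing instruction close to the cause of the bug.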
Status (cont.)
Tanmay
Developed register file compiler
• Superscalar processor has many specialized and highly-ported RAM-based structures
[Figure: 16R8W bitcell layout]
Status (cont.)
Jayneel
Begun sub-pipelining key stages: fetch and issue
• Block-ahead pipelining [Seznec et al.]
[Figure: fetch timing for blocks A, B, C, D]
• Unpipelined fetch: throughput = 1
• Pipelined fetch (no block-ahead): throughput = 1
• Pipelined fetch (with block-ahead): throughput = 2
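The three throughput figures can be reproduced with a small idealized model. This is one interpretation of the slide, not the authors' model. It assumes that pipelining fetch into `depth` sub-stages shortens the clock by the same factor, that without block-ahead a new fetch cannot start until the previous one finishes (its result is needed to form the next address), and that with block-ahead a new fetch starts every cycle. Latch overhead and mispredictions are ignored, and the function name is hypothetical.

```python
def fetch_throughput(depth, block_ahead):
    """Blocks fetched per unpipelined-fetch latency, under idealized assumptions."""
    # An unpipelined fetch, or a block-ahead pipeline, starts a fetch every cycle;
    # otherwise each fetch waits for the previous one to drain the pipeline.
    cycles_between_fetches = 1 if (depth == 1 or block_ahead) else depth
    blocks_per_cycle = 1 / cycles_between_fetches
    # The clock is `depth` times faster than the unpipelined design, so
    # normalize to the unpipelined fetch latency.
    return blocks_per_cycle * depth
```

With depth 2 this gives 1 for unpipelined fetch, 1 for pipelined fetch without block-ahead, and 2 with block-ahead, matching the numbers on the slide.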
Example Applications
Sandeep
Superscalar customization, fast design-space exploration

Benchmark (rows) on each customized core (columns):

Benchmark  bzip  crafty/gap/vpr  parser/perl/twolf  gcc   gzip  vortex  mcf
bzip       8.62  7.39            6.58               6.7   7.39  8.42    3.58
crafty     2.92  3.17            3.17               2.8   3.14  3       2.45
gap        3.78  4.02            3.59               3.6   4     3.96    2.76
vpr        2.41  2.63            2.61               2.37  2.6   2.46    2.18
parser     3.17  3.33            3.43               3.2   3.33  3.16    2.73
perl       2.4   2.59            2.64               2.51  2.58  2.43    2.33
twolf      1.9   2.1             2.11               1.95  2.07  1.93    1.9
gcc        3.09  3.2             3.19               3.25  3.16  3.16    2.68
gzip       3.2   3.39            3.26               3.04  3.39  3.28    2.45
vortex     4.76  4.97            4.23               4.42  4.95  4.97    2.95
mcf        1.41  1.5             1.66               1.5   1.5   1.42    1.68

Configuration of each customized core (f/d and i/c are widths; f, d, i, rr, ex, m1, m2 are depths):

Core               clock  RF   IQ  LQ/SQ  f/d  i/c  I$   D$   L2$   f  d  i  rr  ex  m1  m2
bzip               0.35   512  64  48     6    6    64   64   1024  7  4  5  5   2   6   3
crafty/gap/vpr     0.5    256  64  48     6    4    128  64   2048  5  3  3  2   1   4   2
parser/perl/twolf  0.5    256  64  24     6    4    128  64   2048  5  3  3  2   1   3   2
gcc                0.5    256  32  48     6    6    128  64   2048  5  3  3  3   1   4   2
gzip               0.5    256  64  48     8    4    128  64   2048  5  4  3  2   1   4   2
vortex             0.55   512  64  64     8    6    128  128  1024  5  3  3  3   1   4   2
mcf                0.6    32   16  16     4    4    128  128  2048  4  2  2  1   1   2   2
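The benchmark-by-core numbers can be checked mechanically. The sketch below transcribes the values from the slide and tests whether each benchmark reaches its best value on the core customized for it. The slide does not name the metric, so treating a higher value as better is an assumption. The helper names are hypothetical.

```python
CORES = ["bzip", "crafty/gap/vpr", "parser/perl/twolf", "gcc", "gzip", "vortex", "mcf"]

# Rows: benchmark. Columns: the cores in CORES order. Values as on the slide.
PERF = {
    "bzip":   [8.62, 7.39, 6.58, 6.7,  7.39, 8.42, 3.58],
    "crafty": [2.92, 3.17, 3.17, 2.8,  3.14, 3,    2.45],
    "gap":    [3.78, 4.02, 3.59, 3.6,  4,    3.96, 2.76],
    "vpr":    [2.41, 2.63, 2.61, 2.37, 2.6,  2.46, 2.18],
    "parser": [3.17, 3.33, 3.43, 3.2,  3.33, 3.16, 2.73],
    "perl":   [2.4,  2.59, 2.64, 2.51, 2.58, 2.43, 2.33],
    "twolf":  [1.9,  2.1,  2.11, 1.95, 2.07, 1.93, 1.9],
    "gcc":    [3.09, 3.2,  3.19, 3.25, 3.16, 3.16, 2.68],
    "gzip":   [3.2,  3.39, 3.26, 3.04, 3.39, 3.28, 2.45],
    "vortex": [4.76, 4.97, 4.23, 4.42, 4.95, 4.97, 2.95],
    "mcf":    [1.41, 1.5,  1.66, 1.5,  1.5,  1.42, 1.68],
}

def home_core(benchmark):
    """The core whose name lists this benchmark (its customized core)."""
    return next(core for core in CORES if benchmark in core.split("/"))

def best_cores(benchmark):
    """All cores that achieve the row maximum (assumes higher is better)."""
    row = PERF[benchmark]
    top = max(row)
    return [core for core, value in zip(CORES, row) if value == top]

def worst_case_loss(benchmark):
    """Fraction lost on the worst core relative to the best core."""
    row = PERF[benchmark]
    return 1 - min(row) / max(row)
```

Under the higher-is-better assumption, every benchmark attains its row maximum on its own core, although a few rows tie with another core (for example, gzip ties on the crafty/gap/vpr core).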
Example Applications (cont.)
Hashem
Configure parallel processor for parallel workload at hand.
• Tiled Het. Multi-cores
• Core-Selectability in Chip Multiprocessors
Example Applications (cont.)
Revisit microarchitecture techniques
• Techniques discarded for limited applicability may be valuable in workload-customized cores
Example Applications (cont.)
Conventional methodology flawed
• Arbitrarily pick a baseline (perhaps rules-of-thumb)
• Add gadget to baseline
• Speedup: (baseline+gadget) / (baseline)
• Influence of gadget depends on choice of baseline
• Example: Value prediction more important with undersized IQ
OK methodology
• Baseline = custom core for each benchmark
• Add gadget to this baseline, per benchmark
• Speedup: (baseline+gadget) / (baseline)
Better methodology
• Baseline = custom core for each benchmark
• Recustomize core with gadget in place (new global optimum)
• Speedup: (recustomized core) / (customized core)
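The three speedup definitions can be written side by side. This is a sketch under stated assumptions: perf is any caller-supplied function returning a higher-is-better score for a configuration with or without the gadget, and the design space is small enough to search exhaustively. The toy numbers below are invented purely to exercise the code, with the gadget helping most at the smallest issue queue, echoing the slide's value-prediction example. They are not measurements.

```python
def customize(perf, design_space, gadget):
    """Best configuration in the design space, with or without the gadget."""
    return max(design_space, key=lambda cfg: perf(cfg, gadget))

def speedup_conventional(perf, baseline):
    """(baseline+gadget) / (baseline), for an arbitrarily chosen baseline."""
    return perf(baseline, True) / perf(baseline, False)

def speedup_ok(perf, design_space):
    """Same ratio, but the baseline is the customized core."""
    base = customize(perf, design_space, False)
    return perf(base, True) / perf(base, False)

def speedup_better(perf, design_space):
    """(recustomized core) / (customized core): re-optimize with the gadget in place."""
    base = customize(perf, design_space, False)
    recustomized = customize(perf, design_space, True)
    return perf(recustomized, True) / perf(base, False)

# Toy stand-in for a real performance model (hypothetical numbers).
# Configurations are issue-queue sizes.
TOY_BASE = {16: 1.0, 32: 1.5, 64: 1.4}
TOY_GAIN = {16: 0.5, 32: 0.1, 64: 0.3}

def toy_perf(iq_size, gadget):
    return TOY_BASE[iq_size] + (TOY_GAIN[iq_size] if gadget else 0.0)
```

With the toy model, an undersized 16-entry baseline reports a speedup of 1.5, while the customized baseline reports about 1.07 and the recustomized core about 1.13. Because the recustomized core is a maximum over the same design space, the third figure can never be lower than the second.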
Summary
Customizing superscalar cores has value in application-specific designs and heterogeneous multi-core chips
Customization captures interplay among program, microarchitecture, and technology
FabScalar enables the composition of arbitrary superscalar processors, inclusive of technology
Enabled by canonical view of superscalar pipeline, and a lot of “pre-fab” by students who aren’t paid enough (accepting donations)
http://www.tinker.ncsu.edu/ericro/research/fabscalar.htm
Supported by NSF and IBM.