Dynamic Management of Microarchitecture Resources in Future Processors

Rajeev Balasubramonian
Dept. of Computer Science, University of Rochester
Talk Outline

• Trade-offs in future microprocessors
• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads
• Future work
Design Goals in Modern Processors

Microprocessor designs strive for:

• High performance: high clock speed, high parallelism
• Low power
• Low design complexity: short, simple pipelines

Unfortunately, not all can be achieved simultaneously.
Trade-Off in the Cache Size

[Figure: a CPU with a small L1 data cache vs. a CPU with a large L1 data cache]

Size/access time:              32KB cache / 2 cycles   128KB / 4 cycles
"sort 4000" miss rate:         very low                very low
"sort 4000" execution time:    t                       t + x
"sort 16000" miss rate:        high                    very low
"sort 16000" execution time:   T                       T - X
Trade-Off in the Register File Size

The register file stores results for all active instructions in the processor.

Large register file → more active instructions → high parallelism
                    → long access times → slow clock speed / more pipeline stages
                    → high power, design complexity
Trade-Offs Involving Resource Sizes

Trade-offs influence the design of the cache, register file, issue queue, etc.

Large resource size → high parallelism, ability to support more threads
                    → long latency → long pipelines / low clock speed
                    → high power, high design complexity
Parallelism-Latency Trade-Off

For each resource, performance depends on:

• the parallelism it can help extract
• the negative impact of its latency

Every program has different parallelism and latency needs.
Limitations of Conventional Designs
• Resource sizes are fixed at design time – the size that works best, on average, for all programs
• This average size is often too small or too large for many programs
For optimal performance, the hardware should match the program’s parallelism needs.
Dynamic Resource Management

• Reconfigurable memory hierarchy (MICRO'00, IEEE TOC, PACT'02)
• Trade-offs in clusters (ISCA'03)
• Selective pre-execution (ISCA'01)
• Efficient register file design (MICRO'01)
• Dynamic voltage/frequency scaling (HPCA'02)
Talk Outline

• Trade-offs in future microprocessors
• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads
• Future work
Conventional Cache Hierarchies

CPU → L1 → L2 → Main Memory

L1: 32KB, 2-way set-associative, 2 cycles, miss rate 2.3%
L2: 2MB, 8-way, 20 cycles, miss rate 0.2%

Moving from L1 toward main memory trades speed for capacity.
Conventional Cache Layout

[Figure: cache array with decoder, wordlines, bitlines, and output drivers; the address drives the decoder and data is read out of ways 0 and 1]
Wire Delays
• Delay is a quadratic function of the wire length
• By inserting repeaters/buffers, delay grows roughly linearly with length
Length = x  → Delay ≈ t
Length = 2x → Delay ≈ 4t (no repeaters)
Length = 2x → Delay ≈ 2t + logic_delay (with a repeater)

• Repeaters electrically isolate the wire segments
• Commonly used today in long wires
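This scaling can be sketched numerically. A minimal model (the wire constant `k` and the repeater delay `t_rep` are illustrative assumptions, not figures from the talk):

```python
def unbuffered_delay(length, k=1.0):
    """Delay of a plain wire: quadratic in its length."""
    return k * length ** 2

def repeated_delay(length, n_segments, k=1.0, t_rep=0.3):
    """Delay of the same wire split into n repeater-driven segments:
    each short segment keeps its quadratic delay, but the total grows
    roughly linearly with length for a fixed segment size."""
    segment = length / n_segments
    return n_segments * k * segment ** 2 + (n_segments - 1) * t_rep

# Doubling the length quadruples the unbuffered delay (t -> 4t)...
assert unbuffered_delay(2.0) == 4 * unbuffered_delay(1.0)
# ...while one repeater in the middle brings it back to ~2t + logic_delay.
print(repeated_delay(2.0, 2))  # 2t + t_rep = 2.3
```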
The Reconfigurable Cache Layout

[Figure: decoder driving four ways; only way 0 is enabled]

32KB 1-way cache, 2 cycles
The Reconfigurable Cache Layout

[Figure: decoder driving four ways; ways 0 and 1 are enabled]

64KB 2-way cache, 3 cycles

The disabled portions of the cache are used as the non-inclusive L2.
Salient Features
• Low-cost: Exploits the benefits of repeaters
• Optimizes the access time/capacity trade-off
• Can reduce energy -- most efficient when cache size equals working set size
Control Mechanism

Gather statistics at periodic intervals (every 10K instructions):

1. Inspect stats. Is there a phase change?
2. If yes, explore: run each configuration for an interval and pick the best configuration.
3. If no, remain at the selected configuration.
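The loop above can be sketched in software (a minimal sketch; the configuration list and IPC inputs are illustrative, and in hardware the statistics come from performance counters):

```python
class IntervalController:
    """Interval-based mechanism: after a phase change, try every
    configuration for one interval, then stay with the best (by IPC)."""

    def __init__(self, configs):
        self.configs = list(configs)   # e.g. cache sizes or cluster counts
        self.exploring = True
        self.to_try = list(configs)    # configurations left to sample
        self.scores = {}               # config -> IPC measured while using it
        self.current = self.to_try.pop(0)

    def end_of_interval(self, ipc, phase_change=False):
        """Called once per interval (e.g. every 10K instructions)."""
        if self.exploring:
            self.scores[self.current] = ipc
            if self.to_try:
                self.current = self.to_try.pop(0)  # keep exploring
            else:
                self.exploring = False             # pick the best and stay
                self.current = max(self.scores, key=self.scores.get)
        elif phase_change:
            self.exploring = True                  # re-explore on a phase change
            self.to_try = list(self.configs)
            self.scores = {}
            self.current = self.to_try.pop(0)
        return self.current
```

For the cache, `configs` would be the allowed size/associativity points; for the clustered processor, the number of active clusters.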
Metrics

• Optimizing performance: the metric for the best configuration is simply instructions per cycle (IPC)
• Detecting a phase change: a change in branch frequency or miss frequency, or a sudden change in IPC, signals a change in program phase
• To avoid unnecessary explorations, the thresholds can be adapted at run-time
Simulation Methodology

• Modified version of SimpleScalar-3.0 -- models many details, such as bus contention
• Executing programs from various benchmark sets (a mix of many program types)
Performance Results

[Chart: Instructions per cycle (IPC) for em3d, health, mst, compress, hydro2d, apsi, swim, and art, plus their harmonic mean (HM); conventional vs. dynamic]

Overall harmonic mean (HM) improvement: 17%
Energy Results

[Chart: Energy per instruction (nJ/instr) for em3d, health, mst, compress, hydro2d, apsi, swim, and art, plus their arithmetic mean (AM); conventional 3-level vs. dynamic]

Overall energy savings: 42%
Talk Outline

• Trade-offs in future microprocessors
• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads
• Future work
Conventional Processor Design

[Figure: I-cache and branch predictor feed a single rename & dispatch stage, one large issue queue, one large register file, and the functional units (FUs)]

Large structures → slower clock speed
The Clustered Processor

[Figure: I-cache and branch predictor feed rename & dispatch, which steers instructions to four clusters, each with its own small issue queue (IQ), register file, and FU]

r1 ← r3 + r4
r41 ← r43 + r44
r2 ← r1 + r41   (operands produced in different clusters)

Small structures → faster clock speed.
But: high latency for some instructions.
Emerging Trends
• Wire delays and faster clocks will make each cluster smaller
• Larger transistor budgets and low design cost will enable the implementation of many clusters on the chip
• The support of many threads will require many resources and clusters
Numerous, small clusters will be a reality!
Communication Costs

[Figure: 4-cluster vs. 8-cluster layouts; each cluster contains Regs, IQ, and FU]

More clusters → more communication
Communication vs. Parallelism

4 clusters, 100 active instrs:
r1 ← r2 + r3
r5 ← r1 + r3
…
r7 ← r2 + r3
r8 ← r7 + r3

8 clusters, 200 active instrs -- the larger window also holds:
…
r5 ← r1 + r7
…
r9 ← r2 + r3

Distant parallelism: distant instructions that are ready to execute.
Communication-Parallelism Trade-Off

• More clusters → more communication, but also more parallelism
• Selectively use more clusters:
  – if communication is tolerable
  – if there is additional distant parallelism
IPC with Many Clusters (ISCA'03)

[Chart: Instructions per cycle (IPC) for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and HM with 4, 8, and 16 clusters]
Trade-Off Management
• The clustered processor abstraction exposes the trade-off between communication and parallelism
• It also simplifies the management of resources -- we can disable a cluster by simply not dispatching instructions to it
Control Mechanism

Gather statistics at periodic intervals (every 10K instructions):

1. Inspect stats. Is there a phase change?
2. If yes, explore: run each configuration for an interval and pick the best configuration.
3. If no, remain at the selected configuration.
The Interval Length
• Success depends on ability to repeat behavior across successive intervals
• Every program is likely to have phase changes at different granularities
• Must also pick the interval length at run-time
Picking the Interval Length
• Start with minimum allowed interval length
• If phase changes are too frequent, double the interval length – find a coarse enough granularity such that behavior is consistent
• Repeat every 10 billion instructions
• Small interval lengths can result in noisy measurements
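The doubling rule above can be sketched as follows (the 5% instability threshold, the length cap, and the sample profile are illustrative assumptions):

```python
def pick_interval_length(instability_at, min_length=10_000,
                         max_length=40_000_000, threshold=0.05):
    """Start at the minimum interval length and, while too many intervals
    flag a phase change, double the length until behavior is consistent.
    `instability_at(length)` returns the measured instability factor."""
    length = min_length
    while length < max_length and instability_at(length) > threshold:
        length *= 2
    return length

# A made-up instability profile that stabilizes at a 320K interval:
profile = {10_000: 0.30, 20_000: 0.20, 40_000: 0.12,
           80_000: 0.09, 160_000: 0.07, 320_000: 0.04}
print(pick_interval_length(profile.__getitem__))  # 320000
```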
Varied Interval Lengths

Benchmark   Instability factor for a   Minimum acceptable interval
            10K interval length        length / its instability factor
gzip        4%                         10K / 4%
vpr         14%                        320K / 5%
crafty      30%                        320K / 4%
parser      12%                        40M / 5%
swim        0%                         10K / 0%
mgrid       0%                         10K / 0%
galgel      1%                         10K / 1%
cjpeg       9%                         40K / 4%
djpeg       31%                        1280K / 1%

Instability factor: percentage of intervals that flag a phase change.
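This metric is simple to compute; a minimal sketch (the per-interval flags are illustrative -- in the processor they come from the phase-detection thresholds):

```python
def instability_factor(phase_change_flags):
    """Fraction of intervals whose statistics flagged a phase change."""
    if not phase_change_flags:
        return 0.0
    return sum(phase_change_flags) / len(phase_change_flags)

# e.g. 2 flagged intervals out of 50 gives a gzip-like 4% instability
flags = [False] * 48 + [True] * 2
print(f"{instability_factor(flags):.0%}")  # 4%
```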
Results with Interval-Based Scheme

[Chart: Instructions per cycle (IPC) for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and HM with 4 clusters, 16 clusters, and the interval-based scheme]

Overall improvement: 11%
Talk Outline

• Trade-offs in future microprocessors
• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads
• Future work
Pre-Execution

• Executing a subset of the program in advance
• Helps warm up various processor structures such as the cache and branch predictor
The Future Thread (ISCA'01)

• The main program thread executes every single instruction
• Some registers are reserved for the future thread so it can jump ahead

[Figure: the future (pre-execution) thread runs ahead of the main thread in the instruction stream]
Key Innovations

• Ability to advance much further:
  – eager recycling of registers
  – skipping idle instructions
• Integrating pre-executed results:
  – re-using register results
  – correcting branch mispredicts
  – prefetching into the caches
• Allocation of resources
Trade-Offs in Resource Allocation

[Figure: the main thread and the future thread divide the registers between them]

• Allocating more registers for the main thread favors nearby parallelism
• Allocating more registers for the future thread favors distant parallelism
• The interval-based mechanism can pick the optimal allocation
Pre-Execution Results

[Chart: Instructions per cycle (IPC) for em3d, mst, peri, swim, lucas, sp, bt, go, comp, and HM -- base case, pre-execution with 12 registers, and pre-execution with dynamic allocation]

Overall improvement with 12 registers: 11%
Overall improvement with dynamic allocation: 18%
Conclusion

• Emerging technologies will make trade-off management vital
• Approaches to hardware adaptation:
  – cache hierarchy
  – clustered processors
  – pre-execution threads
• The interval-based mechanism with exploration is robust and applies to most problem domains
Talk Outline

• Trade-offs in future microprocessors
• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads
• Future work
Future Scenarios
• Clustered designs can be used to produce all classes of processors
• A library of simple cluster cores – with different energy, clock speed, latency, and parallelism characteristics
• The role of the architect: putting these cores together on the chip and exploiting them to maximize performance
Heterogeneous Clusters
• Having different clusters on the chip provides many options for instruction steering
• For example, a program limited by communication will benefit from large, slow cluster cores
• Non-critical instructions of a program could be steered to slow, energy-efficient clusters -- such clusters can also help reduce processor hot-spots
Other Critical Problems
How does one build a highly clustered processor?
Where does the cache go?
What interconnect topology do we use?
How does multithreading affect these choices?
More Details…

Research synopses and papers available at:
http://www.cs.rochester.edu/~rajeev/research.html