
Dynamic Management of Microarchitecture Resources in Future Processors

Rajeev Balasubramonian
Dept. of Computer Science, University of Rochester

Talk Outline

• Trade-offs in future microprocessors

• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads

• Future work

University of Rochester


Design Goals in Modern Processors

Microprocessor designs strive for:

• High performance
  – High clock speed
  – High parallelism

• Low power

• Low design complexity
  – Short, simple pipelines

Unfortunately, not all can be achieved simultaneously.


Trade-Off in the Cache Size

[Figure: CPU with a small L1 data cache vs. a large L1 data cache]

                               32KB cache / 2 cycles    128KB cache / 4 cycles
"sort 4000" miss rate:         very low                 very low
"sort 4000" execution time:    t                        t + x
"sort 16000" miss rate:        high                     very low
"sort 16000" execution time:   T                        T - X
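The qualitative trade-off in the table can be made concrete with a small average-memory-access-time (AMAT) sketch; the miss rates and the 20-cycle miss penalty are illustrative assumptions, not numbers from the talk:

```python
# Sketch: why a small, fast L1 wins when the working set fits,
# and a large, slow L1 wins when it does not.
# All miss rates and the miss penalty are assumed values.

MISS_PENALTY = 20  # cycles to service an L1 miss (assumed)

def amat(hit_time, miss_rate):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * MISS_PENALTY

# "sort 4000": working set fits in both caches -> small cache wins.
print(amat(2, 0.01), amat(4, 0.01))   # 2.2 vs 4.2 cycles

# "sort 16000": working set fits only in the large cache -> large cache wins.
print(amat(2, 0.30), amat(4, 0.01))   # 8.0 vs 4.2 cycles
```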


Trade-Off in the Register File Size

The register file stores results for all active instructions in the processor.

Large register file → more active instructions → high parallelism
                    → long access times → slow clock speed / more pipeline stages
                    → high power, design complexity

Trade-Offs Involving Resource Sizes

Trade-offs influence the design of the cache,register file, issue queue, etc.

Large resource size → high parallelism, ability to support more threads
                    → long latency → long pipelines / low clock speed
                    → high power, high design complexity


Parallelism-Latency Trade-Off

For each resource, performance depends on:

• the parallelism it can help extract

• the negative impact of its latency

Every program has different parallelism and latency needs.


Limitations of Conventional Designs

• Resource sizes are fixed at design time – the size that works best, on average, for all programs

• This average size is often too small or too large for many programs

For optimal performance, the hardware should match the program’s parallelism needs.


Dynamic Resource Management

• Reconfigurable memory hierarchy (MICRO’00, IEEE TOC, PACT’02)

• Trade-offs in clusters (ISCA’03)

• Selective pre-execution (ISCA’01)

• Efficient register file design (MICRO’01)

• Dynamic voltage/frequency scaling (HPCA’02)


Talk Outline

• Trade-offs in future microprocessors

• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads

• Future work

Conventional Cache Hierarchies

CPU → L1 → L2 → Main Memory

L1: 32KB, 2-way set-associative; 2-cycle access; miss rate 2.3%
L2: 2MB, 8-way; 20-cycle access; miss rate 0.2%

Moving away from the CPU: speed decreases, capacity increases.

Conventional Cache Layout

[Figure: two-way cache layout — the address enters the decoder, which drives wordlines; data travels along bitlines to the output drivers of way 0 and way 1.]

Wire Delays

• Delay is a quadratic function of the wire length

• By inserting repeaters/buffers, delay grows roughly linearly with length

Length = x  → Delay ~ t
Length = 2x → Delay ~ 4t
Length = 2x, with a repeater → Delay ~ 2t + logic_delay

• Repeaters electrically isolate the wire segments

• Commonly used today in long wires
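The scaling above can be checked with a toy delay model; the unit delay t and the repeater's logic delay are arbitrary illustrative constants, not values from the talk:

```python
# Toy model of wire delay: unbuffered RC delay grows quadratically
# with length; with a repeater between unit segments it grows
# roughly linearly. T_UNIT and LOGIC_DELAY are assumed constants.

T_UNIT = 1.0       # delay t of a unit-length wire segment
LOGIC_DELAY = 0.3  # delay added by each repeater (assumed)

def unbuffered_delay(length):
    """Quadratic in length: doubling the wire quadruples the delay."""
    return T_UNIT * length ** 2

def repeated_delay(length):
    """Roughly linear: unit segments of delay t, plus a repeater between them."""
    n = int(length)  # number of unit-length segments
    return n * T_UNIT + (n - 1) * LOGIC_DELAY

print(unbuffered_delay(1))  # t                -> 1.0
print(unbuffered_delay(2))  # ~4t              -> 4.0
print(repeated_delay(2))    # ~2t + logic_delay -> 2.3
```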

Exploiting Technology

[Figure: the cache layout redrawn with repeaters on its long wires.]

The Reconfigurable Cache Layout

[Figure: four-way cache layout — way 0 | way 1 | way 2 | way 3, with the decoder at the left.]

The Reconfigurable Cache Layout

[Figure: the same four-way layout with only way 0 enabled as the L1.]

32KB 1-way cache, 2 cycles

The Reconfigurable Cache Layout

[Figure: the same layout with ways 0 and 1 enabled as the L1.]

64KB 2-way cache, 3 cycles

The disabled portions of the cache are used as the non-inclusive L2.
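The configurations on these slides can be summarized as a lookup over the number of enabled ways; the 32KB/2-cycle, 64KB/3-cycle, and 128KB/4-cycle points come from the slides, while the 96KB 3-way latency is an extrapolated assumption:

```python
# Sketch of the reconfigurable L1/L2 boundary: enabling w of the 4 ways
# as L1 leaves the remaining ways as a non-inclusive L2.
# The 3-way latency is an assumption; the rest follow the slides.

TOTAL_WAYS = 4
WAY_SIZE_KB = 32

L1_LATENCY = {1: 2, 2: 3, 3: 4, 4: 4}  # cycles; 3-way entry assumed

def configure(ways_enabled):
    l1_kb = ways_enabled * WAY_SIZE_KB
    l2_kb = (TOTAL_WAYS - ways_enabled) * WAY_SIZE_KB
    return {"l1_kb": l1_kb, "assoc": ways_enabled,
            "latency": L1_LATENCY[ways_enabled], "l2_kb": l2_kb}

print(configure(1))  # 32KB 1-way L1 at 2 cycles; 96KB serves as L2
print(configure(2))  # 64KB 2-way L1 at 3 cycles; 64KB serves as L2
```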

Changing the Boundary between L1-L2

[Figure: the boundary between L1 and L2 shifts — capacity moves from one to the other as ways are enabled or disabled.]

Trade-Off in the Cache Size

[Figure: CPU with a small L1 data cache vs. a large L1 data cache]

                               32KB cache / 2 cycles    128KB cache / 4 cycles
"sort 4000" miss rate:         very low                 very low
"sort 4000" execution time:    t                        t + x
"sort 16000" miss rate:        high                     very low
"sort 16000" execution time:   T                        T - X

Salient Features

• Low-cost: Exploits the benefits of repeaters

• Optimizes the access time/capacity trade-off

• Can reduce energy -- most efficient when cache size equals working set size


Control Mechanism

Gather statistics at periodic intervals (every 10K instructions).

Inspect stats: is there a phase change?
  – yes → exploration: run each configuration for an interval, then pick the best configuration
  – no  → remain at the selected configuration
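The flow above can be sketched as an explore/exploit loop; the callback interface (run_interval, phase_changed) is a hypothetical framing for illustration, not the hardware's actual interface:

```python
# Minimal sketch of the interval-based control mechanism: on a phase
# change, run every configuration for one interval, keep the best
# (highest IPC), and stay there until the next phase change.

def interval_controller(configs, run_interval, phase_changed):
    """
    configs       -- list of configuration identifiers
    run_interval  -- run_interval(cfg) -> IPC measured over one interval
    phase_changed -- phase_changed() -> True if the stats flag a new phase
    Yields the configuration in use after each interval.
    """
    best = configs[0]
    while True:
        if phase_changed():
            # Exploration: try each configuration for one interval.
            ipc = {cfg: run_interval(cfg) for cfg in configs}
            best = max(ipc, key=ipc.get)
        else:
            # Remain at the selected configuration.
            run_interval(best)
        yield best

# Hypothetical usage: configuration "B" has higher IPC, so exploration
# selects it and the controller then stays there.
ipc_of = {"A": 1.0, "B": 1.5}
phases = iter([True, False, False])
ctrl = interval_controller(["A", "B"], lambda c: ipc_of[c], lambda: next(phases))
print(next(ctrl), next(ctrl))  # B B
```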

Metrics

• Optimizing performance: metric for best configuration is simply instructions per cycle (IPC)

• Detecting a phase change: a change in branch frequency, a change in miss rate, or a sudden change in IPC signals a change in program phase

To avoid unnecessary explorations, the thresholds can be adapted at run-time.

Simulation Methodology

• Modified version of Simplescalar-3.0 -- includes many details on bus contention

• Executing programs from various benchmark sets (a mix of many program types)


Performance Results

[Figure: instructions per cycle (IPC) for em3d, health, mst, compress, hydro2d, apsi, swim, art, and the harmonic mean (HM): conventional vs. dynamic.]

Overall harmonic mean (HM) improvement: 17%

Energy Results

[Figure: energy per instruction (nJ/instr) for em3d, health, mst, compress, hydro2d, apsi, swim, art, and the arithmetic mean (AM): conventional 3-level vs. dynamic.]

Overall energy savings: 42%

Talk Outline

• Trade-offs in future microprocessors

• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads

• Future work

Conventional Processor Design

[Figure: I-Cache and branch predictor feed rename & dispatch, a single large issue queue, a monolithic register file, and the functional units (FUs).]

Large structures → slower clock speed

The Clustered Processor

[Figure: I-Cache and branch predictor feed rename & dispatch, which steers instructions to four clusters, each with its own issue queue (IQ), register file, and FUs.]

r1 ← r3 + r4
r41 ← r43 + r44
r2 ← r1 + r41   (operands produced in different clusters)

Small structures → faster clock speed
But, high latency for some instructions

Emerging Trends

• Wire delays and faster clocks will make each cluster smaller

• Larger transistor budgets and low design cost will enable the implementation of many clusters on the chip

• The support of many threads will require many resources and clusters

Numerous, small clusters will be a reality!


Communication Costs

[Figure: a 4-cluster processor and an 8-cluster processor; each cluster contains its own Regs, IQ, and FU.]

More clusters → more communication

Communication vs Parallelism

[Figure: instruction window snapshots — with 4 clusters the window holds 100 active instructions; with 8 clusters it holds 200, exposing additional ready instructions further ahead in the stream.]

Distant parallelism: distant instructions that are ready to execute.

Communication-Parallelism Trade-Off

• More clusters → more communication, more parallelism

• Selectively use more clusters:
  – if communication is tolerable
  – if there is additional distant parallelism


IPC with Many Clusters (ISCA’03)

[Figure: IPC for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and HM with 4-cluster, 8-cluster, and 16-cluster configurations.]

Trade-Off Management

• The clustered processor abstraction exposes the trade-off between communication and parallelism

• It also simplifies the management of resources -- we can disable a cluster by simply not dispatching instructions to it
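The point above — that a cluster is disabled simply by never dispatching to it — can be sketched as follows; the least-loaded steering policy and the class interface are assumptions for illustration:

```python
# Sketch: "disabling" clusters purely in the dispatch stage.
# Instructions are steered only to enabled clusters; disabled ones
# drain and sit idle. Least-loaded steering is an assumed policy.

class ClusteredDispatch:
    def __init__(self, n_clusters):
        self.load = [0] * n_clusters       # outstanding instructions per cluster
        self.enabled = list(range(n_clusters))

    def set_active(self, n):
        """Use only the first n clusters; no other hardware change needed."""
        self.enabled = list(range(n))

    def dispatch(self, instr):
        """Steer an instruction to the least-loaded enabled cluster."""
        target = min(self.enabled, key=lambda c: self.load[c])
        self.load[target] += 1
        return target

d = ClusteredDispatch(16)
d.set_active(4)                  # e.g. the interval mechanism chose 4 clusters
targets = [d.dispatch(f"i{k}") for k in range(8)]
print(targets)                   # only clusters 0-3 are ever used
```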


Control Mechanism

Gather statistics at periodic intervals (every 10K instructions).

Inspect stats: is there a phase change?
  – yes → exploration: run each configuration for an interval, then pick the best configuration
  – no  → remain at the selected configuration

The Interval Length

• Success depends on ability to repeat behavior across successive intervals

• Every program is likely to have phase changes at different granularities

• Must also pick the interval length at run-time


Picking the Interval Length

• Start with minimum allowed interval length

• If phase changes are too frequent, double the interval length – find a coarse enough granularity such that behavior is consistent

• Repeat every 10 billion instructions

• Small interval lengths can result in noisy measurements
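The procedure above can be sketched as follows; the 5% instability cutoff matches the acceptable values in the table of instability factors, while the profiling callback is a hypothetical stand-in for the hardware statistics:

```python
# Sketch of run-time interval-length selection: start at the minimum
# length and double it until phase changes become rare enough to act on.
# THRESHOLD is an assumed 5% instability cutoff.

MIN_INTERVAL = 10_000         # instructions
MAX_INTERVAL = 40_000_000
THRESHOLD = 0.05

def pick_interval_length(instability_at):
    """
    instability_at(length) -> fraction of intervals flagging a phase
    change when stats are gathered every `length` instructions.
    """
    length = MIN_INTERVAL
    while length < MAX_INTERVAL and instability_at(length) > THRESHOLD:
        length *= 2   # too noisy at this granularity; coarsen
    return length

# Hypothetical profile: unstable below 320K instructions, stable above.
profile = lambda n: 0.30 if n < 320_000 else 0.04
print(pick_interval_length(profile))  # 320000
```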


Varied Interval Lengths

Benchmark   Instability factor for a    Minimum acceptable interval length
            10K interval length         and its instability factor
---------   ------------------------    ----------------------------------
gzip         4%                         10K / 4%
vpr         14%                         320K / 5%
crafty      30%                         320K / 4%
parser      12%                         40M / 5%
swim         0%                         10K / 0%
mgrid        0%                         10K / 0%
galgel       1%                         10K / 1%
cjpeg        9%                         40K / 4%
djpeg       31%                         1280K / 1%

Instability factor: percentage of intervals that flag a phase change.


Results with Interval-Based Scheme

[Figure: IPC for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and HM: 4 clusters, 16 clusters, and the interval-based scheme.]

Overall improvement: 11%

Talk Outline

• Trade-offs in future microprocessors

• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads

• Future work

Pre-Execution


• Executing a subset of the program in advance

• Helps warm up various processor structures such as the cache and branch predictor

The Future Thread (ISCA’01)

• The main program thread executes every single instruction

• Some registers are reserved for the future thread so it can jump ahead

[Figure: the main thread's instruction stream, with the pre-execution (future) thread running further ahead of it.]

Key Innovations

• Ability to advance much further
  – eager recycling of registers
  – skipping idle instructions

• Integrating pre-executed results
  – re-using register results
  – correcting branch mispredicts
  – prefetching into the caches

• Allocation of resources

Trade-Offs in Resource Allocation

[Figure: registers partitioned between the main thread and the future thread running ahead of it.]

• Allocating more registers for the main thread favors nearby parallelism

• Allocating more registers for the future thread favors distant parallelism

• The interval-based mechanism can pick the optimal allocation

Pre-Execution Results

[Figure: IPC for em3d, mst, peri, swim, lucas, sp, bt, go, comp, and HM: base case, pre-execution with 12 registers, and pre-execution with dynamic allocation.]

Overall improvement with 12 registers: 11%
Overall improvement with dynamic allocation: 18%

Conclusion

• Emerging technologies will make trade-off management vital

• Approaches to hardware adaptation:
  – cache hierarchy
  – clustered processors
  – pre-execution threads

• The interval-based mechanism with exploration is robust and applies to most problem domains


Talk Outline

• Trade-offs in future microprocessors

• Dynamic resource management
  – On-chip cache hierarchy
  – Clustered processors
  – Pre-execution threads

• Future work

Future Scenarios

• Clustered designs can be used to produce all classes of processors

• A library of simple cluster cores – with different energy, clock speed, latency, and parallelism characteristics

• The role of the architect: putting these cores together on the chip and exploiting them to maximize performance


Heterogeneous Clusters

• Having different clusters on the chip provides many options for instruction steering

• For example, a program limited by communication will benefit from large, slow cluster cores

• Non-critical instructions of a program could be steered to slow, energy-efficient clusters -- such clusters can also help reduce processor hot-spots


Other Critical Problems

How does one build a highly clustered processor?

Where does the cache go?

What interconnect topology do we use?

How does multithreading affect these choices?


More Details…

Research synopses and papers available at:
http://www.cs.rochester.edu/~rajeev/research.html
