

Page 1: Superpipelining: Add more stages / Superscalar: Multiple issues per cycle (source: cs152/fa04/lecnotes/lec11-1.6page.pdf)

UC Regents Fall 2004 © UCB. CS 152 L20: Advanced Processors I

2004-11-09

Dave Patterson

(www.cs.berkeley.edu/~patterson)

John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 20 – Advanced Processors I

www-inst.eecs.berkeley.edu/~cs152/


Five options for multiplier latency ...

1-cycle option is fully spatial.

35-cycle is mini-Lab 2.

2, 4, 5 cycles?

Last Time: Multipliers -- Space vs. Time


Today: Beyond the 5-stage pipeline

Taxonomy: Introduction to advanced processor techniques.

Superpipelining: Increasing the number of pipeline stages.

Superscalar: Issuing several instructions in a single cycle.


5 Stage Pipeline: A point of departure

(Embedded slide: CS 152 L10 Pipeline Intro, Fall 2004 © UC Regents. "Graphically Representing MIPS Pipeline" can help with answering questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed?)

[Pipeline diagram: IM -> Reg -> ALU -> DM -> Reg]

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. This requires:

Filling all delay slots (branch, load)

Perfect caching

Application does not need multi-cycle instructions (multiply, divide, etc.)
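The performance equation above can be sanity-checked with a short sketch. The instruction count, CPI values, and clock period below are made-up numbers, not from the slides:

```python
# Iron law of performance: Seconds/Program =
#   (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
def cpu_time(instructions, cpi, clock_period_s):
    """Execution time in seconds for a program."""
    return instructions * cpi * clock_period_s

# Hypothetical program: 1 billion instructions on a 500 MHz (2 ns) clock.
ideal = cpu_time(1_000_000_000, 1.0, 2e-9)    # ideal 5-stage CPI of 1.0
stalled = cpu_time(1_000_000_000, 1.4, 2e-9)  # stalls push CPI to 1.4
print(ideal, stalled)
```

Any of the three factors can be attacked; superpipelining targets the clock period, superscalar targets CPI.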


Superpipelining: Add more stages (Today!)

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Goal: Reduce critical path by adding more pipeline stages.

Difficulties: Added penalties for load delays and branch misses.

Ultimate Limiter: As logic delay goes to 0, FF clk-to-Q and setup.
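A small model makes the stated difficulties and the flip-flop overhead limit concrete. All constants below are hypothetical, chosen only to show the shape of the trade-off:

```python
# Sketch of the superpipelining trade-off. Hypothetical numbers: 10 ns of
# total logic, 0.3 ns of flip-flop overhead (clk-to-Q + setup) per stage,
# and a branch penalty that grows with pipeline depth.
def cycle_time(total_logic_ns, stages, ff_overhead_ns):
    # Each stage carries its share of the logic plus the fixed FF overhead.
    return total_logic_ns / stages + ff_overhead_ns

def performance(stages, total_logic_ns=10.0, ff_overhead_ns=0.3,
                branch_freq=0.2, mispredict_rate=0.25):
    # Assume a mispredicted branch costs roughly one cycle per pipeline stage.
    cpi = 1.0 + branch_freq * mispredict_rate * stages
    return 1.0 / (cpi * cycle_time(total_logic_ns, stages, ff_overhead_ns))

best = max(range(1, 41), key=performance)
print(best)  # an intermediate depth wins, not the deepest pipeline
```

Deeper pipelines keep shrinking the cycle time, but the fixed per-stage overhead and the growing branch penalty eventually dominate, so the model peaks at an intermediate depth.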

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance of a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- [...]


Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Goal: Improve CPI by issuing several instructions per cycle.

Difficulties: Load and branch delays affect more instructions.

Ultimate Limiter: Programs may be a poor match to issue rules.
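The issue rule on this slide (one integer plus one floating-point instruction per cycle) can be sketched as a lockstep pairing check. The 'int'/'fp' instruction classes are a simplification, and the sketch ignores load and branch delays:

```python
# Lockstep dual-issue rule: each cycle, look at the next two instructions
# and issue both only when the first is integer and the second is FP.
def issue_cycles(program):
    """Return the number of issue slots used in each cycle."""
    per_cycle = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and program[i] == "int" and program[i + 1] == "fp":
            per_cycle.append(2)
            i += 2
        else:
            per_cycle.append(1)
            i += 1
    return per_cycle

ideal = issue_cycles(["int", "fp"] * 3)  # perfect mix: CPI = 0.5
mixed = issue_cycles(["int", "int", "fp", "fp", "int", "fp"])
print(len(ideal) / 6, len(mixed) / 6)    # CPI 0.5 vs. about 0.67
```

This shows the "poor match to issue rules" limiter directly: the same six instructions cost more cycles when the mix breaks the pairing pattern.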

(Embedded slide, dated March 10, 2004: Function Unit Characteristics.)

[Table: function units are either fully pipelined, accepting a new instruction every cycle, or partially pipelined, busy for e.g. 2 cycles between accepted instructions.]

Function units have internal pipeline registers: operands are latched when an instruction enters a function unit, so inputs to a function unit (e.g., the register file) can change during a long-latency operation.

(Embedded slide: Multiple Function Units. Front end IF, ID, Issue feeding function units ALU, Mem, Fadd, Fmul, Fdiv, with register files GPRs and FPRs.)

Example: CPU with floating point ALUs: issue 1 FP + 1 integer instruction per cycle.

Superscalar: Multiple issues per cycle (Today!)


Out of Order: Going around stalls (Next Tuesday)

Goal: Issue instructions out of program order

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

(Embedded slides, dated March 2004: In-order Issue Limitations, and Out-of-order Dispatch.)

[Code table: a sequence LD, LD, MULTD, SUBD, DIVD, ADDD with its cycle-by-cycle in-order issue timeline; the in-order restriction prevents a later, ready instruction from being dispatched.]

Out-of-order dispatch:

The issue stage buffer holds multiple instructions waiting to issue.

Decode adds the next instruction to the buffer if there is space and the instruction does not cause a WAR or WAW hazard.

Any instruction in the buffer whose RAW hazards are satisfied can be dispatched, even out of program order. On a write back (WB), new instructions may get enabled.

Example: MULTD waiting on F4 to load ... so let ADDD go first.

Difficulties: Bookkeeping is highly complex. A poor fit for lockstep instruction scheduling.

Ultimate Limiter: The amount of instruction level parallelism present in an application.
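The dispatch rule can be sketched in a few lines of Python. This is a toy model, not the actual scoreboard bookkeeping: register names and the initial ready set are made up to mirror the MULTD/ADDD example:

```python
# Toy out-of-order dispatch from an issue buffer: an instruction may
# dispatch once all of its source registers are ready, even if an older
# instruction is still waiting on an operand.
def dispatch_order(program, initially_ready):
    """program: list of (name, dests, sources). Returns dispatch order."""
    ready = set(initially_ready)
    waiting = list(program)
    order = []
    while waiting:
        for instr in waiting:
            name, dests, sources = instr
            if all(s in ready for s in sources):  # all RAW hazards satisfied
                order.append(name)
                ready.update(dests)               # "write back" enables others
                waiting.remove(instr)
                break
        else:
            raise RuntimeError("deadlock: no instruction can dispatch")
    return order

# MULTD waits on F4 (a pending load), so ADDD goes first:
prog = [("MULTD", ["F0"], ["F2", "F4"]),
        ("ADDD",  ["F6"], ["F8", "F2"]),
        ("LD",    ["F4"], [])]
print(dispatch_order(prog, ["F2", "F8"]))  # ['ADDD', 'LD', 'MULTD']
```

The real hardware also has to handle WAR/WAW hazards and structural conflicts, which is where the "highly complex bookkeeping" comes from.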


Dynamic Scheduling: End lockstep (Next Tuesday)

Goal: Enable out-of-order by breaking the pipeline in two: fetch and execution.

Limiters: Design complexity, instruction level parallelism.

Example: IBM Power 5:

The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For [...]

(MARCH–APRIL 2004, p. 43)

[Figure residue: Power5 pipeline stage diagram. Instruction fetch (IF, IC, BP), group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), then the out-of-order processing region with branch (MP ISS RF EX WB Xfer), load/store (MP ISS RF EA DC WB Xfer), fixed-point (MP ISS RF EX WB Xfer), and floating-point (MP ISS RF F6 WB Xfer) pipelines, plus branch redirects and interrupts and flushes feeding back, ending in Fmt and CP.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).

[Figure residue: Power5 instruction data flow. Resources shared by two threads: instruction cache and instruction translation, branch prediction (branch history tables, return stack, target cache), shared-register mappers, shared issue queues, shared execution units (LSU0, LSU1, FXU0, FXU1, FPU0, FPU1, BXU, CRL), read/write shared-register files, group completion, store queue, data cache, data translation, L2 cache. Per-thread resources: program counter, instruction buffers 0 and 1, thread priority, dynamic instruction selection.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).


Throughput and multiple threads (Next Thursday)

Goal: Use multiple CPUs (real and virtual) to improve (1) throughput of machines that run many programs, (2) execution time of multi-threaded programs.

Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.

Ultimate limiter: Amdahl's law, memory system performance.

Example: Sun Niagara (8 SPARCs on one chip).
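Amdahl's law, the limiter named above, is easy to sketch. The 80% parallel fraction below is a made-up workload, not a measured one:

```python
# Amdahl's law: speedup from N CPUs is capped by the serial fraction.
def amdahl_speedup(parallel_fraction, n_cpus):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cpus)

# Even with 8 cores, a program that is only 80% parallel gains < 3.4x,
# and no number of cores can push it past 1 / 0.2 = 5x.
print(amdahl_speedup(0.80, 8))
print(amdahl_speedup(0.80, 10**6))
```

This is why the slide says full advantage requires rewriting applications, OS, and libraries: only shrinking the serial fraction raises the ceiling.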


Administrivia: No class on Thursday!

HW 4: due Weds 11/10, 5PM, 283 Soda.

Friday 11/12: Lab 4 final demo in section.

Monday 11/15: Lab 4 final report due, 11:59 PM.

Final project (Lab 5) will be out soon.


Administrivia: Mid-term and Field Trip

Xilinx field trip date: 11/30. Details on bus transport from Soda Hall soon.

Mid-Term II: Tuesday, 11/23, 5:30 to 8:30 PM, 101 Morgan.

Mid-Term II Review Session: Sunday, 11/21, 7-9 PM, 306 Soda.

Thanksgiving Holiday!


Superpipelining


Add pipeline stages, reduce clock period

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)


Q. Could adding pipeline stages reduce CPI for an application? (ARM XScale: 8 stages)

A. Yes, due to these problems:

CPI Problem | Possible Solution
Extra branch delays | Branch prediction
Extra load delays | Optimize code
Structural hazards | Optimize code, add hardware


Hardware limits to superpipelining?

[Chart: CPU clock periods, 1985-2005, measured in FO4 delays (how many fanout-of-4 inverter delays fit in the clock period), for Intel 386/486/Pentium through Pentium 4, Alpha 21064/21164/21264, SPARC/SuperSPARC, MIPS, HP PA, Power PC, and AMD parts. Historical limit: about 12 FO4. Annotations: MIPS 2000: 5 stages; Pentium Pro: 10 stages; Pentium 4: 20 stages. Thanks to Francois Labonte, Stanford.]

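The FO4 count maps directly to clock rate. A rough sketch, using a 1.8 FO4 latch overhead (the figure used in the Hrishikesh et al. study cited on this slide) and an assumed FO4 delay of 25 ps, a ballpark figure for a mid-2000s process rather than a value from the slides:

```python
# Relate pipeline depth (useful logic per stage, in FO4) to clock rate.
def clock_ghz(fo4_per_stage, ff_overhead_fo4=1.8, fo4_ps=25.0):
    """Clock frequency when each stage holds fo4_per_stage of useful logic."""
    period_ps = (fo4_per_stage + ff_overhead_fo4) * fo4_ps
    return 1000.0 / period_ps

print(round(clock_ghz(12), 2))  # historical-limit pipelines (~12 FO4): 2.9
print(round(clock_ghz(6), 2))   # aggressive superpipelines (~6 FO4): 5.13
```

Halving the useful logic per stage does not halve the period, because the fixed latch overhead is paid once per stage regardless of depth.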

Is there an optimal pipeline depth?

6 FO4 delays

Methodology: Simulate standard benchmarks on many different pipeline designs.

Source: "The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays". M. S. Hrishikesh et al., ISCA 2002.

[Graphs: performance (BIPS) versus useful logic per stage (FO4), from 2 to 16, for Vector FP, Integer, and Non-vector FP benchmarks; panels (a) and (b).]

Figure 4: In-order pipeline performance with and without latch overhead. Figure 4a shows that when there is no latch overhead, performance improves as pipeline depth is increased. When latch and clock overheads are considered, maximum performance is obtained with 6 FO4 useful logic per stage, as shown in Figure 4b.

[... ar]chitectural components can be perfectly pipelined and be partitioned into an arbitrary number of stages.

4.1 In-order Issue Processors

Figure 4a shows the harmonic mean of the performance of SPEC 2000 benchmarks for an in-order pipeline, if there were no overheads associated with pipelining ( = 0) and performance was inhibited by only the data and control dependencies in the benchmark. The x-axis in Figure 4a represents [the useful logic per stage] and the y-axis shows performance in billions of instructions per second (BIPS). Performance was computed as a product of IPC and the clock frequency, equal to 1/ . The integer benchmarks have a lower overall performance compared to the vector floating-point (FP) benchmarks. The vector FP benchmarks are representative of scientific code that operate on large matrices and have more ILP than the integer benchmarks. Therefore, even though the execution core has just two floating-point units, the vector benchmarks outperform the integer benchmarks. The non-vector FP benchmarks represent scientific workloads of a different nature, such as numerical analysis and molecular dynamics. They have less ILP than the vector benchmarks, and consequently their performance is lower than both the integer and floating-point benchmarks. For all three sets of benchmarks, doubling the clock frequency does not double the performance. When [the useful logic per stage] is reduced from 8 to 4 FO4, the ideal improvement in performance is 100%. However, for the integer benchmarks the improvement is only 18%. As it is further decreased, the improvement in performance deviates further from the ideal value.

Figure 4b shows performance of the in-order pipeline with [latch overhead] set to 1.8 FO4. Unlike in Figure 4a, in this graph the clock frequency is determined by 1/( + ). For example, at the point in the graph where [the useful logic per stage] is equal to 8 FO4, the clock frequency is 1/(10 FO4). Observe that maximum performance is obtained when it corresponds to 6 FO4. In this experiment, when it is reduced from 10 to 6 FO4 the improvement in performance is only about 9% compared to a clock frequency improvement of 50%.

4.2 Comparison with the CRAY-1S

Kunkel and Smith [9] observed for the Cray-1S that maximum performance can be achieved with 8 gate levels of useful logic per stage for scalar benchmarks and 4 gate levels for vector benchmarks. If the Cray-1S were to be designed in CMOS logic today, the equivalent latency of one logic level would be about 1.36 FO4, as derived in Appendix A. For the Cray-1S computer this equivalent would place the optimal [depth] at 10.9 FO4 for scalar and 5.4 FO4 for vector benchmarks. The optimal [depth] for vector benchmarks has remained more or less unchanged, largely because the vector benchmarks have ample ILP, which is exploited sufficiently well by both the in-order superscalar pipeline and the Cray-1S. The optimal [depth] for integer benchmarks has more than halved since the time of the Cray-1S processor, which means that a processor designed using modern techniques can be clocked at more than twice the frequency.

One reason for the decrease in the optimal [depth] of integer benchmarks is that in modern pipelines average memory access latencies are lower, due to on-chip caches. The Alpha 21264 has a two-level cache hierarchy comprising of a 3-cycle, level-1 data cache and an off-chip unified level-2 cache. In the Cray-1S all loads and stores directly accessed a 12-cycle memory. Integer benchmarks have a large number of dependencies, and any instruction dependent on loads would stall the pipeline for 12 cycles. With performance bottlenecks in the memory system, increasing clock frequency by pipelining more deeply does not improve performance. We examined the effect of scaling a superscalar, in-order pipeline with a memory system similar to the CRAY-1S (12 cycle memory access, no caches) and found that the optimal [depth] was 11 FO4 for integer benchmarks.

A second reason for the decrease in optimal [depth] is the change in implementation technology. Kunkel and Smith assumed the processor was implemented using many chips at relatively small levels of integration, without binning of parts to reduce manufacturer's worst case delay variations. Consequently, they assumed overheads due to latches, data, and clock skew that were as much as 2.5 gate delays [9] (3.4 FO4).


In Resources section of class website


Superscalar



Example: Superscalar MIPS. Fetches 2 instructions at a time. If the first is integer and the second floating point, issue in the same cycle.

Superscalar: A simple example ...

Integer instruction | FP instruction
LD F0,0(R1)     |
LD F6,-8(R1)    |
LD F10,-16(R1)  | ADDD F4,F0,F2
LD F14,-24(R1)  | ADDD F8,F6,F2
LD F18,-32(R1)  | ADDD F12,F10,F2
SD 0(R1),F4     | ADDD F16,F14,F2
SD -8(R1),F8    | ADDD F20,F18,F2
SD -16(R1),F12  |
SD -24(R1),F16  |

Paired rows: two issues per cycle. Unpaired rows: one issue per cycle.
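The issue pairing in the table can be checked mechanically. A sketch that only counts issue slots; it ignores the load and FP latencies that the real schedule must also respect:

```python
# The unrolled-loop schedule: LD/SD use the integer slot, ADDD the FP slot,
# and a cycle may pair one of each. None means an empty FP slot that cycle.
schedule = [
    ("LD F0,0(R1)",    None),
    ("LD F6,-8(R1)",   None),
    ("LD F10,-16(R1)", "ADDD F4,F0,F2"),
    ("LD F14,-24(R1)", "ADDD F8,F6,F2"),
    ("LD F18,-32(R1)", "ADDD F12,F10,F2"),
    ("SD 0(R1),F4",    "ADDD F16,F14,F2"),
    ("SD -8(R1),F8",   "ADDD F20,F18,F2"),
    ("SD -16(R1),F12", None),
    ("SD -24(R1),F16", None),
]
cycles = len(schedule)
instructions = sum(1 for row in schedule for slot in row if slot)
print(cycles, instructions, round(cycles / instructions, 2))  # 9 14 0.64
```

Fourteen instructions in nine cycles gives a CPI of about 0.64, short of the 0.5 ideal because four cycles leave the FP slot empty.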



Three instructions affected by a single cycle of load delay. Why?

Superscalar: Visualizing the pipeline

Type             | Pipe stages
Int. instruction | IF ID EX MEM WB
FP instruction   | IF ID EX MEM WB
Int. instruction |    IF ID EX MEM WB
FP instruction   |    IF ID EX MEM WB
Int. instruction |       IF ID EX MEM WB
FP instruction   |       IF ID EX MEM WB

(Each int/FP pair issues together; successive pairs are staggered by one cycle.)


Limitations of “lockstep” superscalar

Only get 0.5 CPI for a 50/50 mix of float and integer ops with no hazards.

Extending scheme to general mixes and more instructions is complicated.

If one accepts building a complicated machine, there are better ways to do it.


Next time: Dynamic Scheduling


Recall: Branch Predictors are Caches

[Diagram: the branch instruction 0b0110[...]01001000 BNEZ R1 Loop. Its PC (0b0110[...]0100) indexes both structures. The Branch History Table (BHT) holds 2 bits per entry and predicts "Taken" or "Not Taken". The Branch Target Buffer (BTB) holds a 28-bit address tag and the target address; a tag match (=) signals a Hit and supplies the "Taken" address, PC + 4 + Loop.]

Note: the BHT can be larger than the BTB, and does not need a tag. (The 10/7 lecture has BHT details.)
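The two structures on this slide can be sketched together, with a 2-bit saturating counter per BHT entry. The table size, the modulo index in place of real PC bit-slicing, and the training loop are made up for illustration:

```python
# Sketch of a BHT (untagged, direction only) plus a BTB (tagged, target only).
class BranchPredictor:
    def __init__(self, bht_entries=1024):
        self.bht = [1] * bht_entries  # 2-bit counters, start weakly not-taken
        self.btb = {}                 # tagged: PC -> predicted target

    def predict(self, pc):
        taken = self.bht[pc % len(self.bht)] >= 2
        # Only a BTB hit supplies a target; otherwise fall through to PC + 4.
        target = self.btb.get(pc, pc + 4) if taken else pc + 4
        return taken, target

    def update(self, pc, taken, target):
        i = pc % len(self.bht)
        self.bht[i] = min(3, self.bht[i] + 1) if taken else max(0, self.bht[i] - 1)
        if taken:
            self.btb[pc] = target

bp = BranchPredictor()
for _ in range(3):                    # train on a loop branch at PC 0x400
    bp.update(0x400, True, 0x3F0)
print(bp.predict(0x400))              # taken, target 0x3F0
```

The asymmetry the slide points out falls straight out of the structure: every PC indexes some BHT counter (no tag needed, aliasing is tolerated), while the BTB must tag-match because a wrong target is a wrong fetch address.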


Conclusion: Superpipelining, Superscalar

The 5 stage pipeline: a starting point for performance enhancements, a building block for multiprocessing.

Superscalar: Multiple instructions at once. Programs must fit the issue rules. Adds complexity.

Superpipelining: Reduce critical path by adding more pipeline stages. Has the potential to increase the CPI.
