superpipelining: add more stages superscalar: multiple...
TRANSCRIPT
UC Regents Fall 2004 © UCBCS 152 L20: Advanced Processors I
CS 152 Computer Architecture and Engineering
Lecture 20 – Advanced Processors I
2004-11-09
Dave Patterson (www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
Last Time: Multipliers -- Space vs. Time
Five options for multiplier latency ...
The 1-cycle option is fully spatial.
The 35-cycle option is mini-Lab 2.
2, 4, 5 cycles?
Today: Beyond the 5-stage pipeline
Taxonomy: Introduction to advanced processor techniques.
Superpipelining: Increasing the number of pipeline stages.
Superscalar: Issuing several instructions in a single cycle.
5 Stage Pipeline: A point of departure
(From CS 152 L10, Pipeline Intro, Fall 2004 © UC Regents)
Graphically Representing the MIPS Pipeline
Can help with answering questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed?
[Pipeline diagram: IM, Reg, ALU, DM, Reg]
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
At best, the 5-stage pipeline
executes one instruction per
clock, with a clock period
determined by the slowest stage.
Reaching that best case requires:
Filling all delay slots (branch, load)
Perfect caching
Application does not need multi-cycle instructions (multiply, divide, etc)
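To make the equation above concrete, a small worked example (the instruction count, CPI, and clock period are hypothetical numbers, not from the lecture):

```python
# Iron law: Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
instructions = 1_000_000   # dynamic instruction count (hypothetical)
cpi          = 1.0         # ideal 5-stage pipeline: one instruction per clock
clock_period = 2e-9        # 2 ns clock (500 MHz), set by the slowest stage

seconds = instructions * cpi * clock_period
print(f"Execution time: {seconds * 1e3:.2f} ms")   # Execution time: 2.00 ms
```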
Superpipelining: Add more stages Today!
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Reduce critical path by
adding more pipeline stages.
Difficulties: Added penalties for
load delays and branch misses.
Ultimate Limiter: As logic delay
goes to 0, FF clk-to-Q and setup.
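A back-of-the-envelope model of that limit (all delay numbers are made-up placeholders): as stages are added, the logic per stage shrinks, but every stage still pays the flip-flop clk-to-Q and setup overhead.

```python
# Clock period of an N-stage pipeline, assuming the logic delay divides evenly
# across stages and each pipeline register costs a fixed overhead.
total_logic_ns = 10.0   # unpipelined critical path (placeholder)
ff_overhead_ns = 0.2    # clk-to-Q + setup per flip-flop (placeholder)

for n in (1, 2, 5, 10, 20, 50):
    period = total_logic_ns / n + ff_overhead_ns
    speedup = (total_logic_ns + ff_overhead_ns) / period
    print(f"{n:2d} stages: {period:5.2f} ns/cycle, speedup {speedup:5.2f}x")
# As n grows, the period floors at ff_overhead_ns: the FF overhead caps the gain.
```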
Excerpt: IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001, p. 1600.
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance of a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb in-
structions. A simplified diagram of the processor pipeline is
Fig. 2. Microprocessor pipeline organization.
shown in Fig. 2, where the state boundaries are indicated by
gray. Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
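A sketch of the hazard that paragraph describes (the instruction sequence and the one-cycle stall rule are my own simplification, not the paper's):

```python
# With the shifter and ALU in separate stages, a one-cycle stall occurs only
# when an instruction's result is immediately consumed as a *shifted* operand.
# Each entry: (dest register, source registers, source that must be shifted or None)
program = [
    ("r3", ("r1", "r2"), None),   # ADD r3, r1, r2
    ("r6", ("r0", "r3"), "r3"),   # ADD r6, r0, r3, LSL #1  <- shifted use of fresh r3
    ("r5", ("r3", "r4"), None),   # SUB r5, r3, r4          <- plain use, bypass covers it
]
stalls = sum(1 for prev, cur in zip(program, program[1:])
             if cur[2] is not None and cur[2] == prev[0])
print(f"extra stall cycles: {stalls}")   # 1
```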
Decoupled Instruction Fetch. A two-instruction deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
Example: 8-stage ARM XScale:
extra IF, ID, data cache stages.
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Improve CPI by issuing
several instructions per cycle.
Difficulties: Load and branch
delays affect more instructions.
Ultimate Limiter: Programs may
be a poor match to issue rules.
[Embedded CS 252 figures: "Function Unit Characteristics" -- function units have internal pipeline registers; operands are latched when an instruction enters a function unit, so inputs to a function unit (e.g., the register file) can change during a long-latency operation. A fully pipelined unit accepts a new instruction every cycle; a partially pipelined unit is busy for 2 cycles per accept. "Multiple Function Units" -- IF, ID, and Issue stages feed ALU, Mem, Fadd, Fmul, and Fdiv units, which write back (WB) to the GPRs and FPRs.]
Example: CPU with floating
point ALUs: issue 1 FP + 1
integer instruction per cycle.
Superscalar: Multiple issues per cycle Today!
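A minimal sketch of the lockstep issue rule in the example above (the 'int'/'fp' encoding of the instruction stream is mine, for illustration):

```python
# Lockstep dual issue: a pair goes together only if the first instruction is
# integer and the second is floating point; otherwise issue one per cycle.
def cycles_to_issue(instrs):
    cycles, i = 0, 0
    while i < len(instrs):
        if i + 1 < len(instrs) and instrs[i] == "int" and instrs[i + 1] == "fp":
            i += 2                    # dual issue this cycle
        else:
            i += 1                    # single issue
        cycles += 1
    return cycles

print(cycles_to_issue(["int", "fp", "int", "fp"]))   # 2 cycles -> CPI = 0.5
print(cycles_to_issue(["fp", "fp", "int", "int"]))   # 4 cycles: mix never fits the rule
```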
Out of Order: Going around stalls (Next Tuesday)
Goal: Issue instructions out of program order
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
[Embedded CS 252 slides: "In-order Issue Limitations: an example"

    Instruction           Latency
1   LD    F2, 34(R2)      1
2   LD    F4, 45(R3)      long
3   MULTD F6, F4, F2      3
4   SUBD  F8, F2, F2      1
5   DIVD  F4, F2, F8      4
6   ADDD  F10, F6, F4     1

In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
The in-order restriction prevents instruction 4 from being dispatched.

"Out-of-order Dispatch" (IF, ID, WB stages around Issue, ALU, Mem, Fadd, Fmul units):
The Issue stage buffer holds multiple instructions waiting to issue.
Decode adds the next instruction to the buffer if there is space and the instruction does not cause a WAR or WAW hazard.
Any instruction in the buffer whose RAW hazards are satisfied can be dispatched (for now, at most one dispatch per cycle). On a write back (WB), new instructions may get enabled.]
Example: MULTD waiting on F4 to load ...
... so let ADDD go first.
Difficulties: Bookkeeping is highly complex.
A poor fit for lockstep instruction scheduling.
Ultimate Limiter: The amount of instruction
level parallelism present in an application.
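A toy version of the dispatch rule in this example (register names follow the slide; real issue logic tracks far more state than a ready set):

```python
# Out-of-order dispatch: any buffered instruction whose sources are all ready
# may go, so ADDD need not wait behind a MULTD stalled on a load of F4.
ready = {"F0", "F2", "F6"}                    # F4 is still being loaded
buffer = [("MULTD", "F8",  ["F4", "F2"]),     # RAW on F4: must wait
          ("ADDD",  "F10", ["F0", "F6"])]     # all sources ready

for op, dest, srcs in buffer:
    action = "dispatch" if all(s in ready for s in srcs) else "wait"
    print(f"{op:5s} -> {action}")
# MULTD -> wait
# ADDD  -> dispatch  (the younger instruction goes around the stall)
```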
Dynamic Scheduling: End lockstep (Next Tuesday)
Goal: Enable out-of-order by breaking
pipeline in two: fetch and execution.
Limiters: Design complexity,
instruction level parallelism.
Example: IBM Power 5:
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions [6, 7]. The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction [7]. If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
(IEEE Micro, March–April 2004, p. 43.)
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit). The figure shows the branch, load/store, fixed-point, and floating-point pipelines, with out-of-order processing between group dispatch and group commit.
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit). The figure distinguishes resources shared by the two threads (instruction cache and translation, branch prediction, group formation and decode, shared-register mappers, shared issue queues, shared execution units FXU0/FXU1, LSU0/LSU1, FPU0/FPU1, BXU, and CRL, shared register files, store queue, data cache, and L2 cache) from per-thread resources (program counter, instruction buffers 0 and 1, and thread priority).
Throughput and multiple threads (Next Thursday)
Goal: Use multiple CPUs (real and virtual) to
improve (1) throughput of machines that run
many programs (2) execution time of multi-
threaded programs.
Difficulties: Gaining full advantage requires
rewriting applications, OS, libraries.
Ultimate limiter: Amdahl’s law, memory
system performance.
Example: Sun
Niagara (8 SPARCs
on one chip).
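A quick worked example of the Amdahl's law limit (the 90% parallel fraction is a made-up number):

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p/n) for parallel fraction p on n CPUs.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 8, 64):
    print(f"p=0.90, n={n:2d}: speedup = {speedup(0.90, n):4.2f}x")
# Even 64 CPUs give under 9x: the 10% serial fraction dominates.
```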
Administrivia: No class on Thursday!
HW 4: due Weds 11/10, 5PM, 283 Soda.
Friday 11/12: Lab 4 final demo in section.
Monday 11/15: Lab 4 final report due, 11:59 PM.
Final project (Lab 5) will be out soon
Administrivia: Mid-term and Field Trip
Xilinx field trip date: 11/30. Details on bus transport from Soda Hall soon.
Mid-Term II: Tuesday, 11/23, 5:30 to 8:30 PM, 101 Morgan.
Mid-Term II Review Session: Sunday, 11/21, 7-9 PM, 306 Soda.
Thanksgiving Holiday!
Superpipelining
Add pipeline stages, reduce clock period
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
[The ARM XScale paper excerpt (IEEE Journal of Solid-State Circuits, November 2001) shown earlier is repeated here.]
Q. Could adding pipeline stages
increase CPI for an application?
ARM XScale: 8 stages
A. Yes, due to these problems:

CPI Problem           Possible Solution
Extra branch delays   Branch prediction
Extra load delays     Optimize code
Structural hazards    Optimize code, add hardware
Hardware limits to superpipelining?
[Plot: CPU clock periods, 1985-2005, measured in FO4 delays, one trace per processor family: intel 386, 486, pentium, pentium 2, pentium 3, pentium 4, itanium; alpha 21064, 21164, 21264; sparc, supersparc, sparc64; mips; HP PA; Power PC; AMD K6, K7, x86-64.]
Thanks to Francois Labonte, Stanford
FO4: how many fanout-of-4 inverter delays in the clock period.
CPU Clock Periods, 1985-2005. Historical limit: about 12 FO4.
MIPS 2000: 5 stages. Pentium Pro: 10 stages. Pentium 4: 20 stages.
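To see what an FO4 budget means for clock rate, a quick sketch (the 25 ps per FO4 figure is a hypothetical process value, not from the slide):

```python
# Clock frequency implied by a given number of FO4 delays per cycle.
fo4_ps = 25.0                        # delay of one fanout-of-4 inverter (placeholder)
for budget in (30, 12, 6):           # FO4 delays per clock period
    freq_ghz = 1e3 / (budget * fo4_ps)
    print(f"{budget:2d} FO4/cycle -> {freq_ghz:.2f} GHz")
# Pushing below the ~12 FO4 historical limit leaves very little logic per stage.
```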
Is there an optimal pipeline depth?
Answer: 6 FO4 delays of useful logic per stage.
Methodology: Simulate standard benchmarks on many different pipeline designs.
Source: "The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays", M. S. Hrishikesh et al., ISCA 2002.
[Figure 4: performance (BIPS) vs. useful logic per stage (FO4) for vector FP, integer, and non-vector FP benchmarks. Figure 4a shows that when there is no latch overhead, performance improves as pipeline depth is increased. When latch and clock overheads are considered, maximum performance is obtained with 6 FO4 of useful logic per stage, as shown in Figure 4b.]
chitectural components can be perfectly pipelined and be partitioned into an arbitrary number of stages.

4.1 In-order Issue Processors

Figure 4a shows the harmonic mean of the performance of SPEC 2000 benchmarks for an in-order pipeline, if there were no overheads associated with pipelining (overhead = 0) and performance was inhibited only by the data and control dependencies in the benchmark. The x-axis in Figure 4a represents useful logic per stage (FO4) and the y-axis shows performance in billions of instructions per second (BIPS). Performance was computed as a product of IPC and the clock frequency—equal to 1/(useful logic per stage). The integer benchmarks have a lower overall performance compared to the vector floating-point (FP) benchmarks. The vector FP benchmarks are representative of scientific code that operates on large matrices and has more ILP than the integer benchmarks. Therefore, even though the execution core has just two floating-point units, the vector benchmarks outperform the integer benchmarks. The non-vector FP benchmarks represent scientific workloads of a different nature, such as numerical analysis and molecular dynamics. They have less ILP than the vector benchmarks, and consequently their performance is lower than both the integer and floating-point benchmarks. For all three sets of benchmarks, doubling the clock frequency does not double the performance. When useful logic per stage is reduced from 8 to 4 FO4, the ideal improvement in performance is 100%. However, for the integer benchmarks the improvement is only 18%. As useful logic per stage is further decreased, the improvement in performance deviates further from the ideal value.

Figure 4b shows performance of the in-order pipeline with latch and clock overhead set to 1.8 FO4. Unlike in Figure 4a, in this graph the clock frequency is determined by 1/(useful logic per stage + overhead). For example, at the point in the graph where useful logic per stage is equal to 8 FO4, the clock frequency is 1/(10 FO4). Observe that maximum performance is obtained when useful logic per stage corresponds to 6 FO4. In this experiment, when useful logic per stage is reduced from 10 to 6 FO4 the improvement in performance is only about 9% compared to a clock frequency improvement of 50%.
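The shape of Figure 4b can be reproduced with a toy model: clock frequency grows as 1/(useful logic + overhead) while IPC falls as pipelines deepen. Only the 1.8 FO4 overhead below comes from the paper; the IPC penalty term is invented for illustration.

```python
# Toy Figure-4b trade-off: performance = IPC * frequency.
OVERHEAD = 1.8                            # FO4 of latch/clock overhead per stage (paper)

def perf(useful):                         # useful logic per stage, in FO4
    freq = 1.0 / (useful + OVERHEAD)      # clock rate, arbitrary units
    ipc  = 1.0 / (1.0 + 20.0 / useful)    # invented penalty: deeper pipe, more hazards
    return freq * ipc

best = max(range(2, 17), key=perf)
print(f"best useful logic per stage: {best} FO4")   # 6, matching the paper's optimum
```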
4.2 Comparison with the CRAY-1S

Kunkel and Smith [9] observed for the Cray-1S that maximum performance can be achieved with 8 gate levels of useful logic per stage for scalar benchmarks and 4 gate levels for vector benchmarks. If the Cray-1S were to be designed in CMOS logic today, the equivalent latency of one logic level would be about 1.36 FO4, as derived in Appendix A. For the Cray-1S computer this equivalent would place the optimal useful logic per stage at 10.9 FO4 for scalar and 5.4 FO4 for vector benchmarks. The optimal value for vector benchmarks has remained more or less unchanged, largely because the vector benchmarks have ample ILP, which is exploited sufficiently well by both the in-order superscalar pipeline and the Cray-1S. The optimal value for integer benchmarks has more than halved since the time of the Cray-1S processor, which means that a processor designed using modern techniques can be clocked at more than twice the frequency.

One reason for the decrease in the optimal value for integer benchmarks is that in modern pipelines average memory access latencies are lower, due to on-chip caches. The Alpha 21264 has a two-level cache hierarchy comprising a 3-cycle, level-1 data cache and an off-chip unified level-2 cache. In the Cray-1S all loads and stores directly accessed a 12-cycle memory. Integer benchmarks have a large number of dependencies, and any instruction dependent on loads would stall the pipeline for 12 cycles. With performance bottlenecks in the memory system, increasing clock frequency by pipelining more deeply does not improve performance. We examined the effect of scaling a superscalar, in-order pipeline with a memory system similar to the CRAY-1S (12-cycle memory access, no caches) and found that the optimal useful logic per stage was 11 FO4 for integer benchmarks.

A second reason for the decrease in the optimal value is the change in implementation technology. Kunkel and Smith assumed the processor was implemented using many chips at relatively small levels of integration, without binning of parts to reduce manufacturer's worst case delay variations. Consequently, they assumed overheads due to latches, data, and clock skew that were as much as 2.5 gate delays [9] (3.4 FO4).
In Resources section of class website
Superscalar
[The CS 252 "Function Unit Characteristics" and "Multiple Function Units" figures shown earlier are repeated here.]
Example: Superscalar MIPS. Fetches 2 instructions at a time. If the first is integer and the second is floating point, issue both in the same cycle.
Superscalar: A simple example ...
Integer instruction      FP instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)           ADDD F4,F0,F2
LD F14,-24(R1)           ADDD F8,F6,F2
LD F18,-32(R1)           ADDD F12,F10,F2
SD 0(R1),F4              ADDD F16,F14,F2
SD -8(R1),F8             ADDD F20,F18,F2
SD -16(R1),F12
SD -24(R1),F16

(Rows with both columns filled: two issues per cycle. Rows with one column filled: one issue per cycle.)
[The CS 252 "Function Unit Characteristics" and "Multiple Function Units" figures shown earlier are repeated here.]
Three instructions affected by a single cycle of load delay. Why?
Superscalar: Visualizing the pipeline
Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
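One way to count the exposed slots (a sketch, assuming the two-wide issue shown above): the load's result arrives too late for the rest of its own issue packet and for the entire next packet.

```python
# 2-wide issue with a 1-cycle load-use delay: instructions denied the result are
# the load's packet partner plus both slots of the following packet.
issue_width = 2
load_slot   = 0                            # load is the first slot of its packet
partners    = issue_width - 1 - load_slot  # rest of the load's own packet
next_packet = issue_width                  # everything issued the next cycle
print(partners + next_packet)              # 3 instructions affected
```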
Limitations of “lockstep” superscalar
Only get 0.5 CPI for a 50/50 mix of float and integer ops with no hazards.
Extending scheme to general mixes and more instructions is complicated.
If one accepts building a complicated machine, there are better ways to do it.
[The Power5 excerpt and Figures 3 and 4 (IEEE Micro, March–April 2004) shown earlier are repeated here.]
Next time: Dynamic Scheduling
Recall: Branch Predictors are Caches
[Diagram: the branch PC (0b0110[...]01001000, holding BNEZ R1 Loop) indexes the Branch Target Buffer (BTB). Each BTB entry holds a 28-bit address tag, compared (=) against the upper PC bits 0b0110[...]0100 to signal a Hit, plus the "Taken" target address (PC + 4 + Loop). A separate Branch History Table (BHT) holds 2 bits per entry and supplies the "Taken" or "Not Taken" prediction. Note: the BHT can be larger than the BTB and does not need a tag. The 10/7 lecture has BHT details.]
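A minimal model of that lookup (the index/tag split and table sizes below are toy values, not the slide's exact 28-bit geometry):

```python
# Branch prediction with a tagged BTB (target addresses) and an untagged BHT
# (2-bit counters). Predict taken only on a BTB tag hit with counter >= 2.
BTB = {0b0110_0100: 0x4000_1020}   # tag -> predicted "Taken" address (toy values)
BHT = [0b11] * 1024                # 2-bit counters, initialized "strongly taken"

def predict(pc):
    tag = pc >> 4                          # toy split; the slide uses a 28-bit tag
    taken = BHT[pc % len(BHT)] >= 0b10     # high bit of the 2-bit counter
    if tag in BTB and taken:               # Hit, and BHT says taken
        return BTB[tag]
    return pc + 4                          # otherwise fall through

pc = (0b0110_0100 << 4) | 0b1000           # PC of the BNEZ in the example
print(hex(predict(pc)))                    # 0x40001020
```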
Conclusion: Superpipelining, Superscalar
The 5 stage pipeline: a starting point for performance enhancements, a building block for multiprocessing.
Superscalar: Multiple instructions at once. Programs must fit the issue rules. Adds complexity.
Superpipelining: Reduce critical path by adding more pipeline stages. Has the potential to increase the CPI.