A New Era in Processor Evolution. Dezső Sima, Fall 2007 (Ver. 2.2)
Foreword
Beginning with second generation superscalars, the continuous, approximately 10-fold-per-decade increase in processor efficiency leveled off, for reasons shown in Chapter I. Designers responded by aggressively raising clock frequencies, at rates of up to 100-fold per decade, in order to sustain an approximately 100-fold-per-decade performance increase. Such rapid progress, however, inevitably encountered its limits due to declining processor efficiency, increasing dissipation and skew in parallel buses, as shown in this chapter. As a consequence, a decade-long era of processor evolution, characterized by aggressively rising clock frequencies, has ended in the last few years. The new era is heralded by multicore and multithreaded designs, as discussed in Chapters III and IV.
Contents
1. Processor performance
2. Efficiency of processors
3. Addressing the levelling off of processor efficiency
4. Aggressively raising clock frequency
5. The efficiency wall
6. The thermal wall
7. The skew wall
8. EPIC architectures/processors
9. The end of an era in processor evolution
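1.1. Introduction (1)
Absolute performance
Number of successfully executed instructions/sec:
$P_{ai} = f_c \cdot IPC_{eff}$
Number of successfully executed operations/sec (SIMD):
$P_{ao} = f_c \cdot IPC_{eff} \cdot OPI$
$f_c$: clock frequency, IPC: instructions/cycle, OPI: operations/instruction
Relative performance
Relating the execution times of a benchmark program on the tested system to a reference system according to the following interpretation:
$P_r = \sqrt[n]{\frac{t_{ref_1}}{t_{v_1}} \cdot \ldots \cdot \frac{t_{ref_n}}{t_{v_n}}}$
where $t_{ref_i}$ and $t_{v_i}$ are the runtimes of the i-th component program on the reference system and on the tested system
E.g.: SPECint92, SPECint_base2000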
1.1. Introduction (2)
In general purpose applications:
$OPI \approx 1$
$IPC_{eff} = \eta \cdot IPC$
where:
IPC: issued instructions per cycle
η: ratio of successfully executed to issued instructions (efficiency of the speculative execution)
$P_a \approx P_{ai} = f_c \cdot IPC_{eff}$
In performance/efficiency studies:
Theoretical interpretation: $P_a$
Practical measurement: $P_r$
1.1. Introduction (3)
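Does the following hold?
$\frac{P_{a2}}{P_{a1}} = \frac{P_{r2}}{P_{r1}}$
If the following were true (indices 1 ... n denote the component programs of the benchmark):
$\frac{t_{ref_1}}{t_{v_1}} = \frac{t_{ref_2}}{t_{v_2}} = \ldots = \frac{t_{ref_n}}{t_{v_n}} = \frac{t_{ref}}{t_v}$, and thus $P_r = \frac{t_{ref}}{t_v}$
In that case (indices 1 and 2 denote the two systems compared):
$\frac{P_{r2}}{P_{r1}} = \frac{t_{ref}/t_{v2}}{t_{ref}/t_{v1}} = \frac{t_{v1}}{t_{v2}} = \frac{I/P_{a1}}{I/P_{a2}} = \frac{P_{a2}}{P_{a1}}$
I: Number of instructions in the application considered
1.1. Introduction (4)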
However:
Figure 1.1: Runtime ratios of the component programs of SPECint2000
Source: http://www.spec.org
1.1. Introduction (5)
When comparing the performance of two systems:
$\frac{P_{a2}}{P_{a1}} \approx \frac{P_{r2}}{P_{r1}}$
This estimation is usable in trend considerations.
1.1. Introduction (6)
Comparing the efficiency of two systems:
1.1. Introduction (7)
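$\frac{IPC_{eff2}}{IPC_{eff1}} = \frac{P_{a2}/f_{c2}}{P_{a1}/f_{c1}} = \frac{P_{a2}}{P_{a1}} \cdot \frac{f_{c1}}{f_{c2}} \approx \frac{P_{r2}}{P_{r1}} \cdot \frac{f_{c1}}{f_{c2}} = \frac{P_{r2}/f_{c2}}{P_{r1}/f_{c1}}$
That is:
$\frac{IPC_{eff2}}{IPC_{eff1}} \approx \frac{P_{r2}/f_{c2}}{P_{r1}/f_{c1}}$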
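The relations of Section 1.1 can be tried out numerically. The following is a minimal sketch, assuming that the relative performance is the geometric mean of the per-program runtime ratios (as in the SPEC suites) and that efficiency is estimated as relative performance divided by clock frequency. The function names and all runtimes and frequencies are hypothetical, chosen only for illustration.

```python
import math


def relative_performance(t_ref, t_v):
    """Geometric mean of the runtime ratios t_ref_i / t_v_i (P_r)."""
    if len(t_ref) != len(t_v) or not t_ref:
        raise ValueError("need two non-empty runtime lists of equal length")
    log_sum = sum(math.log(r / v) for r, v in zip(t_ref, t_v))
    return math.exp(log_sum / len(t_ref))


def efficiency_ratio(p_r1, f_c1, p_r2, f_c2):
    """Estimate IPC_eff2 / IPC_eff1 as (P_r2 / f_c2) / (P_r1 / f_c1)."""
    return (p_r2 / f_c2) / (p_r1 / f_c1)


# Hypothetical runtimes in seconds for three component programs.
t_ref = [100.0, 200.0, 400.0]
t_sys1 = [50.0, 100.0, 200.0]   # every program runs 2x faster than the reference
t_sys2 = [25.0, 50.0, 100.0]    # every program runs 4x faster than the reference

p_r1 = relative_performance(t_ref, t_sys1)
p_r2 = relative_performance(t_ref, t_sys2)

# Hypothetical clock frequencies in GHz: system 2 runs at 3x the clock of system 1.
ratio = efficiency_ratio(p_r1, 1.0, p_r2, 3.0)
print(f"P_r1 = {p_r1:.3f}, P_r2 = {p_r2:.3f}, IPC_eff2/IPC_eff1 = {ratio:.3f}")
```

In this made-up case system 2 is twice as fast as system 1 but needs three times the clock frequency, so its estimated efficiency is only two thirds of that of system 1.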
1.2. Evolution of processor performance (1)
Figure 1.2: Integer performance growth of Intel’s x86 processors
[Figure: SPECint92 (log scale, 0.2 to 10000) vs. year (1979 to 2005) for Intel's x86 processors, from the 8088/5, 8088/8, 80286/10, 80286/12, 386/16 to 386/33, 486/25 to 486-DX4/100, Pentium/66 to Pentium/200, Pentium Pro/200, PII/300 to PII/450, PIII/500 to PIII/1000, up to P4/1500 to P4/3200 (Northwood B, Prescott (1M), Prescott (2M)). Trend: ~100×/10 years, followed by levelling off at the end of the period.]
Figure 1.3: Integer performance growth (in general - 1)
Source: X86-64 Technology White Paper, AMD Inc., Sunnyvale, CA, 2000
1.2. Evolution of processor performance (2)
Figure 1.4: Integer performance growth (in general - 2)
Source: F. Labonte, www-vlsi.stanford.edu/group/chart/specInf2000.pdf
1.2. Evolution of processor performance (3)
Figure 2.1: Efficiency of Intel processors
2.2. Growth of processor efficiency (1)
[Figure: SPECint_base2000/f_c (log scale, 0.01 to 1) vs. year (1978 to 2002) for the 286, 386DX, 486DX, Pentium, Pentium Pro, Pentium II and Pentium III. Trend: ~10×/10 years, levelling off with the 2nd generation superscalars.]
Figure 2.2: Growth of processor performance/efficiency (in general)
Source: J. Birnbaum, „Architecture at HP: Two decades of Innovation”, Microprocessor Forum, October 14, 1997.
2.2. Growth of processor efficiency (2)
2.3. Contribution of raising processor efficiency to the growth of processor performance
(up to the 2nd generation of superscalars)
Up to the 2nd generation of superscalars, raising the clock frequency and raising the efficiency contributed equally to the growth of performance.
$P_a = f_c \cdot IPC_{eff}$
~100×/10 years = ~10×/10 years · ~10×/10 years
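Because performance is the product of clock frequency and efficiency, their growth factors multiply. The sketch below only illustrates this arithmetic with the per-decade factors quoted above and converts them to annual factors; the function name is hypothetical.

```python
def annual_factor(decade_factor):
    """Annual growth factor that compounds to decade_factor over 10 years."""
    return decade_factor ** (1.0 / 10.0)


f_c_decade = 10.0      # ~10x per decade from raising the clock frequency
ipc_eff_decade = 10.0  # ~10x per decade from raising the efficiency
perf_decade = f_c_decade * ipc_eff_decade  # growth factors multiply

print(f"performance growth per decade: ~{perf_decade:.0f}x")
print(f"annual factor, clock frequency: {annual_factor(f_c_decade):.3f}")
print(f"annual factor, efficiency:      {annual_factor(ipc_eff_decade):.3f}")
print(f"annual factor, performance:     {annual_factor(perf_decade):.3f}")
```

A 10-fold-per-decade increase corresponds to roughly 26% per year, and a 100-fold-per-decade increase to roughly 58% per year.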
2.4. Sources of raising processor efficiency
• Increasing the word length: 8/16 → 32 bit (286 → 386DX)
• Introducing and increasing temporal parallelism: 1st and 2nd generation pipelined processors (386DX, 486DX)
• Introducing and increasing issue parallelism: 1st and 2nd generation superscalars (Pentium, Pentium Pro)
2.5. Limit of raising processor efficiency (1)
Figure 2.3: Processing width of 2nd generation (wide) superscalars vs extent of parallelism available in general purpose applications
[Figure annotation: processing width of 2nd generation (wide) superscalars: 4 RISC instructions/cycle ~ 3 CISC instructions/cycle]
Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990
[Figure: SPECint_base2000/f_c (log scale, 0.01 to 1) vs. year (1978 to 2005). Trend: ~10×/10 years up to the 2nd generation superscalars, then levelling off.]
Figure 2.4: Growth of processor efficiency (in general)
2.5. Limit of raising processor efficiency (2)
2.5. Limit of raising processor efficiency (3)
Beginning with 2nd generation (wide) superscalars, the sources of substantially raising processor efficiency became exhausted.
In general purpose applications:
The width of 2nd generation superscalars already approaches the extent of available instruction-level parallelism (ILP).
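3. Addressing the levelling off of processor efficiency
$P_a = f_c \cdot IPC_{eff}$
• Aggressively raising clock frequency ($f_c$): main road of evolution (Sections 4 – 7)
• Substantially widening the core (raising $IPC_{eff}$) by introducing EPIC architectures (Section 8)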
4.1. Sources of raising clock frequencies (1)
Raising clock frequency:
• By reducing the logic depth of pipeline stages
• By scaling down the feature size in the manufacturing process
Figure 4.1: Evolution of Intel’s process technology
Source: D. Bhandarkar: „The Dawn of a New Era”, 11. EMEA, May, 2006.
4.1. Sources of raising clock frequencies (2)
[Figure: number of pipeline stages (10 to 40) vs. year (1990 to 2005): Pentium (5), K6 (6), Athlon (6), Pentium Pro (~12), Pentium 4 (~20), Athlon-64 (12), P4 Prescott (~30), Core Duo, Conroe (14).]
Figure 4.2: Number of pipeline stages in Intel’s and AMD’s processors
4.1. Sources of raising clock frequencies (3)
Figure 4.3: Max. logic depth of pipeline stages in processors (in terms of FO4)
Source: F. Labonte www-vlsi.stanford.edu/group/chart/CycleFO4.pdf
4.1. Sources of raising clock frequencies (4)
Figure 4.4: Growth of clock frequencies in Intel’s x86 line of processors
4.2. Growth rate of clock frequencies (1)
[Figure: clock frequency f_c (MHz, log scale, 1 to 5000) vs. year of first volume shipment (1978 to 2005) for the 8088, 286, 386, 486, 486-DX2, 486-DX4, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4. Trend: ~10×/10 years at first, then ~100×/10 years, followed by levelling off.]
4.3. Implications of aggressively raising clock frequencies
4.3.1 Overview
• Ousting of major RISC families (4.3.2)
• Emerging limits of evolution (4.3.3)
Figure 4.6: The shift in performance leadership between RISC and x86 lines
4.3.2. Ousting of major RISC families (1)
1995-2000: CISCs took over the performance leadership, since it is a more intricate task to raise $f_c$ at the same rate from a higher value than from a lower one.
Cancellation of most major RISC lines, such as MIPS's R lines, HP's Alpha and PA lines, and the PowerPC Consortium's PowerPC line
4.3.2. Ousting of major RISC families (2)
1997: Intel and HP unveiled IA-64/Merced as the next generation architecture/processor line
4.3.3. Emerging limits of evolution
• The efficiency wall (Section 5)
• The thermal wall (Section 6)
• The skew wall (Section 7)
5.1. Overview (1)
Basic reason: the speed gap between the processor and the memory (it widens at higher frequencies)
5.1. Overview (2)
Main appearances of the speed gap between the processor and the memory:
• Memory transfer rates
• DRAM latencies
• Transfer rates of processor buses
• L2 cache latencies
5.2. Speed gap between processor and memory (1a)
Figure 5.1a: DRAM types
Abbreviations: DRAM: Dynamic RAM; FPM: Fast Page Mode DRAM; EDO: Extended Data Out DRAM; BEDO: Burst mode EDO; SDRAM: Synchronous DRAM; DRDRAM: Direct Rambus DRAM
Type | Cycle time within a burst (for a 60 ns part) | Full burst timing | Examples | Remarks
DRAM | (60 ns) | (5-5-5-5) | | Random access, typ. access time 60/70/80/100 ns
FPM | ~40 ns | (5-7)-3-3-3, (5-7)-4-4-4 | Triton I: 7-3-3-3; Triton III: 6-3-3-3 | Access to 4 subsequent columns
EDO | ~25 ns | (5-7)-2-2-2 | Triton I: 7-2-2-2; Triton II, III: 6-2-2-2 | Overlapping the read and address transfer operations
BEDO | ~15 ns | 5-1-1-1 | | Internal 2-bit address generator, dual banks; developed by Micron
SDRAM | ~15/10/7.5 ns | (5-7)-1-1-1 | Triton III: 7-1-1-1; 430 ZX: 7-1-1-1 | Fully pipelined operation, assuming at least dual banks; 66/100/133 MHz; since 1996
DRDRAM | (4/3.3/2.8/2.5 ns) | | 820, 840 | Cached structure: internal on-chip SRAM cache, page is filled in 1 clock cycle; 1-2 B wide data path; 256/300/356/400 MHz transfer rate; developed by Rambus
DRAM to BEDO: asynchronous, up to 66 MHz bus frequency. SDRAM, DRDRAM: synchronous. FPM and later: burst mode access (4*8B) on the same row (page). The level of overlapping increases from DRAM to DRDRAM.
[The max. bandwidth (MB/s) and effective bandwidth (MB/s) rows of the table did not survive.]
5.2. Speed gap between processor and memory (1b)
Figure 5.1b: Latency of DRAM chips
[Figure: row access time t_RAC (ns, 20 to 200) of typical DRAM parts vs. year (1981 to 2000), falling from 200 ns to 30 ns. Annotated with the processor (PC AT, 386 DX, 486 DX, P, PPro, PII, PIII), the chipset (430 NX, 450 KX/GX, 440 BX, 815) and the typical DRAM part sizes (16 K to 256 M).]
t_RAC: Row access time (time from row address until data valid)
5.2. Speed gap between processor and memory (1c)
Figure 5.1c: System-level memory latency in x86-based PCs
[Figure: system-level memory latency vs. year (1981 to 2000), shown both in ns (100 to 500) and in processor cycles (log scale, 1 to 1000), for the PC (8088), AT (286), 386 DX, 486 DX, P, PPro, PII, PIII and P4.]
5.2. Speed gap between processor and memory (1d)
Figure 5.1d: Latency of DRAM chips (in clock cycles)
[Figure: memory latency (cycles, 10 to 130) vs. f_c (GHz, 0.5 to 4.0) for the 386, 486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, with memory types FPM, EDO, PC 66, PC 100, PC 133, DDR 266, DDR 333, DDR 400, DDR2 533, RDRAM-40 and RDRAM-60.]
Figure 5.2: Relative transfer rate of memories (D: dual channel)
[Figure: T_memory/f_c (0.10 to 1.00) vs. f_c (GHz, 0.5 to 4.0) for the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, with memory types FPM, EDO, PC-66, PC-100, PC-133, DDR 266, PC-800D, DDR 333, DDR 333D, DDR 400, DDR 400D and DDR 533D.]
5.2. Speed gap between processor and memory (2)
Core         f_c max at intro. (GHz)   L2 size (Kbyte)   L2 latency (clock cycles)
Willamette   1.5                       128               7
Northwood    2.0                       512               16
Prescott     3.4                       1024              23
Figure 5.3: Latency of L2 caches
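Latencies given in clock cycles and in nanoseconds can be converted into each other via the clock frequency. The sketch below applies this to the L2 latencies listed for Figure 5.3. The function names and the DRAM latency value are assumptions made only for illustration.

```python
# (core, f_c in GHz at introduction, L2 latency in clock cycles), from Figure 5.3
L2_LATENCY = [
    ("Willamette", 1.5, 7),
    ("Northwood", 2.0, 16),
    ("Prescott", 3.4, 23),
]


def cycles_to_ns(cycles, f_c_ghz):
    """Convert a latency in clock cycles to ns (at 1 GHz one cycle takes 1 ns)."""
    return cycles / f_c_ghz


def ns_to_cycles(latency_ns, f_c_ghz):
    """Convert a latency in ns to clock cycles at the given clock frequency."""
    return latency_ns * f_c_ghz


for core, f_c, cycles in L2_LATENCY:
    print(f"{core}: {cycles} cycles at {f_c} GHz = {cycles_to_ns(cycles, f_c):.2f} ns")

# Hypothetical DRAM latency, used only to illustrate the widening speed gap:
# a fixed latency in ns costs more cycles as the clock frequency rises.
dram_latency_ns = 50.0
for f_c in (1.0, 2.0, 3.0):
    print(f"{dram_latency_ns} ns at {f_c} GHz = {ns_to_cycles(dram_latency_ns, f_c):.0f} cycles")
```

The second loop shows why the gap widens at higher frequencies: the same memory latency in ns translates into proportionally more lost processor cycles.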
5.2. Speed gap between processor and memory (3)
Figure 5.4: Relative transfer rates of processor buses
[Figure: T_pb/f_c (0.10 to 1.00) vs. f_c (GHz, 0.5 to 4.0) for the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, with processor bus rates of 66, 100, 133, 400, 533, 800 and 1066 MHz.]
5.2. Speed gap between processor and memory (4)
5.3. Efficiency of 3rd generation superscalars (1)
Figure 5.5: Efficiency of Intel's Pentium III and Pentium 4 processors in general purpose applications
[Figure: SPECint_base2000/f_c (0.30 to 0.55) vs. f_c (GHz, 0.5 to 4.0). Pentium III cores: Katmai (512K dir. L2), Coppermine (256K on-die L2). Pentium 4 cores: Willamette (256K on-die L2), Northwood A, B and C (512K on-die L2), Prescott (1M) (1M on-die L2), Prescott (2M) (2M on-die L2), Irwindale (512K on-die L2, 2M on-die L3). The curves are annotated with the FSB rate (100 to 800 MHz), the memory type (PC-100, PC-133, PC-800 RDRAM, PC-2667, PC-3200, PC-4300), the disk interface (SCSI-U2W, ATA-66, ATA-100, SATA-150) and hyperthreading (HT).]
Figure 5.6: Efficiency of AMD's Athlon, Athlon XP and Athlon 64 processors in general purpose applications
[Figure: SPECint_base2000/f_c (0.30 to 0.65) vs. f_c (GHz, 1.0 to 4.0). Athlon cores: K7 (512K dir. L2, note 1), K75 (512K dir. L2, notes 2 and 3), Thunderbird (256K on-die L2). Athlon XP cores: Palomino (256K on-die L2), Thoroughbred (256K on-die L2), Barton (512K on-die L2). Athlon 64 core: Clawhammer (1M on-die L2, f_FSB = f_memory). The curves are annotated with the FSB rate (200 to 400 MHz), the memory type (PC-100, PC-133, PC-2100, PC-2700, PC-3200) and the disk interface (ATA-66, ATA-100, ATA-133).]
Note 1: f_L2 = 0.5*f_c
Note 2: f_L2 = 0.4*f_c (f_c = 750/800/850 MHz)
Note 3: f_L2 = 0.3*f_c (f_c = 900/950/1000 MHz)
5.3. Efficiency of 3rd generation superscalars (2)
Figure 5.7: Main aspects of the memory subsystem affecting core efficiency
[Figure: core efficiency vs. f_c (GHz). Core efficiency decreases due to the broadening of the memory gap, and increases primarily due to enhancing the memory subsystem (memory, FSB, L2).]
5.3. Efficiency of 3rd generation superscalars (3)
Figure 5.8: Contrasting the efficiency of Intel's and AMD's processors
[Figure: SPECint_base2000/f_c (0.30 to 0.65) vs. f_c (GHz, 0.5 to 4.0) for the Pentium III, Pentium IV, Athlon, Athlon XP and Athlon 64. The curves are labelled by L2 size/FSB rate, e.g. 512K/100, 256K/100, 256K/200, 512K/200, 256K/266, 512K/333, 256K/400, 512K/400, 512K/533, 512K/800, 1M/800, 2M/800 and 1M/f_FSB.]
5.3. Efficiency of 3rd generation superscalars (4)
Figure 5.9: Contrasting Intel’s and AMD’s processor design philosophies
[Figure: SPECint_base2000/f_c (0.35 to 0.80) vs. f_c (GHz, 0.5 to 4.0) for the Pentium III, Pentium IV, Athlon, Athlon XP, Athlon 64 and Pentium M (2M/400), with curves labelled by L2 size/FSB rate. Two regions are marked: designs preferring core efficiency and designs preferring clock frequency.]
5.3. Efficiency of 3rd generation superscalars (5)
Implication of the emerging efficiency wall:
Diminishing returns at higher clock frequencies
5.3. Efficiency of 3rd generation superscalars (6)
6. The thermal wall (1)
Dissipation (D) has a dynamic and a static component:
Dynamic: $D_d = A \cdot C \cdot V^2 \cdot f_c$
Static: $D_s = V \cdot I_{leak}$
with
A: ratio of the active gates
C: effective capacitance of the gates
V: supply voltage
$f_c$: clock frequency
$I_{leak}$: leakage current
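The two dissipation components can be evaluated directly from their definitions. The sketch below is illustrative only: the function names and all numeric values (activity ratio, capacitance, voltage, leakage current) are hypothetical and do not describe any particular processor.

```python
def dynamic_dissipation(a, c, v, f_c):
    """Dynamic dissipation D_d = A * C * V^2 * f_c, in watts."""
    return a * c * v ** 2 * f_c


def static_dissipation(v, i_leak):
    """Static dissipation D_s = V * I_leak, in watts."""
    return v * i_leak


# Hypothetical operating point.
A = 0.2          # ratio of the active gates
C = 50e-9        # effective capacitance of the gates, in farads
V = 1.4          # supply voltage, in volts
F_C = 3.0e9      # clock frequency, in Hz
I_LEAK = 10.0    # leakage current, in amperes

d_d = dynamic_dissipation(A, C, V, F_C)
d_s = static_dissipation(V, I_LEAK)
print(f"D_d = {d_d:.1f} W, D_s = {d_s:.1f} W, D = {d_d + d_s:.1f} W")

# Dynamic dissipation scales linearly with f_c and quadratically with V.
print(f"2x clock frequency:  {dynamic_dissipation(A, C, V, 2 * F_C) / d_d:.2f}x D_d")
print(f"10% lower voltage:   {dynamic_dissipation(A, C, 0.9 * V, F_C) / d_d:.2f}x D_d")
```

The quadratic dependence on the supply voltage is why lowering the voltage is a more effective lever than lowering the clock frequency alone.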
6. The thermal wall (2)
Figure 6.1: Chip dynamic and static power dissipation trends
Source: N. S. Kim et al., „Leakage Current: Moore’s Law Meets Static Power”, Computer, Dec. 2003, pp. 68-75.
Figure 6.2: Dynamic and static power dissipation trends
Source: Solie D., „Technology Trends”, Aug. 2006, http://www-03.ibm.com/procurement/proweb.nsf/objectdocswebview/file14+-+darryl+solie+-+ibm+power+symposium+presentation/$file/14+-+darryl+solie-ibm-power+symposium+presentation+v2.pdf
Figure 6.3: Relative dissipation of Intel’s x86 family of processors
[Figure: dissipation per die area (W/cm², log scale, 2 to 100) vs. f_c (MHz, log scale, 20 to 5000) for the cores P5, P54C, P54CS, P6, Klamath, Deschutes, Katmai, Coppermine, Tualatin, Willamette, Northwood and Prescott, annotated with feature sizes from 0.8 μ down to 0.09 μ.]
6. The thermal wall (3)
Figure 6.4: Contrasting the evolution of Intel’s and AMD’s processor lines with the thermal wall
[Figure: SPECint_base2000/f_c (0.35 to 0.80) vs. f_c (GHz, 0.5 to 4.0) for the Pentium III, Pentium IV, Athlon, Athlon XP, Athlon 64 and Pentium M (2M/400), with curves labelled by L2 size/FSB rate. The thermal wall is marked at the high-frequency end. A further annotation reads: core design, technology.]
6. The thermal wall (4)
[Figure: timeline (2000 to 2005) of the Pentium 4 family cores in four lines: Xeon MP line, Xeon DP line, Desktop line and Celeron line (value PCs). Cores shown: Willamette, Northwood-A, Northwood-B, Northwood-C, Prescott, Prescott-F and Tejas (cancelled 5/04); the Extreme Edition cores Irwindale-A and Irwindale-B; Foster, Prestonia-A, Prestonia-B, Prestonia-C, Nocona, Irwindale-C and Jayhawk (cancelled 5/04); Foster-MP, Gallatin and Potomac; Willamette-128, Northwood-128 and Celeron-D. Each core is annotated with its introduction date, feature size (0.18, 0.13 or 0.09 μ), transistor count (42 to 286 mtrs), FSB rate (400 to 1066 MHz), clock frequencies (1.4 to 3.8 GHz), on-die L2 and L3 sizes, and socket (PGA 423, PGA 478, PGA 603, PGA 604, LGA 775). The legend distinguishes cores supporting hyperthreading, cores with EM64T implemented but not enabled, and cores supporting EM64T.]
Figure 6.5: Intel’s P4 processor family (Netburst architecture)
6. The thermal wall (5)
Figure 6.6: The growth of relative dissipation of processors (in general)
Source: R. Hetherington, „The UltraSPARC T1 Processor”, White Paper, Sun Inc., 2005
6. The thermal wall (6)
Implications of the thermal wall:
6. The thermal wall (7)
• Processor designs now focus more and more on power-aware techniques.
• The approach to increase performance by aggressively raising clock frequency met the thermal wall.
Figure 7.2: Equalizing skews among different bit lines of the processor bus on the MSI 915G Combo motherboard
7. The skew wall (2)
7. The skew wall (3)
Implication of emerging skews between bit lines of parallel buses:
Introducing serial buses (also in slow peripheral buses due to impressive cost savings)
Figure 7.3: Signal transfer over a serial bus
[Figure: differential signal pair D+ and D-, showing the encoding of "0" and "1"]
Implication of emerging limits of evolution:
The approach to aggressively raise clock frequencies met the efficiency, thermal and skew walls and thus reached a dead end.
8. EPIC architectures/processors (1)
$P_a = f_c \cdot IPC_{eff}$
• Aggressively raising clock frequency ($f_c$): main road of evolution (Sections 4 – 7)
• Substantially widening the core (raising $IPC_{eff}$) by introducing EPIC architectures (Section 8)
Principle of superscalar processing: dependent instructions are passed to the processor, which performs dynamic dependency resolution before forwarding them to the FEs.
Principle of VLIW processing: independent instructions (static dependency resolution) are passed to the processor and forwarded to the FEs.
VLIW: Very Long Instruction Word
Figure 8.1: Contrasting the principles of operation of superscalar and VLIW processors
8. EPIC architectures/processors (2)
VLIW → EPIC
EPIC: Explicitly Parallel Instruction Computing
Enhanced VLIW (integration of advanced superscalar features):
• branch prediction
• explicit cache control
8. EPIC architectures/processors (3)
1994: Intel, HP
1997: EPIC designation
2001: IA-64 Itanium
[Figure: timeline (2001 to 2005) of Itanium cores, split into a multiprocessor (MP) line and a dual processor (DP) line. Itanium (Merced, 5/01): 0.18 μ, 25 mtrs, 64-bit FSB at 266 MT/s, 733/800 MHz, 96K L2, 2/4M L3. Itanium 2 (McKinley, 7/02): 0.18 μ, 220 mtrs, 128-bit FSB at 400 MT/s, 800/1000 MHz, 256K L2, 1.5/3M L3. Itanium 2 (Madison, several models between 6/03 and 7/05): 0.13 μ, 410 to 592 mtrs, 128-bit FSB at 400 to 667 MT/s, 1.4 to 1.66 GHz, 256K L2, 1.5M to 9M L3.]
Figure 8.2: Overview of Itanium cores
8. EPIC architectures/processors (4)
[Figure: SPECint_base2000/f_c (0.4 to 1.0) vs. f_c (MHz, 500 to 2000). Itanium (64-bit FSB/266 MT/s): 96K L2/2M dir. L3 and 96K L2/4M dir. L3. Itanium 2 (128-bit FSB/400 MT/s): 256K L2/3M L3/DDR 266, 256K L2/6M L3/DDR 266 and 256K L2/9M L3/DDR 266.]
Figure 8.3: The efficiency of Itanium processors
8. EPIC architectures/processors (5)
Figure 8.4: Expected spreading of the IA-64 architecture (Itanium processors)
Source: L. Gwennap: Intel’s Itanium and IA-64: Technology and Market Forecast, MDR, 2000
8. EPIC architectures/processors (6)
Figure 8.5: Revenue expectations concerning Intel’s Itanium line
8. EPIC architectures/processors (7)
In general purpose applications, EPIC architectures/processors play a decreasing role.
8. EPIC architectures/processors (8)
9. The end of an era in processor evolution (1)
In general purpose applications, processor efficiency leveled off beginning with the 2nd generation superscalars. Both approaches to address the leveling off of efficiency met limits of evolution and thus reached a dead end.
Single-core complex superscalars: at the end of an era
9. The end of an era in processor evolution (2)
A new era in processor evolution: the dawn of multicore, multithreaded processors
Available hardware complexity continues to increase exponentially (Moore’s law): complexity doubles every ~24 months.
The number of processors will also double every ~24 months.
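The doubling rule can be turned into a simple projection. The sketch below assumes a constant doubling period of 24 months, as stated above; the function name and the starting value of 2 processors are hypothetical and only illustrate the arithmetic.

```python
def growth_factor(months, doubling_period_months=24.0):
    """Growth factor after the given time if the quantity doubles every period."""
    return 2.0 ** (months / doubling_period_months)


# Growth of hardware complexity over one decade (120 months).
decade_growth = growth_factor(120)
print(f"complexity growth in 10 years: {decade_growth:.0f}x")

# Hypothetical starting point: 2 processors per chip today.
start_processors = 2
for years in (2, 4, 6):
    projected = start_processors * growth_factor(12 * years)
    print(f"after {years} years: {projected:.0f} processors")
```

Doubling every 24 months amounts to a 32-fold increase per decade, which is below the ~100-fold-per-decade performance growth of the previous era.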