a new era in processor evolution dezső sima fall 2007 (ver. 2.2) dezső sima, 2007

A New Era in Processor Evolution

Dezső Sima

Fall 2007

(Ver. 2.2) Dezső Sima, 2007

Foreword

Beginning with second-generation superscalars, the continuous, approximately 10-fold-per-decade increase in processor efficiency leveled off, for reasons shown in Chapter I. Designers responded by aggressively raising clock frequencies, at up to a 100-fold-per-decade rate, in order to sustain an approximately 100-fold-per-decade performance increase. Such rapid progress, however, inevitably ran into limits: declining processor efficiency, increasing dissipation, and skew in parallel buses, as shown in this chapter. As a consequence, a decade-long era of processor evolution, characterized by aggressively rising clock frequencies, has ended in the last few years. The new era is heralded by multicore and multithreaded designs, as discussed in Chapters III and IV.

Contents

1. Processor performance

2. Efficiency of processors

3. Addressing the levelling off of processor efficiency

4. Aggressively raising clock frequency

5. The efficiency wall

6. The thermal wall

7. The skew wall

8. EPIC architectures/processors

9. The end of an era in processor evolution

1. Processor Performance

1.1. Introduction (1)

Absolute performance

Number of successfully executed instructions/sec:

$$P_{ai} = f_c \cdot IPC_{eff}$$

Number of successfully executed operations/sec (SIMD):

$$P_{ao} = f_c \cdot IPC_{eff} \cdot OPI$$

where $f_c$: clock frequency, $IPC$: instructions per cycle, $OPI$: operations per instruction, and

$$IPC_{eff} = \eta \cdot IPC$$

with $IPC$: issued instructions per cycle and $\eta$: ratio of successfully executed to issued instructions (efficiency of the speculative execution).

In general purpose applications $OPI \approx 1$, so that

$$P_a = P_{ai} = f_c \cdot IPC_{eff}$$

Relative performance

Relating the execution times of a benchmark program on the tested system (index $v$) to those on a reference system (index $ref$) according to the following interpretation:

$$P_r = \sqrt[n]{\frac{t_{ref_1}}{t_{v_1}} \cdot \ldots \cdot \frac{t_{ref_n}}{t_{v_n}}}$$

where $n$ is the number of component programs of the benchmark.

E.g.: SPECint92, SPECint_base2000
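The absolute-performance definitions above can be sketched in a few lines of Python. This is only an illustration: the function name and the numeric inputs (2 GHz, 1.5 issued instructions per cycle, η = 0.8) are hypothetical and do not come from the slides.

```python
def absolute_performance(f_c_hz, ipc_issued, eta, opi=1.0):
    """Return (P_ai, P_ao) in instructions/s and operations/s.

    f_c_hz     : clock frequency f_c in Hz
    ipc_issued : issued instructions per cycle (IPC)
    eta        : ratio of successfully executed to issued instructions
    opi        : operations per instruction (about 1 in general purpose code)
    """
    ipc_eff = eta * ipc_issued   # IPC_eff = eta * IPC
    p_ai = f_c_hz * ipc_eff      # P_ai = f_c * IPC_eff
    p_ao = p_ai * opi            # P_ao = f_c * IPC_eff * OPI
    return p_ai, p_ao


# Hypothetical processor, chosen only to exercise the formulas.
p_ai, p_ao = absolute_performance(f_c_hz=2.0e9, ipc_issued=1.5, eta=0.8)
print(f"P_ai = {p_ai:.3e} instructions/s, P_ao = {p_ao:.3e} operations/s")
```

With OPI left at its default of 1, the two figures coincide, which mirrors the general purpose case in which P_a = P_ai.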

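In performance/efficiency studies:

Theoretical interpretation: $P_a$

Practical measurement: $P_r$

$$\frac{P_{a2}}{P_{a1}} \overset{?}{=} \frac{P_{r2}}{P_{r1}}$$

If the following were true (with $t_{ref}$ and $t_v$ the execution times on the reference system and on the tested system):

$$\frac{t_{ref_1}}{t_{v_1}} = \frac{t_{ref_2}}{t_{v_2}} = \ldots = \frac{t_{ref_n}}{t_{v_n}} = \frac{t_{ref}}{t_v}$$

In that case:

$$P_r = \frac{t_{ref}}{t_v}$$

$$\frac{P_{r2}}{P_{r1}} = \frac{t_{ref}/t_{v2}}{t_{ref}/t_{v1}} = \frac{t_{v1}}{t_{v2}} = \frac{I/P_{a1}}{I/P_{a2}} = \frac{P_{a2}}{P_{a1}}$$

I: Number of instructions in the application considered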

However:

Figure 1.1.: Runtime ratios of the component programs of SPECint2000

Source: http://www.spec.org


When comparing the performance of two systems:

$$\frac{P_{a2}}{P_{a1}} \approx \frac{P_{r2}}{P_{r1}}$$

This estimate is usable in trend considerations.
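A minimal sketch of how a relative performance figure of this kind can be computed, assuming it is the geometric mean of the per-program runtime ratios (as in the SPEC suites). The runtimes below are invented for illustration only.

```python
import math


def relative_performance(t_ref, t_sys):
    """Geometric mean of the runtime ratios t_ref[i] / t_sys[i]."""
    if len(t_ref) != len(t_sys) or not t_ref:
        raise ValueError("need two non-empty runtime lists of equal length")
    ratios = [r / t for r, t in zip(t_ref, t_sys)]
    # Averaging the logarithms gives the n-th root of the product.
    return math.exp(sum(math.log(x) for x in ratios) / len(ratios))


# Hypothetical runtimes in seconds for three component programs.
t_ref = [1400.0, 1800.0, 1100.0]   # reference system
t_sys1 = [700.0, 600.0, 550.0]     # tested system 1
t_sys2 = [350.0, 400.0, 275.0]     # tested system 2

p_r1 = relative_performance(t_ref, t_sys1)
p_r2 = relative_performance(t_ref, t_sys2)
print(f"P_r1 = {p_r1:.3f}, P_r2 = {p_r2:.3f}, P_r2/P_r1 = {p_r2 / p_r1:.3f}")
```

In this made-up data set the speedup of system 2 over system 1 differs from program to program (2x, 1.5x, 2x), so the ratio of the two relative performance figures is only an averaged estimate of the ratio of absolute performances.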


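Comparing the efficiency of two systems:

$$\frac{IPC_{eff2}}{IPC_{eff1}} = \frac{P_{a2}/f_{c2}}{P_{a1}/f_{c1}} = \frac{P_{a2}}{P_{a1}} \cdot \frac{f_{c1}}{f_{c2}} \approx \frac{P_{r2}}{P_{r1}} \cdot \frac{f_{c1}}{f_{c2}} = \frac{P_{r2}/f_{c2}}{P_{r1}/f_{c1}}$$

that is,

$$\frac{IPC_{eff2}}{IPC_{eff1}} \approx \frac{P_{r2}/f_{c2}}{P_{r1}/f_{c1}}$$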
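The efficiency comparison can be illustrated with a short Python sketch that divides each relative performance score by the corresponding clock frequency. The scores and frequencies used here are hypothetical and are not measurements from the slides.

```python
def efficiency_ratio(p_r1, f_c1, p_r2, f_c2):
    """Estimate IPC_eff2 / IPC_eff1 as (P_r2 / f_c2) / (P_r1 / f_c1)."""
    return (p_r2 / f_c2) / (p_r1 / f_c1)


# Hypothetical systems: system 2 scores 2.25x higher at 3x the clock.
ratio = efficiency_ratio(p_r1=400.0, f_c1=1000.0, p_r2=900.0, f_c2=3000.0)
print(f"IPC_eff2 / IPC_eff1 is roughly {ratio:.2f}")
```

A ratio below 1, as in this example, means that the faster-clocked system does less useful work per cycle even though its overall score is higher.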

1.2. Evolution of processor performance (1)

Figure 1.2: Integer performance growth of Intel’s x86 processors

[Plot: SPECint92 (logarithmic scale, 0.2 to 10000) versus year (1979 to 2005) for Intel's x86 processors, from the 8088/5, 80286, 386, 486 and Pentium models through the Pentium Pro/200, PII, PIII and P4 models. Trend line: ~100×/10 years, levelling off in 2003 to 2005 (labelled cores: Northwood B, Prescott (1M), Prescott (2M)).]

Figure 1.3: Integer performance growth (in general - 1)

Source: X86-64 Technology White Paper, AMD Inc., Sunnyvale, CA, 2000


Figure 1.4: Integer performance growth (in general - 2)

Source: F. Labonte, www-vlsi.stanford.edu/group/chart/specInf2000.pdf


2. Efficiency of processors

2.1. Introduction

$$P_a = f_c \cdot IPC_{eff}$$

$P_a$ has grown by ~100×/10 years; how has $IPC_{eff}$ grown?

2.2. Growth of processor efficiency (1)

Figure 2.1: Efficiency of Intel processors

[Plot: SPECint_base2000/f_c (logarithmic scale, 0.01 to 1) versus year (1978 to 2002) for the 286, 386DX, 486DX, Pentium, Pentium Pro, Pentium II and Pentium III. Trend line: ~10×/10 years, levelling off with the 2nd generation superscalars.]

2.2. Growth of processor efficiency (2)

Figure 2.2: Growth of processor performance/efficiency (in general)

Source: J. Birnbaum, "Architecture at HP: Two Decades of Innovation", Microprocessor Forum, October 14, 1997.

2.3. Contribution of raising processor efficiency to the growth of processor performance

(up to the 2nd generation of superscalars)

Up to the second generation of superscalars, raising the clock frequency and raising the efficiency contributed in equal proportion to the growth of performance.

$$P_a = f_c \cdot IPC_{eff}$$

$P_a$: ~100×/10 years, $f_c$: ~10×/10 years, $IPC_{eff}$: ~10×/10 years
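The arithmetic behind these growth rates can be checked with a small Python sketch: because performance is the product of clock frequency and efficiency, the per-decade factors multiply. The helper name is illustrative, and the per-year figures are derived here, not quoted from the slides.

```python
def annual_factor(factor_per_decade, years=10):
    """Convert a growth factor over `years` into an equivalent per-year factor."""
    return factor_per_decade ** (1.0 / years)


f_c_growth = 10.0   # clock frequency: ~10x per 10 years
eff_growth = 10.0   # efficiency (IPC_eff): ~10x per 10 years
perf_growth = f_c_growth * eff_growth   # P_a = f_c * IPC_eff, so ~100x per 10 years

print(f"performance: {perf_growth:.0f}x per decade, "
      f"about {annual_factor(perf_growth):.3f}x per year")
print(f"clock or efficiency alone: about {annual_factor(f_c_growth):.3f}x per year")
```

Under these assumptions a 100-fold increase per decade corresponds to roughly a 58% improvement per year, and each of the two 10-fold contributions to roughly 26% per year.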

2.4. Sources of raising processor efficiency

• Increasing the word length: 8/16 → 32 bit (286 → 386DX)

• Introducing and increasing temporal parallelism: 1st and 2nd generation pipelined processors (386DX, 486DX)

• Introducing and increasing issue parallelism: 1st and 2nd generation superscalars (Pentium, Pentium Pro)

2.5. Limit of raising processor efficiency (1)

Processing width of 2nd generation (wide) superscalars: 4 RISC instructions/cycle, or ~3 CISC instructions/cycle

Figure 2.3: Processing width of 2nd generation (wide) superscalars vs extent of parallelism available in general purpose applications

Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

Figure 2.4: Growth of processor efficiency (in general)

[Plot: SPECint_base2000/f_c (logarithmic scale, 0.01 to 1) versus year (1978 to 2005). Trend line: ~10×/10 years, levelling off with the 2nd generation superscalars.]



Beginning with 2nd generation (wide) superscalars, the sources of extensively raising processor efficiency became exhausted.

In general purpose applications, the width of 2nd generation superscalars already approaches the extent of available instruction-level parallelism (ILP).

3. Addressing the levelling off of processor efficiency

$$P_a = f_c \cdot IPC_{eff}$$

• Aggressively raising clock frequency: the main road of evolution (Sections 4 – 7)

• Essentially widening the core by introducing EPIC architectures (Section 8)

4. Aggressively raising clock frequency

4.1. Sources of raising clock frequencies (1)

Raising clock frequency:

• By scaling down the feature size in the manufacturing process

• By reducing the logic depth of pipeline stages

Figure 4.1: Evolution of Intel’s process technology

Source: D. Bhandarkar, "The Dawn of a New Era", 11. EMEA, May 2006.


Figure 4.2: Number of pipeline stages in Intel's and AMD's processors

[Plot: number of pipeline stages (0 to 40) versus year (1990 to 2005). Labelled points, with stage counts as given on the slide: Pentium (5), K6 (6), Athlon (6), Pentium Pro (~12), Pentium 4 (~20), Athlon-64 (12), P4 Prescott (~30), Conroe (14), Core Duo (no count recoverable).]


Figure 4.3: Max. logic depth of pipeline stages in processors (in terms of FO4)

Source: F. Labonte www-vlsi.stanford.edu/group/chart/CycleFO4.pdf


4.2. Growth rate of clock frequencies (1)

Figure 4.4: Growth of clock frequencies in Intel's x86 line of processors

[Plot: clock frequency f_c in MHz (logarithmic scale, 1 to 5000) versus year of first volume shipment (1978 to 2005), covering the 8088, 286, 386, 486, 486-DX2, 486-DX4, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4. Trend lines: ~10×/10 years, followed by ~100×/10 years, followed by levelling off.]

Figure 4.5: Growth of clock frequencies (in general)

4.2. Growth rate of clock frequencies (2)

4.3. Implications of aggressively raising clock frequencies

4.3.1 Overview

• Ousting of major RISC families (4.3.2)

• Emerging limits of evolution (4.3.3)

Figure 4.6: The shift in performance leadership between RISC and x86 lines

4.3.2. Ousting of major RISC families (1)

1995-2000: CISCs took over the performance leadership, since raising f_c at the same rate is a more intricate task when starting from a higher value than from a lower one.

Cancelling of most major RISC lines, such as MIPS's R lines, HP's Alpha and PA lines, and the PowerPC Consortium's PowerPC line.

4.3.2. Ousting of major RISC families (2)

1997: Intel and HP unveiled IA-64/Merced as the next generation architecture/processor line

4.3.3. Emerging limits of evolution

• The efficiency wall (Section 5)

• The thermal wall (Section 6)

• The skew wall (Section 7)

5. The efficiency wall

5.1. Overview (1)

Basic reason: the speed gap between the processor and the memory (it widens at higher frequencies).

5.1. Overview (2)

Main appearances of the speed gap between the processor and the memory:

• DRAM latencies

• Memory transfer rates

• L2 cache latencies

• Transfer rates of processor buses

5.2. Speed gap between processor and memory (1a)

Figure 5.1a: DRAM types

DRAM (Dynamic RAM), asynchronous
  Cycle time within a burst (for a 60 ns part): (60 ns)
  Full burst timing: (5-5-5-5)
  Remarks: random access, typ. access time 60/70/80/100 ns

FPM (Fast Page Mode DRAM), asynchronous
  Cycle time within a burst: ~40 ns
  Full burst timing: (5-7)-3-3-3, (5-7)-4-4-4
  Examples: Triton I: 7-3-3-3, Triton III: 6-3-3-3
  Remarks: access to 4 subsequent columns

EDO (Extended Data Out DRAM), asynchronous
  Cycle time within a burst: ~25 ns
  Full burst timing: (5-7)-2-2-2
  Examples: Triton I: 7-2-2-2, Triton II, III: 6-2-2-2
  Remarks: overlapping the read and address transfer operations

BEDO (Burst mode EDO), asynchronous
  Cycle time within a burst: ~15 ns
  Full burst timing: 5-1-1-1
  Remarks: internal 2-bit address generator, dual banks; developed by Micron

SDRAM (Synchronous DRAM), synchronous
  Cycle time within a burst: ~15/10/7.5 ns
  Full burst timing: (5-7)-1-1-1
  Examples: Triton III: 7-1-1-1, 430 ZX: 7-1-1-1
  Remarks: fully pipelined operation, assuming at least dual banks; 66/100/133 MHz

DRDRAM (Direct Rambus DRAM), synchronous
  Cycle time within a burst: (4/3.3/2.8/2.5 ns)
  Examples: 820, 840
  Remarks: internal on-chip SRAM cache, page is filled in 1 clock cycle; 1-2 B wide data path; 256/300/356/400 MHz transfer rate; developed by Rambus

Further labels on the slide whose column assignment is not recoverable: "Burst mode access (4*8B) on the same row (page)", "Up to 66 MHz bus frequency", "Level of overlapping", "Since 1996", "Cached structure". The rows "Max. bandwidth MB/s" and "Effective bandwidth MB/s" carry no recoverable values.

5.2. Speed gap between processor and memory (1b)

Figure 5.1b: Latency of DRAM chips

[Plot: row access time tRAC in ns (0 to 200) versus year (1981 to 2000), annotated with the contemporary processors and chipsets (PC, AT, 386 DX, 486 DX, Pentium with 430 NX, Pentium Pro with 450 KX/GX, PII with 440 BX, PIII with 815) and typical DRAM part sizes (16 K to 256 M). The data labels fall from 200 ns to 30 ns.]

tRAC: Row access time (time from row address until data valid)

5.2. Speed gap between processor and memory (1c)

Figure 5.1c: System-level memory latency in x86-based PCs

[Plot: memory latency in ns (0 to 500) and memory latency in processor cycles (logarithmic scale, 1 to 1000) versus year (1981 to 2000), for the PC (8088), AT (286), 386 DX, 486 DX, Pentium, Pentium Pro, PII, PIII and P4. Two curves: latency in ns and latency in processor cycles.]

5.2. Speed gap between processor and memory (1d)

Figure 5.1d: Latency of DRAM chips (in clock cycles)

[Plot: memory latency in processor clock cycles (10 to 130) versus clock frequency f_c (0.5 to 4.0 GHz), for the 386, 486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4 with FPM, EDO, PC 66, PC 100, PC 133, DDR 266, DDR 333, DDR 400, DDR2 533, RDRAM-40 and RDRAM-60 memories.]
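Why a roughly constant DRAM latency in nanoseconds turns into a growing latency in processor cycles can be shown with a one-line conversion. The sketch below assumes a fixed, hypothetical 100 ns system-level latency; it does not reproduce the data points of the figure.

```python
def latency_in_cycles(latency_ns, f_c_ghz):
    """Latency in processor clock cycles: nanoseconds times cycles per nanosecond."""
    return latency_ns * f_c_ghz


# Hypothetical fixed memory latency of 100 ns at increasing clock frequencies.
for f_c_ghz in (0.5, 1.0, 2.0, 3.0):
    cycles = latency_in_cycles(100.0, f_c_ghz)
    print(f"f_c = {f_c_ghz:.1f} GHz -> {cycles:.0f} cycles")
```

Under this assumption, raising the clock frequency six-fold also raises the number of cycles the core waits for memory six-fold, unless the memory subsystem itself is improved.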

Figure 5.2: Relative transfer rate of memories (D: dual channel)

[Plot: relative memory transfer rate T_memory/f_c (0.10 to 1.00) versus clock frequency f_c (0.5 to 4.0 GHz), for the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4 with FPM, EDO, PC-66, PC-100, PC-133, DDR 266, DDR 333, DDR 333D, DDR 400, DDR 400D, DDR 533D and PC-800D memories.]

5.2. Speed gap between processor and memory (2)

Core         f_c max at intro. (GHz)   L2 size (Kbyte)   L2 latency (clock cycles)
Willamette   1.5                       128               7
Northwood    2.0                       512               16
Prescott     3.4                       1024              23

Figure 5.3: Latency of L2 caches
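To relate the cycle counts in the table to wall-clock time, the latency can be converted to nanoseconds. The sketch below uses the table's values and assumes, as a simplification, that each core runs at its maximum introductory clock frequency.

```python
def l2_latency_ns(latency_cycles, f_c_ghz):
    """Convert an L2 latency from clock cycles to nanoseconds."""
    return latency_cycles / f_c_ghz


# (core, f_c max at intro. in GHz, L2 latency in clock cycles) from the table.
cores = [
    ("Willamette", 1.5, 7),
    ("Northwood", 2.0, 16),
    ("Prescott", 3.4, 23),
]

for name, f_c_ghz, cycles in cores:
    print(f"{name}: {cycles} cycles, about {l2_latency_ns(cycles, f_c_ghz):.2f} ns")
```

The point for core efficiency is the cycle count: even where the latency in nanoseconds stays in the same range, the number of cycles the pipeline has to bridge more than triples from Willamette to Prescott.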


Figure 5.4: Relative transfer rates of processor buses

[Plot: relative processor bus transfer rate T_pb/f_c (0.10 to 1.00) versus clock frequency f_c (0.5 to 4.0 GHz), for the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4. Bus labels: 66, 100, 133, 400, 533, 800, 1066.]


5.3. Efficiency of 3rd generation superscalars (1)

Figure 5.5: Efficiency of Intel's Pentium III and Pentium 4 processors in general purpose applications

[Plot: SPECint_base2000/f_c (0.30 to 0.55) versus clock frequency f_c (0.5 to 4.0 GHz). Pentium III cores: Katmai (512K dir. L2), Coppermine (256K on-die L2). Pentium 4 cores: Willamette (256K on-die L2), Northwood A, B and C (512K on-die L2), Prescott (1M on-die L2), Prescott (2M on-die L2), Irwindale (512K on-die L2, 2M on-die L3). Each curve is annotated with its test platform: FSB rate (100 to 800 MHz), memory type (PC-100, PC-133, PC-800 RDRAM, PC-2667, PC-3200, PC-4300) and disk interface (SCSI-U2W, ATA-66, ATA-100, SATA-150); HT marks hyperthreading.]

Figure 5.6: Efficiency of AMD's Athlon, Athlon XP and Athlon 64 processors in general purpose applications

[Plot: SPECint_base2000/f_c (0.30 to 0.65) versus clock frequency f_c (1.0 to 4.0 GHz). Athlon cores: K7 (512K dir. L2, note 1), K75 (512K dir. L2, notes 2 and 3), Thunderbird (256K on-die L2). Athlon XP cores: Palomino (256K on-die L2), Thoroughbred (256K on-die L2), Barton (512K on-die L2). Athlon 64 core: Clawhammer (1M on-die L2, f_FSB = f_memory). Each curve is annotated with its test platform: FSB rate (200 to 400 MHz), memory type (PC-100, PC-133, PC-2100, PC-2700, PC-3200) and disk interface (ATA-66, ATA-100, ATA-133).]

Notes: 1: f_L2 = 0.5*f_c. 2: f_L2 = 0.4*f_c (f_c = 750/800/850 MHz). 3: f_L2 = 0.3*f_c (f_c = 900/950/1000 MHz).


Figure 5.7: Main aspects of the memory subsystem affecting core efficiency

[Diagram: core efficiency versus clock frequency f_c (GHz). Core efficiency decreases due to the broadening memory gap, and increases primarily due to enhancing the memory subsystem (memory, FSB, L2).]


Figure 5.8: Contrasting the efficiency of Intel's and AMD's processors

[Plot: efficiency (0.30 to 0.65) versus clock frequency f_c (0.5 to 4.0 GHz) for the Pentium III, Pentium 4, Athlon, Athlon XP and Athlon 64. Curves are labelled with L2 size/FSB rate: 512K/100, 256K/100, 256K/200, 512K/200, 256K/266, 512K/333, 256K/400, 512K/400, 512K/533, 512K/800, 1M/800, 2M/800 and 1M/f_FSB.]


Figure 5.9: Contrasting Intel's and AMD's processor design philosophies

[Plot: efficiency (0.35 to 0.80) versus clock frequency f_c (0.5 to 4.0 GHz) for the Pentium III, Pentium 4, Athlon, Athlon XP, Athlon 64 and Pentium M (2M/400), with curves labelled by L2 size/FSB rate. Two groups are marked: designs preferring core efficiency and designs preferring clock frequency.]


Implication of the emerging efficiency wall:

Diminishing returns at higher clock frequencies


6. The thermal wall

6. The thermal wall (1)

Dissipation (D):

$$D = D_d + D_s$$

Dynamic: $$D_d = A \cdot C \cdot V^2 \cdot f_c$$

Static: $$D_s = V \cdot I_{leak}$$

with

A: ratio of the active gates

C: effective capacitance of the gates

V: supply voltage

f_c: clock frequency

I_leak: leakage current
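The two dissipation terms can be evaluated with a short Python sketch. All parameter values below (activity ratio, capacitance, voltage, leakage current) are hypothetical round numbers chosen for illustration, not data for any real processor.

```python
def dynamic_power(a, c_farads, v_volts, f_c_hz):
    """Dynamic dissipation D_d = A * C * V^2 * f_c, in watts."""
    return a * c_farads * v_volts ** 2 * f_c_hz


def static_power(v_volts, i_leak_amps):
    """Static dissipation D_s = V * I_leak, in watts."""
    return v_volts * i_leak_amps


# Hypothetical chip: 10% active gates, 100 nF effective capacitance,
# 1.2 V supply, 3 GHz clock, 20 A total leakage current.
d_d = dynamic_power(a=0.1, c_farads=100e-9, v_volts=1.2, f_c_hz=3e9)
d_s = static_power(v_volts=1.2, i_leak_amps=20.0)
print(f"D_d = {d_d:.1f} W, D_s = {d_s:.1f} W, D = {d_d + d_s:.1f} W")

# Same chip at 2 GHz and 1.0 V, with the leakage current kept unchanged.
d_d_low = dynamic_power(a=0.1, c_farads=100e-9, v_volts=1.0, f_c_hz=2e9)
print(f"D_d at 2 GHz and 1.0 V = {d_d_low:.1f} W")
```

Because the dynamic term is linear in f_c but quadratic in V, lowering frequency and voltage together reduces dissipation much faster than it reduces clock rate, which is one reason power-aware designs trade clock frequency for other sources of performance.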


Figure 6.1: Chip dynamic and static power dissipation trends

Source: N. S. Kim et al., "Leakage Current: Moore's Law Meets Static Power", Computer, Dec. 2003, pp. 68-75.

Figure 6.2: Dynamic and static power dissipation trends

Source: D. Solie, "Technology Trends", Aug. 2006, http://www-03.ibm.com/procurement/proweb.nsf/objectdocswebview/file14+-+darryl+solie+-+ibm+power+symposium+presentation/$file/14+-+darryl+solie-ibm-power+symposium+presentation+v2.pdf

Figure 6.3: Relative dissipation of Intel's x86 family of processors

[Plot: dissipation per die area D/die area in W/cm² (logarithmic scale, 2 to 100) versus clock frequency f_c in MHz (logarithmic scale, 20 to 5000). Cores: P5, P54C, P54CS, P6, Klamath, Deschutes, Katmai, Coppermine, Tualatin, Willamette, Northwood, Prescott, annotated with process feature sizes from 0.8μ down to 0.09μ.]


Figure 6.4: Contrasting the evolution of Intel's and AMD's processor lines with the thermal wall

[Plot: efficiency (0.35 to 0.80) versus clock frequency f_c (0.5 to 4.0 GHz) for the Pentium III, Pentium 4, Athlon, Athlon XP, Athlon 64 and Pentium M (2M/400), with curves labelled by L2 size/FSB rate. A boundary labelled "Thermal wall" is drawn, whose position depends on core design and technology.]


Figure 6.5: Intel's P4 processor family (Netburst architecture)

[Roadmap diagram, 2000 to 2005, with four product lines: Xeon MP line, Xeon DP line, desktop line, and Celeron line (value PCs). Cores shown include Foster-MP, Gallatin, Potomac, Foster, Prestonia-A/B/C, Nocona, Irwindale-A/B/C, Jayhawk (cancelled 5/04), Willamette, Northwood-A/B/C, Prescott, Prescott-F, Tejas (cancelled 5/04), the Extreme Edition, Willamette-128, Northwood-128 and Celeron-D. Each core is annotated with its introduction date, feature size and transistor count (0.18 / 42 mtrs to 0.09 / 125 mtrs; mtrs = million transistors), FSB rate (400 to 1066 MHz), clock frequencies (1.4 to 3.8 GHz; 4.0/4.2 GHz planned for Tejas), on-die L2/L3 sizes and socket (PGA 423, PGA 478, PGA 603, PGA 604, LGA 775). The legend marks cores supporting hyperthreading, cores with EM64T implemented but not enabled, and cores supporting EM64T.]


Figure 6.6: The growth of relative dissipation of processors (in general)

Source: R. Hetherington, "The UltraSPARC T1 Processor", White Paper, Sun Inc., 2005


Implications of the thermal wall:

The approach of increasing performance by aggressively raising clock frequency met the thermal wall.

Processor designs now focus more and more on power-aware techniques.

7. The skew wall

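7. The skew wall (1)

Reason: skew between the bit lines of parallel buses.

Figure 7.1: Skew between lines of parallel buses

[Diagram: signal edges on bit 0 through bit 63 of a parallel bus arriving at different times; the spread between them is the skew.]

Figure 7.2: Equalizing skews among different bit lines of the processor bus on the MSI 915G Combo motherboard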



Implication of emerging skews between bit lines of parallel buses:

Introducing serial buses (also for slow peripheral buses, due to impressive cost savings)

Figure 7.3: Signal transfer over a serial bus

[Diagram: a differential signal pair D+ and D- encoding "0" and "1".]

Implication of the emerging limits of evolution:

The approach of aggressively raising clock frequencies met the efficiency, thermal and skew walls, and thus hit a dead end.

8. EPIC architectures/processors

8. EPIC architectures/processors (1)

$$P_a = f_c \cdot IPC_{eff}$$

• Aggressively raising clock frequency: the main road of evolution (Sections 4 – 7)

• Essentially widening the core by introducing EPIC architectures (Section 8)

Figure 8.1: Contrasting the principles of operation of superscalar and VLIW processors

[Diagram: Principle of superscalar processing: the instruction stream contains dependent instructions; the processor performs dynamic dependency resolution and dispatches to its functional units (FE). Principle of VLIW processing: each instruction word contains independent instructions (static dependency resolution), which the processor dispatches directly to its functional units (FE).]

VLIW: Very Long Instruction Word


VLIW → EPIC

EPIC: Explicitly Parallel Instruction Computing

EPIC is an enhanced VLIW (integration of advanced superscalar features):

• branch prediction

• explicit cache control

• ...

1994: Intel and HP

1997: EPIC designation

2001: IA-64 Itanium

Figure 8.2: Overview of Itanium cores

[Roadmap diagram, 2001 to 2005, with a multiprocessor (MP) line and a dual processor (DP) line. Introduction dates shown: 5/01, 7/02, 6/03, 9/03, 4/04, 11/04, 7/05. Core descriptions in the order they appear on the slide (mtrs = million transistors):]

Itanium (Merced): 0.18 µ / 25 mtrs, 733/800 MHz, 96K L2, 2/4M L3, 64-bit FSB, 266 MT/s

Itanium 2 (McKinley): 0.18 µ / 220 mtrs, 800/1000 MHz, 256K L2, 1.5/3M L3, 128-bit FSB, 400 MT/s

Itanium 2 (Madison): 0.13 µ / 410 mtrs, 1.5 GHz, 256K L2, 6M L3, 128-bit FSB, 400 MT/s

Itanium 2 (Madison): 0.13 µ / 410 mtrs, 1.4/1.6 GHz, 256K L2, 3M L3, 128-bit FSB, 400 MT/s

Itanium 2 (Madison): 0.13 µ / 592 mtrs, 1.5/1.6 GHz, 256K L2, 3/4/6/9M L3, 128-bit FSB, 400 MT/s

Itanium 2 (Madison): 1.66 GHz, 256K L2, 6/9M L3, 128-bit FSB, 667 MT/s

Itanium 2 (Madison): 0.13 µ / 410 mtrs, 1.4 GHz, 256K L2, 1.5M L3, 128-bit FSB, 400 MT/s

Footnotes on the slide: 1.5 GHz with 4 MB L3; 1.6 GHz with 3/6/9 MB L3; 400 MT/s for 4/6/9 MB L3; 400/533 MT/s for 3 MB L3.


Figure 8.3: The efficiency of Itanium processors

[Plot: efficiency (0.4 to 1.0) versus clock frequency f_c (500 to 2000 MHz). Itanium: 64-bit FSB/266 MT/s, with 96K L2/2M dir. L3 and 96K L2/4M dir. L3. Itanium 2: 128-bit FSB/400 MT/s, with 256K L2 and 3M, 6M or 9M L3, DDR 266.]


Figure 8.4: Expected spreading of the IA-64 architecture (Itanium processors)

Source: L. Gwennap: Intel’s Itanium and IA-64: Technology and Market Forecast, MDR, 2000


Figure 8.5: Revenue expectations concerning Intel’s Itanium line


In general purpose applications, EPIC architectures/processors play a decreasing role.


9. The end of an era in processor evolution

9. The end of an era in processor evolution (1)

In general purpose applications, beginning with the 2nd generation superscalars, processor efficiency leveled off, but both approaches to addressing the leveling off of efficiency met limits of evolution and thus hit a dead end.

Single-core complex superscalars: at the end of an era


A new era in processor evolution: the dawn of multicore, multithreaded processors

Available hardware complexity continues to increase exponentially (Moore's law): complexity doubles about every 24 months.

The number of processor cores will also double about every 24 months.

Figure 9.1: Rapid spreading of multi core processors, as revealed by Intel
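The doubling rule stated above is a simple exponential, sketched below. The starting point of two cores and the projection horizon are hypothetical; the slides only give the ~24-month doubling period.

```python
def projected_count(initial, months, doubling_period_months=24.0):
    """Project a quantity that doubles every `doubling_period_months`."""
    return initial * 2.0 ** (months / doubling_period_months)


# Hypothetical starting point: a dual-core processor at month 0.
for years in (0, 2, 4, 6, 8):
    cores = projected_count(initial=2, months=12 * years)
    print(f"after {years} years: about {cores:.0f} cores")
```

The same function applies to hardware complexity (for example, transistor count) when started from a different initial value.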

