Dezső Sima
Multicore and Manycore Processors
December 2008
Overview and Trends
Overview
1. Overview
2. Homogeneous multicore processors
  2.1 Conventional multicores
  2.2 Manycore processors
3. Heterogeneous multicore processors
  3.1 Master/slave architectures
  3.2 Attached processor architectures
4. Outlook
1. Overview – inevitability of multicores
Figure: Evolution of Intel’s IC fab technology [1]
1. Overview – inevitability of multicores (1)
Shrinking: ~ 0.7x every 2 years
1. Overview – inevitability of multicores (2)
IC fab technology
Shrinking: ~ 0.7x every 2 years
Moore's rule
• the same number of transistors fits on ½ the Si die area, or
• the same die area holds 2x as many transistors every two years
Doubling transistor counts ~ every two years (on the chips)
(2nd formulation, from 1975)
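As a quick arithmetic check of this rule (not part of the original slide), a linear shrink factor of ~0.7 every two years halves the area per transistor, which is exactly the doubling of the transistor budget on a fixed die area:

$$N(t) \approx N_0 \cdot 2^{\,t/(2\ \text{years})}, \qquad 0.7^2 \approx 0.49 \approx \tfrac{1}{2}$$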
Utilization of the surplus transistors?
Wider processor width: pipeline (issue width 1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
Doubling transistor counts ~ every two years
1. Overview – inevitability of multicores (3)
1. Overview – inevitability of multicores (4)
Figure: Parallelism available in applications [2]
Available parallelism in general purpose apps: ~ 4-5
Utilization of the surplus transistors?
• Wider processor width: pipeline (issue width 1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)
Doubling transistor counts ~ every two years
1. Overview – inevitability of multicores (5)
The best use of surplus transistors is: multiple cores
The inevitability of multicore processors
Increasing transistor count → diminishing returns in performance
1. Overview – inevitability of multicores (6)
with doubling of core numbers ~ every two years
Figure: Spreading Intel’s multicore processors [3]
1. Overview – inevitability of multicores (7)
1. Overview – inevitability of multicores (8)
Figure 1.1: Main classes of multicore/manycore processors

Multicore processors
• Homogeneous multicores
  - Conventional multicores: 2 ≤ n ≤ 8 cores, general purpose computing (desktops, servers)
  - Manycore processors: > 8 cores, prototypes / experimental systems
• Heterogeneous multicores
  - Master/slave architectures (MPC): MM/3D/HPC, production stage
  - Add-on architectures (CPU + GPU): HPC, near future
2. Homogeneous multicores
2.1 Conventional multicores•
2.2 Manycore processors•
2. Homogeneous multicores
Figure 2.1: Main classes of multicore/manycore processors
2.1 Conventional multicores
Multicore MP servers•
Intel’s multicore MP servers•
AMD’s multicore MP servers•
2.1 Intel’s multicore MP servers (1)
Figure 2.1.1: Intel’s Tick-Tock development model [13]
The evolution of Intel’s basic microarchitecture
Figure 2.1.2: Overview of Intel’s Tick-Tock model and the related MP servers [24]
90 nm:
  TICK – Pentium 4 (Prescott):   1x1 C, 8 MB L2               (Potomac)
  TOCK – Pentium 4 (Irwindale):  2x1 C, ½ MB L2/C             7000 (Paxville MP)
                                 1x1 C, 1 MB L2               (Cransfield)
65 nm:                           2x1 C, 1 MB L2/C, 16 MB L3   7100 (Tulsa)
                                 1x2 C, 4 MB L2/C             7200 (Tigerton DC)
                                 2x2 C, 4 MB L2/C             7300 (Tigerton QC)
45 nm:                           1x6 C, 3 MB L2/2C, 16 MB L3  7400 (Dunnington)
                                 1x8 C, ¼ MB L2/C, 24 MB L3   7xxx (Beckton, 1Q/2009)
3/2005: First 64-bit MP Xeons    11/2005: First DC MP Xeon
2.1 Intel’s multicore MP servers (2)
Intel’s Tick-Tock model for MP servers
Figure 2.1.3: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)
System architecture (before Potomac): four single-core Xeon MPs¹ attached to the preceding NBs; I/O hub link typically HI 1.5 (266 MB/s).
¹ Xeon MPs before Potomac
2.1 Intel's multicore MP servers (3)
Figure 2.1.4: Intel's Xeon-based MP server platforms

Truland platform
  MP chipsets: 8500 (Twin Castle, 3/2005) – 2x FSB 667 MT/s, 4 x XMB (2 x DDR2), 32 GB
               8501 (4/2006) – 2x FSB 800 MT/s, 4 x XMB (2 x DDR2), 32 GB
  MP cores (P4-based):
    Xeon MP (Potomac SC, 3/2005): 90 nm / 675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604
    Xeon 7000 (Paxville MP DC, 11/2005): 90 nm / 2x169 mtrs, 2x1 (2) MB L2, 800/667 MT/s, mPGA 604
    Xeon 7100 (Tulsa DC, 8/2006): 65 nm / 1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
2.1 Intel’s multicore MP servers (4)
First 64-bit server
Figure 2.1.5: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)
Up to 2005: four single-core Xeon MPs¹ on the preceding NBs; I/O hub link typically HI 1.5 (266 MB/s).
2005 (Truland): four Potomac² / Paxville MP³ (SC/DC) processors on the 8500/8501 (Twin Castle) NB; memory attached through 4 XMBs (External Memory Bridges) over serial links; 28 PCIe lanes + HI 1.5 to the I/O hub.
¹ Xeon MPs before Potomac
² First x86-64 MP processor
³ The 8500 also supports Cransfield (SC) and Tulsa (DC)
2.1 Intel's multicore MP servers (5)
Figure 2.1.6: Intel's Xeon-based MP server platforms

Truland platform
  MP chipsets: 8500 (Twin Castle, 3/2005) – 2x FSB 667 MT/s, 4 x XMB (2 x DDR2), 32 GB
               8501 (4/2006) – 2x FSB 800 MT/s, 4 x XMB (2 x DDR2), 32 GB
  MP cores (P4-based):
    Xeon MP (Potomac SC, 3/2005): 90 nm / 675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604
    Xeon 7000 (Paxville MP DC, 11/2005): 90 nm / 2x169 mtrs, 2x1 (2) MB L2, 800/667 MT/s, mPGA 604
    Xeon 7100 (Tulsa DC, 8/2006): 65 nm / 1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
Caneland platform
  MP chipset: 7300 (Clarksboro, 9/2007) – 4x FSB 1066 MT/s, 4 x FBDIMM (DDR2), 512 GB
  MP cores:
    Xeon 7200 (Tigerton DC, 9/2007): Core2-based, 65 nm / 2x291 mtrs, 2x4 MB L2, 1066 MT/s, mPGA 604
    Xeon 7300 (Tigerton QC, 9/2007): Core2-based, 65 nm / 2x291 mtrs, 2x(4/3/2) MB L2, 1066 MT/s, mPGA 604
    Xeon 7400 (Dunnington 6C, 9/2008): Core2-based, 45 nm / 1900 mtrs, 9/6 MB L2, 16/12/8 MB L3, 1066 MT/s, mPGA 604
2.1 Intel’s multicore MP servers (6)
Figure 2.1.7: Evolution of Intel’s Xeon MP-based system architecture
(until the appearance of Nehalem)
Up to 2005: four single-core Xeon MPs¹ on the preceding NBs; I/O hub link typically HI 1.5 (266 MB/s).
2005 (Truland): four Potomac² / Paxville MP³ (SC/DC) processors on the 8500/8501 (Twin Castle) NB; memory attached through 4 XMBs; 28 PCIe lanes + HI 1.5 to the I/O hub.
2007 (Caneland): four Tigerton / Dunnington (DC/QC/6C) processors, each on its own FSB to the 7300 (Clarksboro) NB; FB-DIMM (DDR2) memory; 28 PCIe lanes + ESI to the I/O hub.
¹ Xeon MPs before Potomac
² First x86-64 MP processor
³ The 8500 also supports Cransfield (SC) and Tulsa (DC)
2.1 Intel's multicore MP servers (7)
2.1 Intel’s multicore MP servers (8)
Figure 2.1.8: Nehalem’s key innovations concerning the system architecture [22]
Nehalem’s key innovations concerning the system architecture (11/2008)
2.1 Intel’s multicore MP servers (9)
Figure 2.1.9: Nehalem’s key innovations concerning the system architecture [22]
Nehalem’s key innovations concerning the system architecture (11/2008)
2.1 Intel's multicore MP servers (10)
Figure 2.1.10: Intel's Nehalem based MP server architecture – four Beckton (8C) processors interconnected by QPI links; 4x FB-DIMM memory.
QPI: QuickPath Interconnect
11/2008: Nehalem
AMD’s multicore MP servers•
2.1 AMD’s multicore MP servers (1)
AMD Direct Connect Architecture (2003)
• Integrated Memory Controller
• Serial HyperTransport links
Figure 2.1.11: AMD’s Direct Connect Architecture [14]
Remark
• 3 HT 1.0 links at introduction (K8)
• 4 HT 3.0 links with K10 (Barcelona)
Introduced in 2003 along with the x86-64 ISA extension
(Intel: 2008 with Nehalem)
2.1 AMD’s multicore MP servers (2)
Use of available HyperTransport links [44]
UPs
Each link supports connections to I/O devices
DPs
Two links support connections to I/O devices; any one of the three links may connect to another DP or MP processor
MPs
Each link supports connections to I/O devices or other DP or MP processors
(Opteron sockets with registered DDR2 memory per socket, interconnected by HT links, with further HT links to PCI, PCI-X and PCI Express I/O bridges.)
Figure 2.1.12: 2P and 4P server architectures based on AMD’s Direct Connect Architecture [15], [16]
2.1 AMD’s multicore MP servers (3)
Figure 2.1.13: Block diagram of Barcelona (K10) vs K8 [17]
(K10)
2.1 AMD’s multicore MP servers (4)
Figure 2.1.14: Possible use of Barcelona’s four HT 3.0 links [39]
2.1 AMD’s multicore MP servers (5)
Novel features of HT 3.0 links, such as
• higher speed or
• splitting a 16-bit HT link into two 8-bit links,
can be utilized only with a new platform. Current platforms (2nd gen. Socket F with available chipsets) do not support HT 3.0 links [46].
2.1 AMD’s multicore MP servers (6)
Figure 2.1.15: AMD’s roadmap for server processors and platforms [19]
2.1 AMD’s multicore MP servers (7)
2.2 Manycore processors
Figure 2.2.1: Main classes of multicore/manycore processors
2.2 Manycore processors
Intel’s Larrabee•
Intel’s Tiled processor•
Larrabee
Part of Intel’s Tera-Scale Initiative.
Project started ~ 2005First unofficial public presentation: 03/2006 (withdrawn) First brief public presentation 09/07 (Otellini) [29] First official public presentations: in 2008 (e.g. at SIGGRAPH [27])Due in ~ 2009
• Performance (targeted): 2 TFlops
• Brief history:
• Objectives:
Not a single product but a base architecture for a number of different products.High end graphics processing, HPC
2.2 Intel’ Larrabee (1)
Figure 2.2.2: Block diagram of the Larrabee [4]
Basic architecture
• Cores: In order, 4-way multithreaded x86 IA cores, augmented with SIMD-16 capability
• L2 cache: fully coherent
• Ring bus: 1024 bits wide
2.2 Intel’ Larrabee (2)
Figure 2.2.5: Larrabee vs the Pentium [11]
Main extensions
• 64-bit instructions
• 4-way multithreaded (with 4 register sets)
• addition of a 16-wide (16x32-bit) VU
• increased L1 caches (32 KB vs 8 KB)
• access to its 256 KB local subset of a coherent L2 cache
• ring network to access the coherent L2 $ and allow inter-processor communication
2.2 Intel’ Larrabee (3)
Figure 2.2.3: Block diagram of the Vector Unit [5]
The Vector Unit
• VU scatter-gather instructions: load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly.
• Numeric conversions: 8-bit and 16-bit integer and 16-bit FP data can be read from or written into the L1 $, with conversion to 32-bit integers or floats without penalty. The L1 D$ thus becomes an extension of the register file.
• Mask registers: have one bit per vector lane, to control which lanes of a vector register or of memory data are read or written and which remain untouched (see the sketch below).
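As an illustration only (Larrabee's vector ISA was never fully published, so the names below are hypothetical), the gather/scatter and per-lane mask semantics described above can be sketched in plain C++, with the 16-wide vector, the index vector and the mask modeled as arrays:

#include <array>
#include <cstdint>

// Hypothetical model of a 16-wide masked gather: for every lane whose mask bit
// is set, load base[idx[lane]] into the destination; lanes with a cleared mask
// bit keep their previous contents (merging semantics).
using Vec16  = std::array<float, 16>;
using Idx16  = std::array<int32_t, 16>;
using Mask16 = uint16_t;               // one bit per vector lane

Vec16 masked_gather(const float* base, const Idx16& idx, Mask16 mask, Vec16 dst)
{
    for (int lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            dst[lane] = base[idx[lane]];   // non-contiguous loads
    return dst;
}

// Masked scatter: only lanes with a set mask bit are written back to memory.
void masked_scatter(float* base, const Idx16& idx, Mask16 mask, const Vec16& src)
{
    for (int lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            base[idx[lane]] = src[lane];
}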
2.2 Intel’ Larrabee (4)
Figure 2.2.4: Layout of the 16-wide vector ALU [5]
• ALUs execute integer, SP and DP FP instructions
• Multiply-add instructions are available.
ALUs
2.2 Intel’ Larrabee (5)
Figure 2.2.6: System architecture of a Larrabee based 4-processor MP server [6]
2.2 Intel’ Larrabee (6)
CSI: Common Systems Interface (Serial packet-based bus)
2.2 Intel’ Larrabee (7)
Programming of Larrabee [5]
• Larrabee has x86 cores with an unspecified ISA extension,
2.2 Intel’ Larrabee (8)
Figure 2.2.7: Intel’s ISA extensions [11]
AES: Advanced Encryption Standard
AVX: Advanced Vector Extension
FMA: FP fused multiply-add instr. supporting 256-bit/128-bit SIMD
2.2 Intel’ Larrabee (9)
Programming of Larrabee [5]
• Larrabee has x86 cores with an unspecified ISA extension,
• the x86 cores allow Larrabee to be programmed like usual x86 processors, using enhanced C/C++ compilers from MS, Intel, GCC etc.,
• this is a huge advantage compared to the competition (Nvidia, AMD/ATI),
Intel’s Tiled processor•
Tiled Processor
• First implementation of Intel's Tera-Scale Initiative (among more than 100 projects)
• Aim: Tera-Scale research chip
  - high-bandwidth interconnect
  - energy management
  - programming manycore processors
• Milestones of the development:
  - Announced at IDF Fall 2006 (9/2006)
  - Details at ISSCC 2007 (2/2007)
  - Due in 2009/2010
2.2 Intel’s Tiled processzor (1)
Remark
Based on ideas of the Raw processor (MIT)
Figure 2.2.8: Basic structure of the Tiled Processor [7]
2.2 Intel’s Tiled processzor (2)
2 single precision FP (Multiply-Add)
Figure 2.2.9: Block diagram of a tile [7], [9]
VLIW microarchitecture?
2.2 Intel’s Tiled processzor (3)
(For debugging)
SP FP cores
2.2 Intel’s Tiled processzor (4)
Figure 2.2.10: Die shot of the Tiled Processor [8]
Figure 2.2.13: Ring based interconnect network topology [7]
2.2 Intel’s Tiled processzor (5)
Figure 2.2.14: Mesh interconnect topology [7]
2.2 Intel’s Tiled processzor (6)
Figure 2.2.11: Integration of dedicated hardware units (accelerators) [7]
2.2 Intel’s Tiled processzor (7)
2.2 Intel’s Tiled processzor (8)
Figure 2.2.12: Sleeping inactivated cores [7]
2.2 Intel’s Tiled processzor (9)
Figure 2.2.15: Performance figures of the Tiled Processor [7]
Matrix multiplication (Single Precision)
Peak performance
4 SP FP/cycle
at 4 GHz:
1.6 TFLOPS
3. Heterogeneous multicores
3.1 Master/slave architectures
3.2 Attached architectures
3. Heterogeneous multicores
Figure 3.1: Main classes of multicore processors
3.1 Master/slave architectures
The Cell BE•
3.1 The Cell BE (1)
Computational model
Master/slave computational model with cacheless private memory spaces (LSs)
• The master/slave computational model
  - allows tasks to be delegated to dedicated, task-efficient units,
  - needs efficient mechanisms for
    • transferring the tasks (programs and data) from the master to the slaves and the results back from the slaves to the master,
    • synchronization between the master and the slaves,
    • inter-core communication and synchronization.
• Cacheless private memory spaces
  - allow efficient utilization of the die area for computations,
  - need an efficient LS-based microarchitecture for the slaves.
Performance @ 3.2 GHz:
QS21 peak performance (SP FP): 409.6 GFlops (3.2 GHz x 2x8 SPEs x 2x4 SP FP operations/cycle)
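The 409.6 GFlops figure follows directly from the quoted datapath widths (an arithmetic check, no additional data assumed): each SPE issues a 4-wide single-precision multiply-add per cycle (8 flops), and a QS21 blade carries two Cell chips with 8 SPEs each:

$$3.2\ \text{GHz} \times (2 \times 8)\ \text{SPEs} \times (2 \times 4)\ \text{flops/cycle} = 409.6\ \text{GFlops (SP)}$$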
3.1 The Cell BE (2)
3.1 The Cell BE (3)
Figure 3.1.2: Cell roadmap from 2007 [22]
3.2 Attached architectures
Figure 3.2.1: Main classes of multicore/manycore processors
3.2 Attached architectures
Introduction to GPGPUs•
The SIMT computational model (CM)•
Recent implementations of the SIMT CM•
Intel’s future processors with attached architecture•
AMD’s future processors with attached architecture•
Introduction to GPGPUs•
Figure 3.2.2: Evolution of the microarchitecture of GPUs [23]
3.2 Introduction to GPGPUs (1)
Evolution of the microarchitecture of GPUs
Figure 3.2.3: Simplified block
diagram of AMD/ATI’s RV770 [24]
160 cores x 5 execution units
3.2 Introduction to GPGPUs (2)
Figure 3.2.4: Simplified structure of a core of the RV770 GPGPU [24]
Execution units (Stream Processing Units)
• 32-bit FP (ADD, MUL, MADD) • 64-bit FP • 32-bit FX . . .
3.2 Introduction to GPGPUs (3)
3.2 Introduction to GPGPUs (4)
Figure 3.2.5: Peak SP FP performance figures: Nvidia's GPUs vs Intel's CPUs [25]
3.2 Introduction to GPGPUs (5)
Figure 3.2.6: Bandwidth figures: Nvidia’s GPUs vs Intel’s CPUs [GB/s] [25]
Not cached
Figure 3.2.7: Utilization of the die area in CPUs vs GPUs [25]
3.2 Introduction to GPGPUs (6)
Based on their FP32 computing capability and the large number of execution units available
GPUs with unified shader architecture are prospective candidates for speeding up HPC!
GPUs with unified shader architectures also termed as
GPGPUs
(General Purpose GPUs)
3.2 Introduction to GPGPUs (7)
For HPC computations: the SIMT (Single Instruction Multiple Threads) computational model
Use of GPUs for HPC
The SIMT computational model (CM)•
Main alternatives of data parallel execution
Data parallel execution
SIMD execution SIMT execution
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors
• One/two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices)
Figure 3.2.8: Main alternatives of data parallel execution
3.2 The SIMT computational model (1)
Scalar execution SIMD execution SIMT execution
Domain of execution:single data elements
Domain of execution:elements of vectors
Domain of execution:elements of matrices
(at the programming level)
Figure 3.2.9: Scope of the data parallel execution vs scalar execution (at the programming level)
Remarks
1. SIMT execution is also termed as SPMD (Single-Program Multiple-Data) execution (Nvidia)2. At the processor level two dimensional domains of execution can be mapped to any set of cores (e.g. to a line of cores).
3.2 The SIMT computational model (2)
Main alternatives of data parallel execution
Data parallel execution
SIMD execution SIMT execution
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors
• One/two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices)
E.g. 2nd and 3rd generation superscalars
GPGPUs, data parallel accelerators
Figure 3.2.10: Main alternatives of data parallel execution
• is massively multithreaded, and provides
• data dependent flow control as well as
• barrier synchronization
(a sketch follows below)
3.2 The SIMT computational model (3)
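A minimal sketch of the SIMT idea in plain C++ (illustrative only, not any vendor's actual execution model; the group width is an assumption): every logical thread of the 2-D domain runs the same kernel, threads are processed in fixed-width groups in lock step, and partial groups are handled with an active mask:

#include <cstdio>
#include <vector>

constexpr int GROUP = 8;   // lock-step width of one thread group (assumed value)

// The "kernel" each logical thread executes; tid identifies its data element.
float kernel(int tid, const std::vector<float>& in)
{
    float v = in[tid];
    return (v < 0.0f) ? -v : v * 2.0f;   // data-dependent flow control per thread
}

int main()
{
    const int nx = 4, ny = 8;                       // 2-D domain of execution
    std::vector<float> in(nx * ny, -1.5f), out(nx * ny);

    // One logical thread per data element, executed group by group in lock step.
    for (int base = 0; base < nx * ny; base += GROUP)
        for (int lane = 0; lane < GROUP; ++lane) {
            int tid = base + lane;
            bool active = tid < nx * ny;            // active mask for partial groups
            if (active) out[tid] = kernel(tid, in);
        }
    // implicit barrier: all groups have finished before the results are used
    std::printf("out[0] = %f\n", out[0]);
    return 0;
}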
Recent implementations of the SIMT CM•
Basic implementation alternatives of the SIMT execution
GPGPUs Data parallel accelerators
Dedicated units supporting data parallel execution
with appropriate programming environment
Programmable GPUs with appropriate
programming environments
E.g. Nvidia’s 8800 and GTX linesAMD’s HD 38xx, HD48xx lines
Nvidia’s Tesla linesAMD’s FireStream lines
Have display outputs No display outputsHave larger memories than GPGPUs
Figure 3.2.12: Basic implementation alternatives of the SIMT execution
3.2 Recent implementations of the SIMT CM (1)
GPGPUs
Figure 3.2.13: GPGPU families of Nvidia and AMD/ATI
• Nvidia's line: G80 (90 nm) → shrink → G92 (65 nm) → enhanced arch. → G200 (65 nm)
• AMD/ATI's line: R600 (80 nm) → shrink → RV670 (55 nm) → enhanced arch. → RV770 (55 nm)
3.2 Recent implementations of the SIMT CM (2)
Figure 3.2.14: Overview of GPGPUs

Nvidia
  Cores: G80 (11/06, 90 nm / 681 mtrs) → G92 (10/07, 65 nm / 754 mtrs) → GT200 (6/08, 65 nm / 1400 mtrs)
  Cards: 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit), 8800 GT (112 ALUs, 256-bit),
         GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
  CUDA:  Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08)
AMD/ATI
  Cores: R500 (11/05, Xbox, 48 ALUs) → R600 (5/07, 80 nm / 681 mtrs) → RV670 (11/07, 55 nm / 666 mtrs) → RV770 (5/08, 55 nm / 956 mtrs)
  Cards: HD 2900XT (320 ALUs, 512-bit), HD 3850 (320 ALUs, 256-bit), HD 3870 (320 ALUs, 256-bit),
         HD 4850 (800 ALUs, 256-bit), HD 4870 (800 ALUs, 256-bit)
  SW:    Brook+ (11/07), RapidMind (3870 support: 6/08)
3.2 Recent implementations of the SIMT CM (3)
Implementation alternatives of data parallel accelerators
• On-card implementation (recent implementations): e.g. GPU cards – Nvidia's Tesla and AMD/ATI's FireStream accelerator families
Figure 3.2.15: Implementation alternatives of data parallel accelerators
Data parallel accelerators
3.2 Recent implementations of the SIMT CM (4)
Figure 3.2.16: Overview of Nvidia's Tesla family

Card:      C870 (6/07): G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
           C1060 (6/08): GT200-based, 4 GB GDDR3, 0.936 TFLOPS
Desktop:   D870 (6/07): G80-based, 2 x C870 incl., 3 GB GDDR3, 1.037 TFLOPS
1U Server: S870 (6/07): G80-based, 4 x C870 incl., 6 GB GDDR3, 2.074 TFLOPS
           S1070 (6/08): GT200-based, 4 x C1060, 16 GB GDDR3, 3.744 TFLOPS
CUDA:      Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)
3.2 Recent implementations of the SIMT CM (5)
Figure 3.2.17: Overview of AMD/ATI's FireStream family

Card: FireStream 9170 (announced 11/07, shipped 6/08): RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
      FireStream 9250 (announced 6/08, shipped 10/08): RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64
SW:   Stream Computing SDK Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); RapidMind
3.2 Recent implementations of the SIMT CM (6)
Implementation alternatives of data parallel accelerators
• On-card implementation (recent implementations): e.g. GPU cards – Nvidia's Tesla and AMD/ATI's FireStream accelerator families
• On-die integration (future implementations): e.g. Intel's Havendale, AMD's Fusion integration technology
Trend: from on-card implementation towards on-die integration
Figure 3.2.15: Implementation alternatives of data parallel accelerators
Data parallel accelerators
3.2 Recent implementations of the SIMT CM (7)
Figure 3.2.18: Expected evolution of attached GPGPUs [42]
Integration to the chip
Intel’s future processors with attached architecture•
3.2 Intel’s future processors with attached architecture (1)
Figure 3.2.19: Intel’s desktop roadmap [26]
(Pentium 4 → Core 2 → Core i7 (Nehalem); Q4/08)
3.2 Intel’s future processors with attached architecture (2)
Figure 3.2.20: A part of Intel’s desktop roadmap [26]
(Q4/08 – Q3/09, 45 nm)
AMD’s future processors with attached architecture•
3.2 AMD’s future processors with attached architecture (1)
Figure 3.2.21: AMD’s view about the major phases of processor evolution [27]
6/2006 The Torrenza initiative (2006 Technology Analyst Day)
• Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40].
3.2 AMD’s future processors with attached architecture (4)
Figure 3.2.22: Introduction of the Torrenza platform level integration technique [40]
(cache coherent HT)
3.2 AMD’s future processors with attached architecture (3)
6/2006 The Torrenza initiative (2006 Technology Analyst Day)
• Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40].
10/2006 Acquisition of ATI
10/2006 The Fusion initiative
• Silicon-level integration of accelerators into AMD processors (first Fusion processors due by the end of 2008 / early 2009) [41]
3/2007 “Integration” of the Torrenza and the Fusion initiatives into a continuum of accelerated computing solutions
3.2 AMD’s future processors with attached architecture (4)
Figure 3.2.23: The Torrenza platform and the Fusion integration technologyas a continuum for accelerated computing solutions [29]
Remark: It is based on an earlier Alienware presentation from 6/2006 [38].
3.2 AMD’s future processors with attached architecture (5)
Implementation of Fusion processors
• 2007/2008 AMD made a number of confusing announcements and withdrawals [31] – [35].
• According to the latest announcements (11/2008) AMD plans to introduce 32 nm Fusion processors only in 2011 [37].
3.2 AMD’s future processors with attached architecture (6)
Figure 3.2.24: AMD’ 2008 roadmap for client processors [37]
3.2 AMD’s future processors with attached architecture (7)
4. Outlook
4. Outlook (1)
Outlook
Heterogeneous multicores
Master/slavearchitectures
Add-onarchitectures
Master/slave architectures: 1(Ma):M(S) → 2(Ma):M(S) → M(Ma):M(S)
Add-on architectures: 1(CPU):1(D) → M(CPU):1(D) → M(CPU):M(D)
Ma: Master, S: Slave, D: Dedicated (like GPU), H: Homogeneous, M: Many
Both lines converge: M(Ma) = M(CPU), with M(S) / M(D) attached cores
Figure 4.1: Expected evolution of heterogeneous multicore processors
The future of heterogeneous multicores
4. Outlook (2)
Heterogeneous multicores: M(CPU):M(D)
The future of homogeneous multicores
Larrabee
Tiled processor
In fact: both are of the same type: M(CPU):M(D)
4. Outlook (3)
Figure 4.2: Simplified block diagrams of Larrabee and the Tiled processor [4], [7]
4. Outlook (4)
The main road of processor evolution M(CPU):M(D)
Thank you for your attention!
5. References
[1]: Bhandarkar D., „The Dawn of a New Era”, 11. EMEA, May 2006, Budapest,
[2]: Wall D. W., "Limits of ILP," WRL, TN-15, Dec. 1990, DEC, http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-15.html
[3]: Loktu A.,”Itanium 2 for Enterprise Computing,” http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps
[4]: Stokes J., Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee- intels-biggest-leap-ahead-since-the-pentium-pro.html
[5]: Shimpi A. L. & Wilson D., "Intel's Larrabee Architecture Disclosure: A Calculated First Move," Anandtech, Aug. 4, 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2
[6]: Timm J.-F., “Larrabee: Fakten zur Intel Highend-Grafikkarte,” Computer Base, 2. Juni 2007, http://www.computerbase.de/news/hardware/grafikkarten/2007/juni/larrabee_fakten_ intel_highend-grafikkarte/
[7]: Shrout R., “Intel’s 80 Core Terascale Chip Explored: 4 GHz Clocks and more,” PC Perspective, Feb. 11. 2007, http://www.pcper.com/article.php?aid=363
[8]: Goto H., “Intel’s Manycore CPUs,” PC Watch, June 11. 2007, http://pc.watch.impress.co.jp/docs/2007/0611/kaigai364.htm
[9]: Hoskote Y. & al., “A 5-GHz Mesh Interconnect for a Teraflops Processor,” IEEE Micro, Sept./Oct. 2007, (Vol. 27 No. 5), pp. 51-61
[10]: Taylor M. & al., “The Raw Processor,” Hot Chips Aug. 13. 2001, http://www.hotchips.org/archives/hc13/3_Tue/22mit.pdf
[11]: Goto H., "Larrabee architecture can be integrated into CPU," PC Watch, Oct. 06, 2008, http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[12]: Stokes J., “Larrabee: Intel's biggest leap since the Pentium Pro,” Ars Technica, Aug. 04, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels- biggest-leap-ahead-since-the-pentium-pro.html
[13]: Singhal R., "Next Generation Intel Microarchitecture (Nehalem) Family: Architecture Insight and Power Management," IDF Taipei, Oct. 2008, http://intel.wingateweb.com/taiwan08/published/sessions/TPTS001/FA08%20IDF -Taipei_TPTS001_100.pdf
[14]: AMD Opteron™ Processor for Servers and Workstations, http://amd.com.cn/CHCN/Processors/ProductInformation/0,,30_118_8826_8832, 00-1.html
[15]: AMD Opteron Processor with Direct Connect Architecture, 2P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/2P_Power_PID_41497.pdf
[16]: AMD Opteron Processor with Direct Connect Architecture, 4P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/4P_Power_PID_41498.pdf
[17]: Kanter D., “Inside Barcelona: AMD's Next Generation, Real World Tech., May 16. 2007, http://www.realworldtech.com/page.cfm?ArticleID=RWT051607033728
[18]: Kanter D,, “AMD's K8L and 4x4 Preview, Real World Tech. June 02. 2006, http://www.realworldtech.com/page.cfm?ArticleID=RWT060206035626&p=1
[19]: Enderle R., "AMD Shanghai: We are back!," TGDaily, November 13, 2008, http://www.tgdaily.com/content/view/40176/128/
[20]: Gschwind M., "Chip Multiprocessing and the Cell BE," ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[21]: Wright C., Henning P., Bergen B., “Roadrunner Tutorial – An Introduction to Roadrunner and the Cell Processor,” Febr. 7 2008, http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Roadrunner-tutorial-session-1-web1.pdf
[22]: Hofstee H. P., “Industry trends in Microprocessor Design,”, IBM, Oct. 4 2007, http://lanl.gov/orgs/hpc/roadrunner/rrinfo/RR%20webPDFs/Cell_Hofstee_Non_Conf.pdf
[23]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[24]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[25]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia
[26]: Goto H., “Intel Desktop CPU Roadmap,” 2008, http://pc.watch.impress.co.jp/docs/2008/0326/kaigai02.pdf
[27]: The Industry-Changing Impact of Accelerated Computing – Fusion White Paper, AMD, 2008, http://www.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf
[28]: AMD Announces Initiatives To Elevate AMD64 As Platform For System- And Industry-Wide Innovation, AMD, June 1, 2006, http://www.amd.com/us-en/Weblets/0,,7832_8366_5730~109409,00.html
[29]: Metal G., “AMD Torrenza and Fusion together,” Metalghost, March 22 2007, http://www.metalghost.ro/index.php?view=article&catid=30%3Ahardware&id =233%3Aamd-torrenza-and-fusion-together&option=com_content
[30]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[31]: Hester P., 2007 Technology Analyst Day, AMD, July 26 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007 _AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf
[32]: Rivas M., 2007 Financial Analyst Day, AMD, Dec. 13. 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007 _AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf
[33]: Smalley T., “Shrike is AMD’s First Fusion Platform,”, Trusted Reviews, June 9. 2008, http://www.trustedreviews.com/notebooks/news/2008/06/09/Shrike-Is-AMDs- First-Fusion-Platform/p1
[34]: Hruska J., “AMD Fusion now pushed back to 2011,” Ars Technica, Nov. 14. 2008, http://arstechnica.com/news.ars/post/20081114-amd-fusion-now-pushed-back-to -2011.html
[35]: Gruener W., “AMD delays Fusion processor to 2011,” TgDaily, Nov. 13. 2008, http://www.tgdaily.com/content/view/40186/135
[36]: Wilson D., “AMD Analyst Day Platform Announcements,” Anandtech, June 2. 2006, http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2768&p=2
[37]:Allen R., Financial Analyst Day, AMD, Nov. 13. 2008, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/RandyAllen AMD2008AnalystDay11-13-2008.pdf
[38]: Gonzales N., 2006 Technology Analyst Day, Alienware, June 1. 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHester AMDAnalystDayV2.pdf
[39]: Hester P., 2006 Technology Analyst Day, AMD, June 1. 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHester AMDAnalystDayV2.pdf
[40]: Seyer M., 2006 Technology Analyst Day, AMD, June 1. 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MartySeyer AMDAnalystWebv3.pdf
[41]: AMD Completes ATI Acquisition and Creates Processing Powerhouse , Oct. 25. 2006, http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543 ~113741,00.html
[42]: Stokes J., “A closer look at AMD’s CPU/GPU Fusion,” Ars Technica, Nov. 19. 2006, http://arstechnica.com/news.ars/post/20061119-8250.html
Figure: AMD's Tick-Tock model and the related Opteron MP servers
130 nm: 840-850 (Sledgehammer) – 1x1 C, 1 MB L2
90 nm: 842-856 (Athens), 865-890 (Egypt), 82xx (Santa Rosa)
65 nm: 8347-56 (Barcelona) – 1x4 C, 1/2 MB L2/C, 2 MB L3
45 nm: 8378-84 (Shanghai)
Figure: Larrabee’s Software stack [12]
Figure: Layout of MIT’s Raw Processor [10]
3.1 The Cell BE (1)
Cell BE
• Joint development of Sony, IBM and Toshiba
• Aim: games, multimedia, and in addition HPC
Summer 2000: basic decisions concerning the architecture
11/2006: Playstation 3 (PS3)
QS2x Blade Server family: 02/2006 Cell Blade QS20, 08/2007 Cell Blade QS21, 05/2008 Cell Blade QS22
Rumors (9/2008): 2011? Playstation 4 (competition: XBox3) – 2x PS3 performance, 12 cores / 45 nm, GDDR3/DDR3 (instead of XDR)
EIB: Element Interconnect Bus
Figure 3.1.1: Block diagram of the Cell BE [20]
SPE: Synergistic Processing Element
SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store (256 KB)
SMF: Synergistic Memory Flow Unit
PPE: Power Processing Element
PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
XDR: Rambus DRAM
3.1 The Cell BE (3)
Figure: Layout of the EIB [21]
3.1 Master/slave multicore processors – The Cell (4)
Figure: Concurrent data transfers over the EIB [21]
3.1 Master/slave multicore processors – The Cell (5)
Massive multithreading
Multithreading is implemented by
creating and managing parallel executable threads for each data element of the execution domain.
Figure: Threads allocated to the elements of an execution domain
Same instructions for all data elements
3.2 Attached multicore processors (10)
Figure 3.2.11: Per-thread contexts needed per ALU for fast context switching
(A SIMT core: a Fetch/Decode unit feeding an array of ALUs; each ALU has a register file (RF) holding many per-thread contexts (CTX), one of which is the actual context; a context switch simply selects another stored context – see the sketch after this slide.)
3.2 The SIMT computational model (4)
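A sketch of the idea behind per-ALU context stores (sizes and field names are illustrative assumptions, not taken from the referenced figure): because every resident thread keeps its own register slice, "switching" to another thread is only an index change, so a long memory latency can be hidden by issuing from a different ready thread:

#include <array>
#include <cstdint>

// Hypothetical per-ALU context store: each resident thread owns a register
// slice, so a context switch is just a change of the current index.
struct ThreadContext {
    std::array<float, 32> regs{};   // per-thread registers
    uint32_t pc = 0;                // per-thread program counter
    bool     waiting_on_memory = false;
};

struct SimtAlu {
    std::array<ThreadContext, 24> ctx{};  // many resident contexts per ALU
    int current = 0;

    // "Zero-cost" switch: pick the next context that is not stalled on memory;
    // if all are stalled, the current context is kept.
    ThreadContext& next_ready()
    {
        const int n = static_cast<int>(ctx.size());
        for (int i = 1; i <= n; ++i) {
            int cand = (current + i) % n;
            if (!ctx[cand].waiting_on_memory) { current = cand; break; }
        }
        return ctx[current];
    }
};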
Figure: Early vision of the integration of CPUs and GPUs (presented by Alienware, a performance PC maker) [38]
Alienware’s early vision on the integration of CPUs and GPUs (6/2006)
Figure: AMD’s view about the evolution of mainstream computing [30]
5.2 AMD/ATI’s GPGPU line (1)
Figure: AMD’s planned 32 nm mobile/mainstream Falcon Fusion family [31]
(32 nm brand new core)
AMD’s plans to implement Fusion class processors
The 32 nm Falcon processor with the Bulldozer CPU core (7/2007: Technology Analyst Day)
(UVD: Unified Video Decoder)
The 45 nm Swift processor family (12/2007: Financial Analyst Day)
Figure: AMD’s planned 45 nm Swift Fusion processor family [32]
(K10)
Figure: AMD’s planned 45 nm Shrike mobile platform with the Swift processor [33]
The 45 nm Shrike platform with the Swift processor (6/2008)
Nov. 2008 (Financial Analyst Day):
AMD cancelled both the 45 nm Shrike platform and the Swift processor [34], [35]
Reason
The 45 nm implementation would result only in modest improvements in performance, power and cost.
Recent plan
• The planned CPU/GPU integration awaits 32 nm technology; it is due in 2011.
Large-Scale Systems Modeling: Networks of QS2x Blades
Peter Altevogt, Tibor Kiss – IBM STG Boeblingen
Wolfgang Denzel – IBM Research Zurich
Miklos Kozlovszky – Budapest Tech
Research objectives
Provide simulation infrastructure for a
• detailed modification analysis of IO subsystems, networks and workloads
• limited modification analysis of processor cores: as workload generators they are treated as black (grey) boxes
• workload characterization based on low-level processor core simulations or measurements

Subtasks
• High-Level Simulation Design of Networks of QS2x Blades
  - System representation
  - Workload representation
• Implementation

Modeled Components
• Workload, as generated by the processor cores
• System components:
  - processor cores* as workload generators and for executing computational delays
  - memory and IO subsystems: bus interfaces, southbridges, network adapter
  - network: switches, router, ...
* without bus interfaces

General Setup: blades connected through the network (...: requests)
High-Level Simulation Design
• Blade system: hardware view – processor cores (Cores0, Cores1), buses (EIB0, EIB1), memories (mem0, mem1), southbridges (SB0, SB1) and a network adapter to/from the network.
• Processor cores:
  - generating requests against the IO subsystem / network
  - executing computational requests in the form of delays

High-Level Simulation Design (2)
• Blade system: detailed simulation view – processor cores (2 chips in the case of QS2x blades) behind an adaptive workload generator, the IO subsystem (mem0/EIB0/SB0, mem1/EIB1/SB1, network adapter), and the network.
• Workload generator: generating requests against the IO subsystem / network
• Processor cores: executing computational requests in the form of delays (a minimal sketch of this scheme follows below)
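A very small sketch of the simulation approach described above (all names and delay values are illustrative assumptions, not taken from the IBM model): cores act as workload generators that alternate computational delays with requests against the IO subsystem/network, and a discrete-event loop simply advances time:

#include <cstdio>
#include <queue>
#include <vector>

// Minimal discrete-event sketch: cores alternate compute delays and IO/network
// requests; the IO subsystem adds its service latency to each request.
struct Event { double time; int core; bool is_io; };
struct Later { bool operator()(const Event& a, const Event& b) const { return a.time > b.time; } };

int main()
{
    const double compute_delay = 2.0e-6;   // s, placeholder workload characterization
    const double io_latency    = 5.0e-6;   // s, placeholder southbridge/network service time
    std::priority_queue<Event, std::vector<Event>, Later> q;

    for (int core = 0; core < 4; ++core)        // 4 cores acting as workload generators
        q.push({compute_delay, core, true});    // first IO request after one compute phase

    const double end = 1.0e-4;                  // simulated time horizon, s
    while (!q.empty() && q.top().time < end) {
        Event e = q.top(); q.pop();
        if (e.is_io)                            // request served by the IO subsystem / network
            q.push({e.time + io_latency, e.core, false});
        else                                    // core resumes computing, then issues the next request
            q.push({e.time + compute_delay, e.core, true});
    }
    std::printf("simulated %.0f microseconds\n", end * 1e6);
    return 0;
}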
Figure 2.2: Evolution of Intel's MP server chipsets
Up to 2005: four Potomac (SC) processors on the preceding NB; DDR/DDR2 memory.
2005/2006: four Paxville MP / Tulsa (DC) processors on the 8500 (Twin Castle) NB; memory attached through 4 XMBs (DDR/DDR2).
2007: four Tigerton (DC/QC) processors on the Clarksboro NB; FBDIMM/DDR2 memory.
2.1 – Intel's multicore MP server processors (2)
Figure 2.3: Quad-socket 7300 (Caneland) motherboard (Supermicro X7QC3) – four Xeon 7200 (DC) / 7300 (QC, Tigerton) sockets on dedicated FSBs to the 7300 NB, FB-DIMM (DDR2) memory up to 192 GB, SBE2 SB.
2.1 – Intel's multicore MP server processors (3)
UP: Opteron 100/1000; DP: Opteron 200/2000; MP: Opteron 800/8000
Figure 2.4: Basic structure of the Opteron family – two cores (CPU0, CPU1), each with a 1 MB L2 cache, connected through the System Request Interface and Crossbar Switch to the integrated memory controller (2 x 72-bit memory channels) and the HyperTransport links (800/8000 series: 3 coherent links; 200/2000 series: 1 coherent link).
2.1 – AMD's multicore MP server processors (1)
Figure 2.5: AMD's 4P/8P Direct Connect server architecture
2.1 – AMD's multicore MP server processors (2)
2.1 – AMD's multicore MP server processors (3)
Figure 2.6: System architecture of Intel's Nehalem processor family (Nov. 17, 2008)
On-die Memory Controller
a) A heterogeneous MCP rather than a symmetrical MCP (as in usual implementations)
The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.
The SPEs
• are optimized to run compute-intensive SIMD apps.,
• operate usually under the control of the PPE,
• run their individual apps. (threads),
• have full access to a coherent shared memory including the memory-mapped I/O space,
• can be programmed in C/C++.
Contrasting the PPE and the SPEs:
• the PPE is more adept at control-intensive tasks and quicker at task switching,
• the SPEs are more adept at compute-intensive tasks and slower at task switching.
Unique features of the Cell BE
Overview of the Cell BE (4)
b) The SPEs have an unusual storage architecture, as
• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
  o they fetch instructions from their private LS and
  o their Load/Store instructions access their LS rather than the main store,
• SPEs access main memory (the effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS, while DMA commands can be batched (up to 16 commands),
• the LS has no associated cache (a sketch of this usage follows below).
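A hedged sketch of how an SPE program typically moves data between main storage and its LS (it assumes the IBM Cell SDK's spu_mfcio.h on the SPU side; the buffer size and tag value are arbitrary):

#include <spu_mfcio.h>

#define TAG 3
static volatile float ls_buf[1024] __attribute__((aligned(128)));  // buffer in the 256 KB LS

// Pull one block from main storage (effective address ea) into the LS,
// process it, then push the results back - all explicit, no cache involved.
void process_block(unsigned long long ea)
{
    mfc_get(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);   // DMA: main store -> LS
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();                        // wait for the DMA to complete

    for (int i = 0; i < 1024; ++i)                    // Load/Store instructions hit the LS only
        ls_buf[i] *= 2.0f;

    mfc_put(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);   // DMA: LS -> main store
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}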
Overview of the Cell BE (5)
Figure: Die shot and floorplan of the Cell BE (221 mm2, 234 mtrs) [15]
3.1 Master/slave multicore processors – The Cell (3)
4. Outlook (1)
Intel's Nehalem (i7) family (Nov. 17, 2008)

Processor    Technology  Aim        Cores  Memory channels
Bloomfield   45 nm       desktop    4      triple-channel DDR3
Beckton      45 nm       MP server  8      quad-channel FB-DIMM (2)
Westmere     32 nm       desktop    4/6    triple-channel DDR3
                         DP server  4/6    quad-channel DDR3
Main features
• Integrated memory controller
• 4/6/8 cores
• Dual-threaded
• FSB replaced by a serial bus (QuickPath Interconnect)
Remarks
• HTX slots are standard interfaces connected directly to an AMD CPU's HyperTransport link. If both of these links are coherent, the device and the CPU can communicate directly with each other with cache coherency. Because of this, latency can be reduced greatly compared to other buses, enabling hardware vendors to begin to create true coprocessor technology once again [36].
• Fusion was announced in Oct. 2006, originally due in 1H 2008.
• Fusion constraints: die size, dissipation and memory bandwidth (Phil Hester: Fusion will never go to the high end due to dissipation). AMD's high-end desktop CPU dies are about 200 mm² (mainstream 120-150 mm², value around 100 mm² or less), while high-end GPU dies exceed 300 mm²; a 45 nm Fusion die can therefore integrate only a GPU core comparable in size to a midrange or value discrete GPU of the 65 nm generation. CPUs use commodity DRAM while GPUs use graphics DRAM (GDDR3/4/5), and their memory data paths differ (8 B vs 32/64 B). Roughly 1 GB/s of memory bandwidth is needed per 10 GFLOPS. Torrenza and Fusion are expected to coexist (high end: Torrenza).
• The 45 nm Fusion processor, initially promised as a 2009 chip and then moved into 2010, is essentially cancelled. The chip, described as combining a CPU and a GPU in the "Shrike" core, was found to bring only modest improvements over today's platforms in terms of power efficiency, cost and performance. Instead, the company will introduce Fusion (which actually isn't called Fusion anymore) as a 2011 model in a 32 nm version with the Llano core. Allen said that 32 nm would be the right technology to introduce the product. Llano will feature four cores, 4 MB of cache, DDR3 memory support and an integrated GPU [34], [35] (Nov. 2008).
Figure: Overview of Intel’s Tick-Tock model and the related MP servers [24]
TICK Pentium 4 /Prescott)
TOCK Pentium 4 /Irwindale) 90nm
11/2005: First DC MP Xeon
1Q/2009
7100 (Tulsa)
7300 (Tigerton QC)
7400 (Dunnington)
7xxx (Beckton)
(Potomac)
7000 (Paxville MP)
(Cransfield)
7200 (Tigerton DC)
2x1 C 1 MB L2/C 16 MB L3
2x2 C 4 MB L2/C
1x6 C 3 MB L2/2C 16 MB L3
1x8 C ¼ MB L2/C 24 MB L3
1x1 C 8 MB L2
2x1 C ½ MB L2/C
1x1 C 1 MB L2
1x2 C 4 MB L2/C
3/2005: First 64-bit MP Xeons