Dezső Sima
Multicore and Manycore Processors
December 2008
Overview and Trends
Overview
1. Overview
2. Homogeneous multicore processors
  2.1 Conventional multicores
  2.2 Manycore processors
3. Heterogeneous multicore processors
  3.1 Master/slave architectures
  3.2 Attached processor architectures
4. Outlook
1. Overview – inevitability of multicores
Figure: Evolution of Intel’s IC fab technology [1]
1. Overview – inevitability of multicores (1)
Shrinking: ~ 0.7x every 2 years
1. Overview – inevitability of multicores (2)
IC fab technology
Shrinking: ~ 0.7x every 2 years
Moore's rule
• the same number of transistors fits on ½ the Si die area, or
• the same die area holds 2x as many transistors every two years
Doubling transistor counts ~ every two years (on the chips)
(2nd formulation, from 1975)
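As a quick arithmetic check of this rule (not part of the original slide), a linear shrink factor of ~0.7 every two years halves the area per transistor, which is exactly the doubling of the transistor budget on a fixed die area:

$$N(t) \approx N_0 \cdot 2^{\,t/(2\ \text{years})}, \qquad 0.7^2 \approx 0.49 \approx \tfrac{1}{2}$$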
Utilization of the surplus transistors?
Wider processor width: pipeline (issue width 1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
Doubling transistor counts ~ every two years
1. Overview – inevitability of multicores (3)
1. Overview – inevitability of multicores (4)
Figure: Parallelism available in applications [2]
Available parallelism in general purpose apps: ~ 4-5
Utilization of the surplus transistors?
• Wider processor width: pipeline (issue width 1) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)
Doubling transistor counts ~ every two years
1. Overview – inevitability of multicores (5)
The best use of surplus transistors is: multiple cores
The inevitability of multicore processors
Increasing transistor count → diminishing returns in performance
1. Overview – inevitability of multicores (6)
with doubling of core numbers ~ every two years
Figure: Spreading Intel’s multicore processors [3]
1. Overview – inevitability of multicores (7)
1. Overview – inevitability of multicores (8)
Figure 1.1: Main classes of multicore/manycore processors

Multicore processors
• Homogeneous multicores
  - Conventional multicores: 2 ≤ n ≤ 8 cores, general purpose computing (desktops, servers)
  - Manycore processors: > 8 cores, prototypes / experimental systems
• Heterogeneous multicores
  - Master/slave architectures (MPC): MM/3D/HPC, production stage
  - Add-on architectures (CPU + GPU): HPC, near future
2. Homogeneous multicores
2.1 Conventional multicores•
2.2 Manycore processors•
2. Homogeneous multicores
Figure 2.1: Main classes of multicore/manycore processors
2.1 Conventional multicores
Multicore MP servers•
Intel’s multicore MP servers•
AMD’s multicore MP servers•
2.1 Intel’s multicore MP servers (1)
Figure 2.1.1: Intel’s Tick-Tock development model [13]
The evolution of Intel’s basic microarchitecture
Figure 2.1.2: Overview of Intel’s Tick-Tock model and the related MP servers [24]
90 nm:
  TICK – Pentium 4 (Prescott):   1x1 C, 8 MB L2               (Potomac)
  TOCK – Pentium 4 (Irwindale):  2x1 C, ½ MB L2/C             7000 (Paxville MP)
                                 1x1 C, 1 MB L2               (Cransfield)
65 nm:                           2x1 C, 1 MB L2/C, 16 MB L3   7100 (Tulsa)
                                 1x2 C, 4 MB L2/C             7200 (Tigerton DC)
                                 2x2 C, 4 MB L2/C             7300 (Tigerton QC)
45 nm:                           1x6 C, 3 MB L2/2C, 16 MB L3  7400 (Dunnington)
                                 1x8 C, ¼ MB L2/C, 24 MB L3   7xxx (Beckton, 1Q/2009)
3/2005: First 64-bit MP Xeons    11/2005: First DC MP Xeon
2.1 Intel’s multicore MP servers (2)
Intel’s Tick-Tock model for MP servers
Figure 2.1.3: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)
System architecture (before Potomac): four single-core Xeon MPs¹ attached to the preceding NBs; I/O hub link typically HI 1.5 (266 MB/s).
¹ Xeon MPs before Potomac
2.1 Intel's multicore MP servers (3)
Figure 2.1.4: Intel's Xeon-based MP server platforms

Truland platform
  MP chipsets: 8500 (Twin Castle, 3/2005) – 2x FSB 667 MT/s, 4 x XMB (2 x DDR2), 32 GB
               8501 (4/2006) – 2x FSB 800 MT/s, 4 x XMB (2 x DDR2), 32 GB
  MP cores (P4-based):
    Xeon MP (Potomac SC, 3/2005): 90 nm / 675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604
    Xeon 7000 (Paxville MP DC, 11/2005): 90 nm / 2x169 mtrs, 2x1 (2) MB L2, 800/667 MT/s, mPGA 604
    Xeon 7100 (Tulsa DC, 8/2006): 65 nm / 1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
2.1 Intel’s multicore MP servers (4)
First 64-bit server
Figure 2.1.5: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)
Up to 2005: four single-core Xeon MPs¹ on the preceding NBs; I/O hub link typically HI 1.5 (266 MB/s).
2005 (Truland): four Potomac² / Paxville MP³ (SC/DC) processors on the 8500/8501 (Twin Castle) NB; memory attached through 4 XMBs (External Memory Bridges) over serial links; 28 PCIe lanes + HI 1.5 to the I/O hub.
¹ Xeon MPs before Potomac
² First x86-64 MP processor
³ The 8500 also supports Cransfield (SC) and Tulsa (DC)
2.1 Intel's multicore MP servers (5)
Figure 2.1.6: Intel's Xeon-based MP server platforms

Truland platform
  MP chipsets: 8500 (Twin Castle, 3/2005) – 2x FSB 667 MT/s, 4 x XMB (2 x DDR2), 32 GB
               8501 (4/2006) – 2x FSB 800 MT/s, 4 x XMB (2 x DDR2), 32 GB
  MP cores (P4-based):
    Xeon MP (Potomac SC, 3/2005): 90 nm / 675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604
    Xeon 7000 (Paxville MP DC, 11/2005): 90 nm / 2x169 mtrs, 2x1 (2) MB L2, 800/667 MT/s, mPGA 604
    Xeon 7100 (Tulsa DC, 8/2006): 65 nm / 1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
Caneland platform
  MP chipset: 7300 (Clarksboro, 9/2007) – 4x FSB 1066 MT/s, 4 x FBDIMM (DDR2), 512 GB
  MP cores:
    Xeon 7200 (Tigerton DC, 9/2007): Core2-based, 65 nm / 2x291 mtrs, 2x4 MB L2, 1066 MT/s, mPGA 604
    Xeon 7300 (Tigerton QC, 9/2007): Core2-based, 65 nm / 2x291 mtrs, 2x(4/3/2) MB L2, 1066 MT/s, mPGA 604
    Xeon 7400 (Dunnington 6C, 9/2008): Core2-based, 45 nm / 1900 mtrs, 9/6 MB L2, 16/12/8 MB L3, 1066 MT/s, mPGA 604
2.1 Intel’s multicore MP servers (6)
Figure 2.1.7: Evolution of Intel’s Xeon MP-based system architecture
(until the appearance of Nehalem)
Up to 2005: four single-core Xeon MPs¹ on the preceding NBs; I/O hub link typically HI 1.5 (266 MB/s).
2005 (Truland): four Potomac² / Paxville MP³ (SC/DC) processors on the 8500/8501 (Twin Castle) NB; memory attached through 4 XMBs; 28 PCIe lanes + HI 1.5 to the I/O hub.
2007 (Caneland): four Tigerton / Dunnington (DC/QC/6C) processors, each on its own FSB to the 7300 (Clarksboro) NB; FB-DIMM (DDR2) memory; 28 PCIe lanes + ESI to the I/O hub.
¹ Xeon MPs before Potomac
² First x86-64 MP processor
³ The 8500 also supports Cransfield (SC) and Tulsa (DC)
2.1 Intel's multicore MP servers (7)
2.1 Intel’s multicore MP servers (8)
Figure 2.1.8: Nehalem’s key innovations concerning the system architecture [22]
Nehalem’s key innovations concerning the system architecture (11/2008)
2.1 Intel’s multicore MP servers (9)
Figure 2.1.9: Nehalem’s key innovations concerning the system architecture [22]
Nehalem’s key innovations concerning the system architecture (11/2008)
2.1 Intel's multicore MP servers (10)
Figure 2.1.10: Intel's Nehalem based MP server architecture – four Beckton (8C) processors interconnected by QPI links; 4x FB-DIMM memory.
QPI: QuickPath Interconnect
11/2008: Nehalem
AMD’s multicore MP servers•
2.1 AMD’s multicore MP servers (1)
AMD Direct Connect Architecture (2003)
• Integrated Memory Controller
• Serial HyperTransport links
Figure 2.1.11: AMD’s Direct Connect Architecture [14]
Remark
• 3 HT 1.0 links at introduction (K8)
• 4 HT 3.0 links with K10 (Barcelona)
Introduced in 2003 along with the x86-64 ISA extension
(Intel: 2008 with Nehalem)
2.1 AMD’s multicore MP servers (2)
Use of available HyperTransport links [44]
UPs
Each link supports connections to I/O devices
DPs
Two links support connections to I/O devices; any one of the three links may connect to another DP or MP processor
MPs
Each link supports connections to I/O devices or other DP or MP processors
(Opteron sockets with registered DDR2 memory per socket, interconnected by HT links, with further HT links to PCI, PCI-X and PCI Express I/O bridges.)
Figure 2.1.12: 2P and 4P server architectures based on AMD’s Direct Connect Architecture [15], [16]
2.1 AMD’s multicore MP servers (3)
Figure 2.1.13: Block diagram of Barcelona (K10) vs K8 [17]
(K10)
2.1 AMD’s multicore MP servers (4)
Figure 2.1.14: Possible use of Barcelona’s four HT 3.0 links [39]
2.1 AMD’s multicore MP servers (5)
Novel features of HT 3.0 links, such as
• higher speed or
• splitting a 16-bit HT link into two 8-bit links,
can be utilized only with a new platform. Current platforms (2nd gen. Socket F with available chipsets) do not support HT 3.0 links [46].
2.1 AMD’s multicore MP servers (6)
Figure 2.1.15: AMD’s roadmap for server processors and platforms [19]
2.1 AMD’s multicore MP servers (7)
2.2 Manycore processors
Figure 2.2.1: Main classes of multicore/manycore processors
2.2 Manycore processors
Intel’s Larrabee•
Intel’s Tiled processor•
Larrabee
Part of Intel’s Tera-Scale Initiative.
Project started ~ 2005First unofficial public presentation: 03/2006 (withdrawn) First brief public presentation 09/07 (Otellini) [29] First official public presentations: in 2008 (e.g. at SIGGRAPH [27])Due in ~ 2009
• Performance (targeted): 2 TFlops
• Brief history:
• Objectives:
Not a single product but a base architecture for a number of different products.High end graphics processing, HPC
2.2 Intel’ Larrabee (1)
Figure 2.2.2: Block diagram of the Larrabee [4]
Basic architecture
• Cores: In order, 4-way multithreaded x86 IA cores, augmented with SIMD-16 capability
• L2 cache: fully coherent
• Ring bus: 1024 bits wide
2.2 Intel’ Larrabee (2)
Figure 2.2.5: Larrabee vs the Pentium [11]
Main extensions
• 64-bit instructions
• 4-way multithreaded (with 4 register sets)
• addition of a 16-wide (16x32-bit) VU
• increased L1 caches (32 KB vs 8 KB)
• access to its 256 KB local subset of a coherent L2 cache
• ring network to access the coherent L2 $ and allow inter-processor communication
2.2 Intel’ Larrabee (3)
Figure 2.2.3: Block diagram of the Vector Unit [5]
The Vector Unit
• VU scatter-gather instructions: load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly.
• Numeric conversions: 8-bit and 16-bit integer and 16-bit FP data can be read from or written into the L1 $, with conversion to 32-bit integers or floats without penalty. The L1 D$ thus becomes an extension of the register file.
• Mask registers: have one bit per vector lane, to control which lanes of a vector register or of memory data are read or written and which remain untouched (see the sketch below).
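As an illustration only (Larrabee's vector ISA was never fully published, so the names below are hypothetical), the gather/scatter and per-lane mask semantics described above can be sketched in plain C++, with the 16-wide vector, the index vector and the mask modeled as arrays:

#include <array>
#include <cstdint>

// Hypothetical model of a 16-wide masked gather: for every lane whose mask bit
// is set, load base[idx[lane]] into the destination; lanes with a cleared mask
// bit keep their previous contents (merging semantics).
using Vec16  = std::array<float, 16>;
using Idx16  = std::array<int32_t, 16>;
using Mask16 = uint16_t;               // one bit per vector lane

Vec16 masked_gather(const float* base, const Idx16& idx, Mask16 mask, Vec16 dst)
{
    for (int lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            dst[lane] = base[idx[lane]];   // non-contiguous loads
    return dst;
}

// Masked scatter: only lanes with a set mask bit are written back to memory.
void masked_scatter(float* base, const Idx16& idx, Mask16 mask, const Vec16& src)
{
    for (int lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            base[idx[lane]] = src[lane];
}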
2.2 Intel’ Larrabee (4)
Figure 2.2.4: Layout of the 16-wide vector ALU [5]
• ALUs execute integer, SP and DP FP instructions
• Multiply-add instructions are available.
ALUs
2.2 Intel’ Larrabee (5)
Figure 2.2.6: System architecture of a Larrabee based 4-processor MP server [6]
2.2 Intel’ Larrabee (6)
CSI: Common Systems Interface (Serial packet-based bus)
2.2 Intel’ Larrabee (7)
Programming of Larrabee [5]
• Larrabee has x86 cores with an unspecified ISA extension,
2.2 Intel’ Larrabee (8)
Figure 2.2.7: Intel’s ISA extensions [11]
AES: Advanced Encryption Standard
AVX: Advanced Vector Extension
FMA: FP fused multiply-add instr. supporting 256-bit/128-bit SIMD
2.2 Intel’ Larrabee (9)
Programming of Larrabee [5]
• Larrabee has x86 cores with an unspecified ISA extension,
• the x86 cores allow Larrabee to be programmed like usual x86 processors, using enhanced C/C++ compilers from MS, Intel, GCC etc.,
• this is a huge advantage compared to the competition (Nvidia, AMD/ATI),
Intel’s Tiled processor•
Tiled Processor
• First implementation of Intel's Tera-Scale Initiative (among more than 100 projects)
• Aim: Tera-Scale research chip
  - high-bandwidth interconnect
  - energy management
  - programming manycore processors
• Milestones of the development:
  - Announced at IDF Fall 2006 (9/2006)
  - Details at ISSCC 2007 (2/2007)
  - Due in 2009/2010
2.2 Intel’s Tiled processzor (1)
Remark
Based on ideas of the Raw processor (MIT)
Figure 2.2.8: Basic structure of the Tiled Processor [7]
2.2 Intel’s Tiled processzor (2)
2 single precision FP (Multiply-Add)
Figure 2.2.9: Block diagram of a tile [7], [9]
VLIW microarchitecture?
2.2 Intel’s Tiled processzor (3)
(For debugging)
SP FP cores
2.2 Intel’s Tiled processzor (4)
Figure 2.2.10: Die shot of the Tiled Processor [8]
Figure 2.2.13: Ring based interconnect network topology [7]
2.2 Intel’s Tiled processzor (5)
Figure 2.2.14: Mesh interconnect topology [7]
2.2 Intel’s Tiled processzor (6)
Figure 2.2.11: Integration of dedicated hardware units (accelerators) [7]
2.2 Intel’s Tiled processzor (7)
2.2 Intel’s Tiled processzor (8)
Figure 2.2.12: Sleeping inactivated cores [7]
2.2 Intel’s Tiled processzor (9)
Figure 2.2.15: Performance figures of the Tiled Processor [7]
Matrix multiplication (Single Precision)
Peak performance
4 SP FP/cycle
at 4 GHz:
1.6 TFLOPS
3. Heterogeneous multicores
3.1 Master/slave architectures
3.2 Attached architectures
3. Heterogeneous multicores
Figure 3.1: Main classes of multicore processors
3.1 Master/slave architectures
The Cell BE•
3.1 The Cell BE (1)
Computational model
Master/slave computational model with cacheless private memory spaces (LSs)
• The master/slave computational model
  - allows tasks to be delegated to dedicated, task-efficient units,
  - needs efficient mechanisms for
    • transferring the tasks (programs and data) from the master to the slaves and the results back from the slaves to the master,
    • synchronization between the master and the slaves,
    • inter-core communication and synchronization.
• Cacheless private memory spaces
  - allow efficient utilization of the die area for computations,
  - need an efficient LS-based microarchitecture for the slaves.
Performance @ 3.2 GHz:
QS21 peak performance (SP FP): 409.6 GFlops (3.2 GHz x 2x8 SPEs x 2x4 SP FP operations/cycle)
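The 409.6 GFlops figure follows directly from the quoted datapath widths (an arithmetic check, no additional data assumed): each SPE issues a 4-wide single-precision multiply-add per cycle (8 flops), and a QS21 blade carries two Cell chips with 8 SPEs each:

$$3.2\ \text{GHz} \times (2 \times 8)\ \text{SPEs} \times (2 \times 4)\ \text{flops/cycle} = 409.6\ \text{GFlops (SP)}$$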
3.1 The Cell BE (2)
3.1 The Cell BE (3)
Figure 3.1.2: Cell roadmap from 2007 [22]
3.2 Attached architectures
Figure 3.2.1: Main classes of multicore/manycore processors
3.2 Attached architectures
Introduction to GPGPUs•
The SIMT computational model (CM)•
Recent implementations of the SIMT CM•
Intel’s future processors with attached architecture•
AMD’s future processors with attached architecture•
Introduction to GPGPUs•
Figure 3.2.2: Evolution of the microarchitecture of GPUs [23]
3.2 Introduction to GPGPUs (1)
Evolution of the microarchitecture of GPUs
Figure 3.2.3: Simplified block
diagram of AMD/ATI’s RV770 [24]
160 cores x 5 execution units
3.2 Introduction to GPGPUs (2)
Figure 3.2.4: Simplified structure of a core of the RV770 GPGPU [24]
Execution units (Stream Processing Units)
• 32-bit FP (ADD, MUL, MADD) • 64-bit FP • 32-bit FX . . .
3.2 Introduction to GPGPUs (3)
3.2 Introduction to GPGPUs (4)
Figure 3.2.5: Peak SP FP performance figures: Nvidia's GPUs vs Intel's CPUs [25]
3.2 Introduction to GPGPUs (5)
Figure 3.2.6: Bandwidth figures: Nvidia’s GPUs vs Intel’s CPUs [GB/s] [25]
Not cached
Figure 3.2.7: Utilization of the die area in CPUs vs GPUs [25]
3.2 Introduction to GPGPUs (6)
Based on their FP32 computing capability and the large number of execution units available
GPUs with unified shader architecture are prospective candidates for speeding up HPC!
GPUs with unified shader architectures also termed as
GPGPUs
(General Purpose GPUs)
3.2 Introduction to GPGPUs (7)
For HPC computations: the SIMT (Single Instruction Multiple Threads) computational model
Use of GPUs for HPC
The SIMT computational model (CM)•
Main alternatives of data parallel execution
Data parallel execution
SIMD execution SIMT execution
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors
• One/two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices)
Figure 3.2.8: Main alternatives of data parallel execution
3.2 The SIMT computational model (1)
Scalar execution SIMD execution SIMT execution
Domain of execution:single data elements
Domain of execution:elements of vectors
Domain of execution:elements of matrices
(at the programming level)
Figure 3.2.9: Scope of the data parallel execution vs scalar execution (at the programming level)
Remarks
1. SIMT execution is also termed as SPMD (Single-Program Multiple-Data) execution (Nvidia)2. At the processor level two dimensional domains of execution can be mapped to any set of cores (e.g. to a line of cores).
3.2 The SIMT computational model (2)
Main alternatives of data parallel execution
Data parallel execution
SIMD execution SIMT execution
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors
• One/two dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices)
E.g. 2nd and 3rd generation superscalars
GPGPUs, data parallel accelerators
Figure 3.2.10: Main alternatives of data parallel execution
• is massively multithreaded, and provides
• data dependent flow control as well as
• barrier synchronization
(a sketch follows below)
3.2 The SIMT computational model (3)
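A minimal sketch of the SIMT idea in plain C++ (illustrative only, not any vendor's actual execution model; the group width is an assumption): every logical thread of the 2-D domain runs the same kernel, threads are processed in fixed-width groups in lock step, and partial groups are handled with an active mask:

#include <cstdio>
#include <vector>

constexpr int GROUP = 8;   // lock-step width of one thread group (assumed value)

// The "kernel" each logical thread executes; tid identifies its data element.
float kernel(int tid, const std::vector<float>& in)
{
    float v = in[tid];
    return (v < 0.0f) ? -v : v * 2.0f;   // data-dependent flow control per thread
}

int main()
{
    const int nx = 4, ny = 8;                       // 2-D domain of execution
    std::vector<float> in(nx * ny, -1.5f), out(nx * ny);

    // One logical thread per data element, executed group by group in lock step.
    for (int base = 0; base < nx * ny; base += GROUP)
        for (int lane = 0; lane < GROUP; ++lane) {
            int tid = base + lane;
            bool active = tid < nx * ny;            // active mask for partial groups
            if (active) out[tid] = kernel(tid, in);
        }
    // implicit barrier: all groups have finished before the results are used
    std::printf("out[0] = %f\n", out[0]);
    return 0;
}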
Recent implementations of the SIMT CM•
Basic implementation alternatives of the SIMT execution
GPGPUs Data parallel accelerators
Dedicated units supporting data parallel execution
with appropriate programming environment
Programmable GPUs with appropriate
programming environments
E.g. Nvidia’s 8800 and GTX linesAMD’s HD 38xx, HD48xx lines
Nvidia’s Tesla linesAMD’s FireStream lines
Have display outputs No display outputsHave larger memories than GPGPUs
Figure 3.2.12: Basic implementation alternatives of the SIMT execution
3.2 Recent implementations of the SIMT CM (1)
GPGPUs
Figure 3.2.13: GPGPU families of Nvidia and AMD/ATI
• Nvidia's line: G80 (90 nm) → shrink → G92 (65 nm) → enhanced arch. → G200 (65 nm)
• AMD/ATI's line: R600 (80 nm) → shrink → RV670 (55 nm) → enhanced arch. → RV770 (55 nm)
3.2 Recent implementations of the SIMT CM (2)
Figure 3.2.14: Overview of GPGPUs

Nvidia
  Cores: G80 (11/06, 90 nm / 681 mtrs) → G92 (10/07, 65 nm / 754 mtrs) → GT200 (6/08, 65 nm / 1400 mtrs)
  Cards: 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit), 8800 GT (112 ALUs, 256-bit),
         GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
  CUDA:  Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08)
AMD/ATI
  Cores: R500 (11/05, Xbox, 48 ALUs) → R600 (5/07, 80 nm / 681 mtrs) → RV670 (11/07, 55 nm / 666 mtrs) → RV770 (5/08, 55 nm / 956 mtrs)
  Cards: HD 2900XT (320 ALUs, 512-bit), HD 3850 (320 ALUs, 256-bit), HD 3870 (320 ALUs, 256-bit),
         HD 4850 (800 ALUs, 256-bit), HD 4870 (800 ALUs, 256-bit)
  SW:    Brook+ (11/07), RapidMind (3870 support: 6/08)
3.2 Recent implementations of the SIMT CM (3)
Implementation alternatives of data parallel accelerators
• On-card implementation (recent implementations): e.g. GPU cards – Nvidia's Tesla and AMD/ATI's FireStream accelerator families
Figure 3.2.15: Implementation alternatives of data parallel accelerators
Data parallel accelerators
3.2 Recent implementations of the SIMT CM (4)
Figure 3.2.16: Overview of Nvidia's Tesla family

Card:      C870 (6/07): G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
           C1060 (6/08): GT200-based, 4 GB GDDR3, 0.936 TFLOPS
Desktop:   D870 (6/07): G80-based, 2 x C870 incl., 3 GB GDDR3, 1.037 TFLOPS
1U Server: S870 (6/07): G80-based, 4 x C870 incl., 6 GB GDDR3, 2.074 TFLOPS
           S1070 (6/08): GT200-based, 4 x C1060, 16 GB GDDR3, 3.744 TFLOPS
CUDA:      Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)
3.2 Recent implementations of the SIMT CM (5)
Figure 3.2.17: Overview of AMD/ATI's FireStream family

Card: FireStream 9170 (announced 11/07, shipped 6/08): RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
      FireStream 9250 (announced 6/08, shipped 10/08): RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64
SW:   Stream Computing SDK Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); RapidMind
3.2 Recent implementations of the SIMT CM (6)
Implementation alternatives of data parallel accelerators
• On-card implementation (recent implementations): e.g. GPU cards – Nvidia's Tesla and AMD/ATI's FireStream accelerator families
• On-die integration (future implementations): e.g. Intel's Havendale, AMD's Fusion integration technology
Trend: from on-card implementation towards on-die integration
Figure 3.2.15: Implementation alternatives of data parallel accelerators
Data parallel accelerators
3.2 Recent implementations of the SIMT CM (7)
Figure 3.2.18: Expected evolution of attached GPGPUs [42]
Integration to the chip
Intel’s future processors with attached architecture•
3.2 Intel’s future processors with attached architecture (1)
Figure 3.2.19: Intel’s desktop roadmap [26]
(Pentium 4 → Core 2 → Core i7 (Nehalem); Q4/08)
3.2 Intel’s future processors with attached architecture (2)
Figure 3.2.20: A part of Intel’s desktop roadmap [26]
(Q4/08 – Q3/09, 45 nm)
AMD’s future processors with attached architecture•
3.2 AMD’s future processors with attached architecture (1)
Figure 3.2.21: AMD’s view about the major phases of processor evolution [27]
6/2006 The Torrenza initiative (2006 Technology Analyst Day)
• Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40].
3.2 AMD’s future processors with attached architecture (4)
Figure 3.2.22: Introduction of the Torrenza platform level integration technique [40]
(cache coherent HT)
3.2 AMD’s future processors with attached architecture (3)
6/2006 The Torrenza initiative (2006 Technology Analyst Day)
• Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40].
10/2006 Acquisition of ATI
10/2006 The Fusion initiative
• Silicon-level integration of accelerators into AMD processors (first Fusion processors due by the end of 2008 / early 2009) [41]
3/2007 “Integration” of the Torrenza and the Fusion initiatives into a continuum of accelerated computing solutions
3.2 AMD’s future processors with attached architecture (4)
Figure 3.2.23: The Torrenza platform and the Fusion integration technologyas a continuum for accelerated computing solutions [29]
Remark: It is based on an earlier Alienware presentation from 6/2006 [38].
3.2 AMD’s future processors with attached architecture (5)
Implementation of Fusion processors
• 2007/2008 AMD made a number of confusing announcements and withdrawals [31] – [35].
• According to the latest announcements (11/2008) AMD plans to introduce 32 nm Fusion processors only in 2011 [37].
3.2 AMD’s future processors with attached architecture (6)
Figure 3.2.24: AMD’ 2008 roadmap for client processors [37]
3.2 AMD’s future processors with attached architecture (7)
4. Outlook
4. Outlook (1)
Outlook
Heterogeneous multicores
Master/slavearchitectures
Add-onarchitectures
Master/slave architectures: 1(Ma):M(S) → 2(Ma):M(S) → M(Ma):M(S)
Add-on architectures: 1(CPU):1(D) → M(CPU):1(D) → M(CPU):M(D)
Ma: Master, S: Slave, D: Dedicated (like GPU), H: Homogeneous, M: Many
Both lines converge: M(Ma) = M(CPU), with M(S) / M(D) attached cores
Figure 4.1: Expected evolution of heterogeneous multicore processors
The future of heterogeneous multicores
4. Outlook (2)
Heterogeneous multicores: M(CPU):M(D)
The future of homogeneous multicores
Larrabee
Tiled processor
In fact: both are of the same type: M(CPU):M(D)
4. Outlook (3)
Figure 4.2: Simplified block diagrams of Larrabee and the Tiled processor [4], [7]
4. Outlook (4)
The main road of processor evolution M(CPU):M(D)
Thank you for your attention!
5. References
[1]: Bhandarkar D., „The Dawn of a New Era”, 11. EMEA, May 2006, Budapest,
[2]: Wall D. W., "Limits of ILP," WRL, TN-15, Dec. 1990, DEC, http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-15.html
[3]: Loktu A.,”Itanium 2 for Enterprise Computing,” http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps
[4]: Stokes J., Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee- intels-biggest-leap-ahead-since-the-pentium-pro.html
[5]: Shimpi A. L. & Wilson D., "Intel's Larrabee Architecture Disclosure: A Calculated First Move," Anandtech, Aug. 4, 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2
[6]: Timm J.-F., “Larrabee: Fakten zur Intel Highend-Grafikkarte,” Computer Base, 2. Juni 2007, http://www.computerbase.de/news/hardware/grafikkarten/2007/juni/larrabee_fakten_ intel_highend-grafikkarte/
[7]: Shrout R., “Intel’s 80 Core Terascale Chip Explored: 4 GHz Clocks and more,” PC Perspective, Feb. 11. 2007, http://www.pcper.com/article.php?aid=363
[8]: Goto H., “Intel’s Manycore CPUs,” PC Watch, June 11. 2007, http://pc.watch.impress.co.jp/docs/2007/0611/kaigai364.htm
[9]: Hoskote Y. & al., “A 5-GHz Mesh Interconnect for a Teraflops Processor,” IEEE Micro, Sept./Oct. 2007, (Vol. 27 No. 5), pp. 51-61
[10]: Taylor M. & al., “The Raw Processor,” Hot Chips Aug. 13. 2001, http://www.hotchips.org/archives/hc13/3_Tue/22mit.pdf
[11]: Goto H., "Larrabee architecture can be integrated into CPU," PC Watch, Oct. 06, 2008, http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[12]: Stokes J., “Larrabee: Intel's biggest leap since the Pentium Pro,” Ars Technica, Aug. 04, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels- biggest-leap-ahead-since-the-pentium-pro.html
[13]: Singhal R., "Next Generation Intel Microarchitecture (Nehalem) Family: Architecture Insight and Power Management," IDF Taipei, Oct. 2008, http://intel.wingateweb.com/taiwan08/published/sessions/TPTS001/FA08%20IDF -Taipei_TPTS001_100.pdf
[14]: AMD Opteron™ Processor for Servers and Workstations, http://amd.com.cn/CHCN/Processors/ProductInformation/0,,30_118_8826_8832, 00-1.html
[15]: AMD Opteron Processor with Direct Connect Architecture, 2P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/2P_Power_PID_41497.pdf
[16]: AMD Opteron Processor with Direct Connect Architecture, 4P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/4P_Power_PID_41498.pdf
[17]: Kanter D., “Inside Barcelona: AMD's Next Generation, Real World Tech., May 16. 2007, http://www.realworldtech.com/page.cfm?ArticleID=RWT051607033728
[18]: Kanter D,, “AMD's K8L and 4x4 Preview, Real World Tech. June 02. 2006, http://www.realworldtech.com/page.cfm?ArticleID=RWT060206035626&p=1
[19]: Enderle R., "AMD Shanghai: We are back!," TGDaily, November 13, 2008, http://www.tgdaily.com/content/view/40176/128/
[20]: Gschwind M., "Chip Multiprocessing and the Cell BE," ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[21]: Wright C., Henning P., Bergen B., “Roadrunner Tutorial – An Introduction to Roadrunner and the Cell Processor,” Febr. 7 2008, http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Roadrunner-tutorial-session-1-web1.pdf
[22]: Hofstee H. P., “Industry trends in Microprocessor Design,”, IBM, Oct. 4 2007, http://lanl.gov/orgs/hpc/roadrunner/rrinfo/RR%20webPDFs/Cell_Hofstee_Non_Conf.pdf
[23]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[24]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[25]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia
[26]: Goto H., “Intel Desktop CPU Roadmap,” 2008, http://pc.watch.impress.co.jp/docs/2008/0326/kaigai02.pdf
[27]: The Industry-Changing Impact of Accelerated Computing – Fusion White Paper, AMD, 2008, http://www.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf
[28]: AMD Announces Initiatives To Elevate AMD64 As Platform For System- And Industry-Wide Innovation, AMD, June 1, 2006, http://www.amd.com/us-en/Weblets/0,,7832_8366_5730~109409,00.html
[29]: Metal G., “AMD Torrenza and Fusion together,” Metalghost, March 22 2007, http://www.metalghost.ro/index.php?view=article&catid=30%3Ahardware&id =233%3Aamd-torrenza-and-fusion-together&option=com_content
[30]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[31]: Hester P., 2007 Technology Analyst Day, AMD, July 26 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007 _AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf
[32]: Rivas M., 2007 Financial Analyst Day, AMD, Dec. 13. 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007 _AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf
[33]: Smalley T., “Shrike is AMD’s First Fusion Platform,”, Trusted Reviews, June 9. 2008, http://www.trustedreviews.com/notebooks/news/2008/06/09/Shrike-Is-AMDs- First-Fusion-Platform/p1
[34]: Hruska J., “AMD Fusion now pushed back to 2011,” Ars Technica, Nov. 14. 2008, http://arstechnica.com/news.ars/post/20081114-amd-fusion-now-pushed-back-to -2011.html
[35]: Gruener W., “AMD delays Fusion processor to 2011,” TgDaily, Nov. 13. 2008, http://www.tgdaily.com/content/view/40186/135
[36]: Wilson D., “AMD Analyst Day Platform Announcements,” Anandtech, June 2. 2006, http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2768&p=2
[37]:Allen R., Financial Analyst Day, AMD, Nov. 13. 2008, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/RandyAllen AMD2008AnalystDay11-13-2008.pdf
[38]: Gonzales N., 2006 Technology Analyst Day, Alienware, June 1. 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHester AMDAnalystDayV2.pdf
[39]: Hester P., 2006 Technology Analyst Day, AMD, June 1. 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHester AMDAnalystDayV2.pdf
[40]: Seyer M., 2006 Technology Analyst Day, AMD, June 1. 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MartySeyer AMDAnalystWebv3.pdf
[41]: AMD Completes ATI Acquisition and Creates Processing Powerhouse , Oct. 25. 2006, http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543 ~113741,00.html
[42]: Stokes J., “A closer look at AMD’s CPU/GPU Fusion,” Ars Technica, Nov. 19. 2006, http://arstechnica.com/news.ars/post/20061119-8250.html
Figure: AMD's Tick-Tock model and the related Opteron MP servers
130 nm: 840-850 (Sledgehammer) – 1x1 C, 1 MB L2
90 nm: 842-856 (Athens), 865-890 (Egypt), 82xx (Santa Rosa)
65 nm: 8347-56 (Barcelona) – 1x4 C, 1/2 MB L2/C, 2 MB L3
45 nm: 8378-84 (Shanghai)
Figure: Larrabee’s Software stack [12]
Figure: Layout of MIT’s Raw Processor [10]
3.1 The Cell BE (1)
Cell BE
• Joint development of Sony, IBM and Toshiba
• Aim: games, multimedia, and in addition HPC
Summer 2000: basic decisions concerning the architecture
11/2006: Playstation 3 (PS3)
QS2x Blade Server family: 02/2006 Cell Blade QS20, 08/2007 Cell Blade QS21, 05/2008 Cell Blade QS22
Rumors (9/2008): 2011? Playstation 4 (competition: XBox3) – 2x PS3 performance, 12 cores / 45 nm, GDDR3/DDR3 (instead of XDR)
EIB: Element Interconnect Bus
Figure 3.1.1: Block diagram of the Cell BE [20]
SPE: Synergistic Processing Element
SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store (256 KB)
SMF: Synergistic Memory Flow Unit
PPE: Power Processing Element
PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
XDR: Rambus DRAM
3.1 The Cell BE (3)
Figure: Layout of the EIB [21]
3.1 Master/slave multicore processors – The Cell (4)
Figure: Concurrent data transfers over the EIB [21]
3.1 Master/slave multicore processors – The Cell (5)
Massive multithreading
Multithreading is implemented by
creating and managing parallel executable threads for each data element of the execution domain.
Figure: Threads allocated to the elements of an execution domain
Same instructions for all data elements
3.2 Attached multicore processors (10)
Figure 3.2.11: Per-thread contexts needed per ALU for fast context switching
(A SIMT core: a Fetch/Decode unit feeding an array of ALUs; each ALU has a register file (RF) holding many per-thread contexts (CTX), one of which is the actual context; a context switch simply selects another stored context – see the sketch after this slide.)
3.2 The SIMT computational model (4)
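A sketch of the idea behind per-ALU context stores (sizes and field names are illustrative assumptions, not taken from the referenced figure): because every resident thread keeps its own register slice, "switching" to another thread is only an index change, so a long memory latency can be hidden by issuing from a different ready thread:

#include <array>
#include <cstdint>

// Hypothetical per-ALU context store: each resident thread owns a register
// slice, so a context switch is just a change of the current index.
struct ThreadContext {
    std::array<float, 32> regs{};   // per-thread registers
    uint32_t pc = 0;                // per-thread program counter
    bool     waiting_on_memory = false;
};

struct SimtAlu {
    std::array<ThreadContext, 24> ctx{};  // many resident contexts per ALU
    int current = 0;

    // "Zero-cost" switch: pick the next context that is not stalled on memory;
    // if all are stalled, the current context is kept.
    ThreadContext& next_ready()
    {
        const int n = static_cast<int>(ctx.size());
        for (int i = 1; i <= n; ++i) {
            int cand = (current + i) % n;
            if (!ctx[cand].waiting_on_memory) { current = cand; break; }
        }
        return ctx[current];
    }
};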
Figure: Early vision of the integration of CPUs and GPUs (presented by Alienware, a performance PC maker) [38]
Alienware’s early vision on the integration of CPUs and GPUs (6/2006)
Figure: AMD’s view about the evolution of mainstream computing [30]
5.2 AMD/ATI’s GPGPU line (1)
Figure: AMD’s planned 32 nm mobile/mainstream Falcon Fusion family [31]
(32 nm brand new core)
AMD’s plans to implement Fusion class processors
The 32 nm Falcon processor with the Bulldozer CPU core (7/2007: Technology Analyst Day)
(UVD: Unified Video Decoder)
The 45 nm Swift processor family (12/2007: Financial Analyst Day)
Figure: AMD’s planned 45 nm Swift Fusion processor family [32]
(K10)
Figure: AMD’s planned 45 nm Shrike mobile platform with the Swift processor [33]
The 45 nm Shrike platform with the Swift processor (6/2008)
Nov. 2008 (Financial Analyst Day):
AMD cancelled both the 45 nm Shrike platform and the Swift processor [34], [35]
Reason
The 45 nm implementation would result only in modest improvements in performance, power and cost.
Recent plan
• The planned CPU/GPU integration awaits 32 nm technology; it is due in 2011.
Large-Scale Systems Modeling: Networks of QS2x Blades
Peter Altevogt, Tibor Kiss – IBM STG Boeblingen
Wolfgang Denzel – IBM Research Zurich
Miklos Kozlovszky – Budapest Tech
Research objectives
Provide simulation infrastructure for a
• detailed modification analysis of IO subsystems, networks and workloads
• limited modification analysis of processor cores: as workload generators they are treated as black (grey) boxes
• workload characterization based on low-level processor core simulations or measurements

Subtasks
• High-Level Simulation Design of Networks of QS2x Blades
  - System representation
  - Workload representation
• Implementation

Modeled Components
• Workload, as generated by the processor cores
• System components:
  - processor cores* as workload generators and for executing computational delays
  - memory and IO subsystems: bus interfaces, southbridges, network adapter
  - network: switches, router, ...
* without bus interfaces

General Setup: blades connected through the network (...: requests)
High-Level Simulation Design
• Blade system: hardware view – processor cores (Cores0, Cores1), buses (EIB0, EIB1), memories (mem0, mem1), southbridges (SB0, SB1) and a network adapter to/from the network.
• Processor cores:
  - generating requests against the IO subsystem / network
  - executing computational requests in the form of delays

High-Level Simulation Design (2)
• Blade system: detailed simulation view – processor cores (2 chips in the case of QS2x blades) behind an adaptive workload generator, the IO subsystem (mem0/EIB0/SB0, mem1/EIB1/SB1, network adapter), and the network.
• Workload generator: generating requests against the IO subsystem / network
• Processor cores: executing computational requests in the form of delays (a minimal sketch of this scheme follows below)
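A very small sketch of the simulation approach described above (all names and delay values are illustrative assumptions, not taken from the IBM model): cores act as workload generators that alternate computational delays with requests against the IO subsystem/network, and a discrete-event loop simply advances time:

#include <cstdio>
#include <queue>
#include <vector>

// Minimal discrete-event sketch: cores alternate compute delays and IO/network
// requests; the IO subsystem adds its service latency to each request.
struct Event { double time; int core; bool is_io; };
struct Later { bool operator()(const Event& a, const Event& b) const { return a.time > b.time; } };

int main()
{
    const double compute_delay = 2.0e-6;   // s, placeholder workload characterization
    const double io_latency    = 5.0e-6;   // s, placeholder southbridge/network service time
    std::priority_queue<Event, std::vector<Event>, Later> q;

    for (int core = 0; core < 4; ++core)        // 4 cores acting as workload generators
        q.push({compute_delay, core, true});    // first IO request after one compute phase

    const double end = 1.0e-4;                  // simulated time horizon, s
    while (!q.empty() && q.top().time < end) {
        Event e = q.top(); q.pop();
        if (e.is_io)                            // request served by the IO subsystem / network
            q.push({e.time + io_latency, e.core, false});
        else                                    // core resumes computing, then issues the next request
            q.push({e.time + compute_delay, e.core, true});
    }
    std::printf("simulated %.0f microseconds\n", end * 1e6);
    return 0;
}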
Figure 2.2: Evolution of Intel's MP server chipsets
Up to 2005: four Potomac (SC) processors on the preceding NB; DDR/DDR2 memory.
2005/2006: four Paxville MP / Tulsa (DC) processors on the 8500 (Twin Castle) NB; memory attached through 4 XMBs (DDR/DDR2).
2007: four Tigerton (DC/QC) processors on the Clarksboro NB; FBDIMM/DDR2 memory.
2.1 – Intel's multicore MP server processors (2)
Figure 2.3: Quad-socket 7300 (Caneland) motherboard (Supermicro X7QC3) – four Xeon 7200 (DC) / 7300 (QC, Tigerton) sockets on dedicated FSBs to the 7300 NB, FB-DIMM (DDR2) memory up to 192 GB, SBE2 SB.
2.1 – Intel's multicore MP server processors (3)
UP: Opteron 100/1000; DP: Opteron 200/2000; MP: Opteron 800/8000
Figure 2.4: Basic structure of the Opteron family – two cores (CPU0, CPU1), each with a 1 MB L2 cache, connected through the System Request Interface and Crossbar Switch to the integrated memory controller (2 x 72-bit memory channels) and the HyperTransport links (800/8000 series: 3 coherent links; 200/2000 series: 1 coherent link).
2.1 – AMD's multicore MP server processors (1)
Figure 2.5: AMD's 4P/8P Direct Connect server architecture
2.1 – AMD's multicore MP server processors (2)
2.1 – AMD's multicore MP server processors (3)
Figure 2.6: System architecture of Intel's Nehalem processor family (Nov. 17, 2008)
On-die Memory Controller
a) A heterogeneous MCP rather than a symmetrical MCP (as in usual implementations)
The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.
The SPEs
• are optimized to run compute-intensive SIMD apps.,
• operate usually under the control of the PPE,
• run their individual apps. (threads),
• have full access to a coherent shared memory including the memory-mapped I/O space,
• can be programmed in C/C++.
Contrasting the PPE and the SPEs:
• the PPE is more adept at control-intensive tasks and quicker at task switching,
• the SPEs are more adept at compute-intensive tasks and slower at task switching.
Unique features of the Cell BE
Overview of the Cell BE (4)
b) The SPEs have an unusual storage architecture, as
• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
  o they fetch instructions from their private LS and
  o their Load/Store instructions access their LS rather than the main store,
• SPEs access main memory (the effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS, while DMA commands can be batched (up to 16 commands),
• the LS has no associated cache (a sketch of this usage follows below).
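A hedged sketch of how an SPE program typically moves data between main storage and its LS (it assumes the IBM Cell SDK's spu_mfcio.h on the SPU side; the buffer size and tag value are arbitrary):

#include <spu_mfcio.h>

#define TAG 3
static volatile float ls_buf[1024] __attribute__((aligned(128)));  // buffer in the 256 KB LS

// Pull one block from main storage (effective address ea) into the LS,
// process it, then push the results back - all explicit, no cache involved.
void process_block(unsigned long long ea)
{
    mfc_get(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);   // DMA: main store -> LS
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();                        // wait for the DMA to complete

    for (int i = 0; i < 1024; ++i)                    // Load/Store instructions hit the LS only
        ls_buf[i] *= 2.0f;

    mfc_put(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);   // DMA: LS -> main store
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}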
Overview of the Cell BE (5)
Figure: Die shot and floorplan of the Cell BE (221 mm2, 234 mtrs) [15]
3.1 Master/slave multicore processors – The Cell (3)
4. Outlook (1)
Intel's Nehalem (i7) family (Nov. 17, 2008)

Processor    Technology  Aim        Cores  Memory channels
Bloomfield   45 nm       desktop    4      triple-channel DDR3
Beckton      45 nm       MP server  8      quad-channel FB-DIMM (2)
Westmere     32 nm       desktop    4/6    triple-channel DDR3
                         DP server  4/6    quad-channel DDR3
Main features
• Integrated memory controller
• 4/6/8 cores
• Dual-threaded
• FSB replaced by a serial bus (QuickPath Interconnect)
Remarks
• HTX slots are standard interfaces connected directly to an AMD CPU's HyperTransport link. If both of these links are coherent, the device and the CPU can communicate directly with each other with cache coherency. Because of this, latency can be reduced greatly compared to other buses, enabling hardware vendors to begin to create true coprocessor technology once again [36].
• Fusion was announced in Oct. 2006, originally due in 1H 2008.
• Fusion constraints: die size, dissipation and memory bandwidth (Phil Hester: Fusion will never go to the high end due to dissipation). AMD's high-end desktop CPU dies are about 200 mm² (mainstream 120-150 mm², value around 100 mm² or less), while high-end GPU dies exceed 300 mm²; a 45 nm Fusion die can therefore integrate only a GPU core comparable in size to a midrange or value discrete GPU of the 65 nm generation. CPUs use commodity DRAM while GPUs use graphics DRAM (GDDR3/4/5), and their memory data paths differ (8 B vs 32/64 B). Roughly 1 GB/s of memory bandwidth is needed per 10 GFLOPS. Torrenza and Fusion are expected to coexist (high end: Torrenza).
• The 45 nm Fusion processor, initially promised as a 2009 chip and then moved into 2010, is essentially cancelled. The chip, described as combining a CPU and a GPU in the "Shrike" core, was found to bring only modest improvements over today's platforms in terms of power efficiency, cost and performance. Instead, the company will introduce Fusion (which actually isn't called Fusion anymore) as a 2011 model in a 32 nm version with the Llano core. Allen said that 32 nm would be the right technology to introduce the product. Llano will feature four cores, 4 MB of cache, DDR3 memory support and an integrated GPU [34], [35] (Nov. 2008).
Figure: Overview of Intel’s Tick-Tock model and the related MP servers [24]
TICK Pentium 4 /Prescott)
TOCK Pentium 4 /Irwindale) 90nm
11/2005: First DC MP Xeon
1Q/2009
7100 (Tulsa)
7300 (Tigerton QC)
7400 (Dunnington)
7xxx (Beckton)
(Potomac)
7000 (Paxville MP)
(Cransfield)
7200 (Tigerton DC)
2x1 C 1 MB L2/C 16 MB L3
2x2 C 4 MB L2/C
1x6 C 3 MB L2/2C 16 MB L3
1x8 C ¼ MB L2/C 24 MB L3
1x1 C 8 MB L2
2x1 C ½ MB L2/C
1x1 C 1 MB L2
1x2 C 4 MB L2/C
3/2005: First 64-bit MP Xeons