-
45-Year CPU Evolution: 1 Law - 2 Equations
Daniel Etiemble
LRI – Université Paris Sud
Timeline: 4004 (1971), 8086 (1978), PowerPC 601 (1992), Pentium 4 Prescott (2004), Xeon X7560 (2010), Nvidia Pascal (2016), Power9 (2017)
-
• Are there some fundamental rules?
– Moore’s Law
– Program execution time
– CMOS power dissipation
Exponential Evolution
WP3 - 2018 Daniel Etiemble 2
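Program execution time (Hennessy-Patterson):
$T_{exe} = IC \times (CPI_{CPU} + CPI_{Mem}) \times T_c = \frac{IC}{IPC \times F}$
– $IC$: instruction count
– $CPI_{CPU} + CPI_{Mem}$: average cycles per instruction (computing + waiting for data)
– $T_c$: cycle time
CMOS power dissipation:
$P_d = V_{dd} \times I_{leakage} + \alpha \times \sum_i C_i \times V_{dd}^2 \times F$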
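As a worked illustration of the execution-time equation, the sketch below evaluates both forms and checks that they agree. The function names and the numeric values (10^9 instructions, 0.5 + 0.5 cycles per instruction, 2 GHz) are illustrative assumptions, not figures from the slides.

```python
def exec_time_cpi(ic, cpi_cpu, cpi_mem, cycle_time_s):
    """T_exe = IC * (CPI_CPU + CPI_Mem) * T_c."""
    return ic * (cpi_cpu + cpi_mem) * cycle_time_s


def exec_time_ipc(ic, ipc, freq_hz):
    """T_exe = IC / (IPC * F), with IPC = 1/CPI and F = 1/T_c."""
    return ic / (ipc * freq_hz)


# Hypothetical workload, chosen only for illustration.
IC = 1_000_000_000            # instructions
CPI_CPU, CPI_MEM = 0.5, 0.5   # cycles/instruction: computing, waiting for data
F = 2e9                       # clock frequency in Hz

t1 = exec_time_cpi(IC, CPI_CPU, CPI_MEM, 1 / F)
t2 = exec_time_ipc(IC, 1 / (CPI_CPU + CPI_MEM), F)
print(f"T_exe = {t1:.3f} s (CPI form), {t2:.3f} s (IPC form)")
```

Halving CPI_Mem, doubling F or halving IC each shorten T_exe, which is the thread the following slides pull on one term at a time.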
-
Moore’s law
• The number of transistors per chip doubles every N months (N = 12, 18 or 24)
• N increases over time
[Figure: time axis marked 2010, 2020, 2030]
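To see how much the doubling period N matters, here is a minimal sketch of the doubling rule. The function name and the ten-year horizon are assumptions made for the example.

```python
def transistor_count(n0, years, doubling_months):
    """Transistors per chip after `years`, doubling every `doubling_months` months."""
    return n0 * 2 ** (12 * years / doubling_months)


# Growth factor over one decade for the three doubling periods on the slide.
for n_months in (12, 18, 24):
    factor = transistor_count(1, 10, n_months)
    print(f"N = {n_months} months: x{factor:.0f} in 10 years")
```

Going from N = 12 to N = 24 months turns a roughly thousandfold gain per decade into a 32-fold one.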
-
Technological nodes
• From one node to the next one (rough approximation)
– Gate delay divided by 1.4 => higher clock frequencies
– Increase of transistor count per unit area
• Consequences
– From multi-chips to one chip
– Larger on-chip memories
– More functionalities per chip
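The "gate delay /1.4" rule compounds from node to node. The sketch below shows the resulting clock gain under the assumption that gate delay is the only limit on frequency (no heat or wire-delay effects); the function name is hypothetical.

```python
def clock_gain(node_steps, delay_shrink=1.4):
    """Clock-frequency gain after `node_steps` node transitions, if gate delay
    is divided by `delay_shrink` at each step and nothing else limits F."""
    return delay_shrink ** node_steps


for steps in (1, 2, 5, 10):
    print(f"{steps:>2} node(s): clock x{clock_gain(steps):.1f}")
```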
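[Figure: CMOS technological nodes (nm) by year, log scale from 1 to 10000; values paired with years in the order shown on the chart]
Year  Node (nm)
1975  3000
1982  1500
1985  1000
1989  800
1994  600
1995  350
1998  250
1999  180
2000  130
2002  90
2006  65
2008  45
2010  32
2012  22
2014  14
2017  10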
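The node values from the chart can be used to check the "transistor count per unit area" claim. This sketch computes the linear shrink between successive nodes and the ideal density gain, assuming pure geometric scaling (density proportional to 1/size²), which is a simplification.

```python
# Node sizes in nm, in the order they appear in the chart (1975 to 2017).
NODES_NM = [3000, 1500, 1000, 800, 600, 350, 250, 180,
            130, 90, 65, 45, 32, 22, 14, 10]


def shrink_ratios(nodes):
    """Linear shrink ratio between each node and the previous one."""
    return [new / old for old, new in zip(nodes, nodes[1:])]


def density_gains(nodes):
    """Ideal gain in transistors per unit area: 1 / ratio**2."""
    return [1 / r ** 2 for r in shrink_ratios(nodes)]


steps = zip(NODES_NM, NODES_NM[1:], density_gains(NODES_NM))
for old, new, gain in steps:
    print(f"{old:>5} -> {new:>5} nm: shrink {new / old:.2f}, density x{gain:.1f}")
```

From 130 nm onward the shrink stays close to 0.7 per step, that is, about twice as many transistors per unit area.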
But MISMATCHES => Cache hierarchies
[Figure: bar chart on a 0% to 60% scale comparing DRAM capacity, 1/DRAM latency, DRAM bandwidth, CPU performance (before 1987), CPU performance (after 1986) and CPU clock frequency]
-
Evolution of cache hierarchies
[Figure: block diagrams of successive cache hierarchies]
– 386: CPU, cache, memory
– Pentium, Pentium 4: CPU, separate L1-I and L1-D caches, unified L2 cache, memory
– Power6 (2 cores): per-core L1-I/D caches and L2 cache, L3 cache, memory
– Power 8 (8 cores): per-core L1-I/D caches and L2 cache, L3 caches, L4 cache, memory
– Issues: latency & bandwidth, NUCA
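Why add cache levels? A common way to reason about it is the average memory access time, which feeds the CPI_Mem term. The sketch below is a textbook-style model; all latencies and miss rates are hypothetical values picked for illustration, not measurements of the processors named above.

```python
def average_access_cycles(levels, memory_cycles):
    """Average memory access time in cycles for a cache hierarchy.

    `levels` is a list of (hit_cycles, miss_rate) pairs ordered from L1
    outward; each miss rate is local to its level.
    """
    amat = memory_cycles
    for hit_cycles, miss_rate in reversed(levels):
        amat = hit_cycles + miss_rate * amat
    return amat


# Hypothetical latencies (cycles) and local miss rates, for illustration only.
MEMORY = 200
one_level = [(4, 0.05)]
three_levels = [(4, 0.05), (12, 0.30), (40, 0.50)]

print(f"L1 only:    {average_access_cycles(one_level, MEMORY):.1f} cycles")
print(f"L1+L2+L3:   {average_access_cycles(three_levels, MEMORY):.1f} cycles")
```

With these assumed numbers, the extra levels cut the average access from 14 to 6.7 cycles, because most L1 misses no longer pay the full memory latency.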
-
Improving Performance: $T_{exe} = \frac{IC}{IPC \times F}$
Frequency
• Increase F (technological nodes)
– Gate delay divided by 1.4 => higher clock frequencies
– Increased performance
• Increase F
– Increased power and power density
– $P_d = V_{dd} \times I_{leakage} + \alpha \times \sum_i C_i \times V_{dd}^2 \times F$
• HEAT WALL
– 2017: F in the 3-4 GHz range, except the water-cooled IBM z14 CPU (5.2 GHz)
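The link between F and the heat wall can be made concrete with the power equation. This is a minimal sketch: the chip parameters are hypothetical, leakage current is held constant for simplicity, and the 20% supply-voltage increase needed to reach the higher clock is an assumption.

```python
def power_dissipation(vdd, i_leakage, alpha, capacitances, freq_hz):
    """P_d = Vdd * I_leakage + alpha * sum(C_i) * Vdd**2 * F, in watts."""
    static = vdd * i_leakage
    dynamic = alpha * sum(capacitances) * vdd ** 2 * freq_hz
    return static + dynamic


# Hypothetical chip: 1 V supply, 5 A leakage, activity 0.1, 100 nF switched.
base = power_dissipation(1.0, 5.0, 0.1, [100e-9], 2e9)
# Doubling F at constant Vdd doubles only the dynamic term.
fast = power_dissipation(1.0, 5.0, 0.1, [100e-9], 4e9)
# Assumption: reaching 2x F also needs a 20% higher Vdd.
fast_hv = power_dissipation(1.2, 5.0, 0.1, [100e-9], 4e9)

print(f"2 GHz, 1.0 V: {base:.1f} W")
print(f"4 GHz, 1.0 V: {fast:.1f} W")
print(f"4 GHz, 1.2 V: {fast_hv:.1f} W")
```

Because Vdd enters squared, any voltage increase needed to sustain a higher clock makes power grow faster than performance.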
-
Improving Performance: $T_{exe} = \frac{IC}{IPC \times F}$
• Scalar CPU (IPC ≤ 1)
– In-order CPUs
– Out-of-order CPUs
• VLIW CPU (IPC > 1)
• Intrinsic limits of ILP in a sequential program & "HW diminishing returns"
– Larger buffers to find µops to issue, but the issue width remains constant!
ILP in mono-processor CPU
[Figure: out-of-order pipeline: fetch, decode, instruction window, functional units (FU), reorder buffer, retirement]
[Figure: log-scale chart (1 to 256) of ROB size, reservation stations, integer physical registers and µops/cycle for Pentium Pro (1995), Pentium 4 (2000), Sandy Bridge (2011) and Haswell (2013); annotations: ROB renaming, register renaming; issue rate stays at 3 to 4 µops/clock]
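The "intrinsic limits of ILP" point can be illustrated with a dataflow bound: however large the buffers, IPC cannot exceed the number of instructions divided by the longest dependency chain, nor the issue width. The sketch assumes unit latency for every instruction; the dependency list is a made-up example.

```python
def ilp_bound(deps):
    """Upper bound on IPC from data dependencies alone.

    `deps[i]` lists the earlier instructions that instruction i depends on.
    Assumes unit latency and unlimited hardware, so the bound is
    (number of instructions) / (length of the longest dependency chain).
    """
    depth = []
    for producers in deps:
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return len(deps) / max(depth)


def achievable_ipc(deps, issue_width):
    """IPC cannot exceed either the issue width or the dependency bound."""
    return min(issue_width, ilp_bound(deps))


# Hypothetical 8-instruction fragment: two independent chains of 4 instructions.
chains = [[], [0], [1], [2], [], [4], [5], [6]]
for width in (1, 2, 4, 8):
    print(f"issue width {width}: IPC <= {achievable_ipc(chains, width)}")
```

Past an issue width of 2, this fragment gains nothing: the hardware is wider, but the program has no more parallelism to offer.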
-
Improving Performance: $T_{exe} = \frac{IC}{IPC \times F}$
Data parallelism in mono-processor CPU
• CPU: SIMD, 1 instruction with several data
– SSE2/3/4, Neon, Altivec, AVX, AVX2, AVX-512…
– [Figure: X0 X1 X2 X3 (source 1/dest.) op Y0 Y1 Y2 Y3 (source 2) gives X0 op Y0, X1 op Y1, X2 op Y2, X3 op Y3 (source 1/dest.)]
• GPU: SIMT, 1 instruction for several threads
– [Figure: instruction decoder and warp scheduler driving several threads, each with its own registers]
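The SIMD diagram can be mimicked in a few lines. This is only a behavioural sketch in plain Python (no real vector instructions are issued); the function names are hypothetical. It also shows how SIMD acts on the IC term of the execution-time equation.

```python
import math


def simd_op(xs, ys, op):
    """One SIMD instruction: apply `op` lane by lane; the result replaces source 1."""
    return [op(x, y) for x, y in zip(xs, ys)]


def instruction_count(n_elements, lanes):
    """Instructions needed to process n_elements with `lanes` elements each."""
    return math.ceil(n_elements / lanes)


x = [1, 2, 3, 4]          # source 1 / destination
y = [10, 20, 30, 40]      # source 2
x = simd_op(x, y, lambda a, b: a + b)
print(x)

print(instruction_count(1024, 1), "scalar instructions vs",
      instruction_count(1024, 4), "4-lane SIMD instructions")
```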
-
CPU + GPU
• Two different programming models
• Two chips or one chip?
– CPU + GPU or APU
CNAM seminar – 15/12/2015 Daniel Etiemble
[Figure: CPU with its CPU memory and GPU with its GPU memory, connected by PCI Express; labelled 2014]
-
Improving Performance: $T_{exe} = \frac{IC}{IPC \times F}$
Data and/or thread parallelism in parallel architectures
• Switching from sequential to parallel programming
– Limited to servers and supercomputers during the "free lunch" period
• Multiprocessors: OpenMP, Pthreads
• Multi-computers: MPI
• Dispatch IC among several processors (or cores)
– Except for simple cases or embarrassingly parallel applications, the dispatch depends on the architecture, the programming model, Amdahl's law, etc.
– In some architectures, communication times must be included.
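Amdahl's law, mentioned above, bounds what dispatching IC over several cores can achieve. A minimal sketch follows; the 90% parallel fraction is an assumed value, and communication time is ignored.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup when a fraction of the work is spread over n cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)


# Hypothetical program with 90% of its execution time parallelizable.
for cores in (2, 8, 64, 1024):
    print(f"{cores:>4} cores: speedup x{amdahl_speedup(0.9, cores):.2f}")
```

With 10% of serial work, the speedup stays below 10 whatever the number of cores.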
-
Free lunch… (by Intel)
[Figure labels: F, IPC, IC (SIMD)]
-
From mono-processors to multi-cores
• Performance-power trade-off
• Intrinsic limitations of mono-processors, even multithreaded
• Multi-processors to multi-cores
• Clusters of multi-cores.
-
Multithreaded CPUs
$T_{exe} = \frac{IC}{IPC \times F}$
• Sequential programs
• Several programs (multiprogramming): $IC = \sum_i IC_i$
• Several threads (TLP): $IC = \sum_i IC_i$
• Fine-grain multithreading
– Switching in one clock cycle from one thread to the next one on pipeline hazards
– Sun Niagara, Oracle servers
• Simultaneous multithreading
– Issuing instructions from different threads at each clock cycle
– Intel Hyper-Threading (2 threads), IBM Power (2 to 8 threads)
• Multithreading reduces $CPI_{Mem}$
• Multi-cores with multithreaded cores
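How multithreading reduces CPI_Mem can be sketched with a simple saturation model. The assumptions are strong and stated here: every thread alternates a fixed number of computing cycles and memory-stall cycles, and the core switches to another ready thread at no cost. The cycle counts are hypothetical.

```python
def pipeline_utilization(n_threads, run_cycles, stall_cycles):
    """Fraction of cycles doing useful work when each thread alternates
    `run_cycles` of computing and `stall_cycles` of waiting for memory,
    and the core switches to another ready thread at no cost."""
    return min(1.0, n_threads * run_cycles / (run_cycles + stall_cycles))


def effective_cpi(cpi_cpu, n_threads, run_cycles, stall_cycles):
    """Aggregate cycles per instruction seen by the core over all threads."""
    return cpi_cpu / pipeline_utilization(n_threads, run_cycles, stall_cycles)


# Hypothetical thread behaviour: 20 cycles of work, then a 60-cycle memory stall.
for threads in (1, 2, 4, 8):
    cpi = effective_cpi(1.0, threads, 20, 60)
    print(f"{threads} thread(s): effective CPI = {cpi:.2f}")
```

In this model four threads are enough to hide the stalls entirely; more threads bring nothing further.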
-
Reducing static power dissipation
$P_{d,static} = V_{dd} \times I_{leakage}$
• TECHNOLOGY
– Ex: Intel Tri-gate
• CIRCUITRY
– Ex: virtual Vss, 65-nm Xeon CPU L3 cache
-
Reducing dynamic power dissipation
$P_{d,dynamic} = \alpha \times \sum_i C_i \times V_{dd}^2 \times F$
• Parameters to act on: $V_{dd}$, $F$, $\alpha$
• Power domains
• Clock domains
• Several operating modes per block. Ex: 5th-generation Intel Core processors
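Per-block operating modes can be pictured as switching terms of the power equation on and off. The sketch below is idealized: the three mode names, the `Block` class and its values are assumptions for illustration, and real power gating leaves some residual leakage.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One on-chip block with its own power and clock domain (hypothetical)."""
    vdd: float          # supply voltage (V)
    i_leakage: float    # leakage current (A)
    alpha: float        # activity factor
    capacitance: float  # switched capacitance (F)
    freq_hz: float      # clock frequency (Hz)

    def power(self, mode="active"):
        """Power in watts for 'active', 'clock_gated' or 'power_gated'."""
        if mode == "power_gated":      # supply cut: no leakage, no switching
            return 0.0
        static = self.vdd * self.i_leakage
        if mode == "clock_gated":      # clock stopped: leakage only
            return static
        if mode == "active":
            dynamic = self.alpha * self.capacitance * self.vdd ** 2 * self.freq_hz
            return static + dynamic
        raise ValueError(f"unknown mode: {mode}")


block = Block(vdd=1.0, i_leakage=0.5, alpha=0.1, capacitance=10e-9, freq_hz=2e9)
for mode in ("active", "clock_gated", "power_gated"):
    print(f"{mode:>12}: {block.power(mode):.2f} W")
```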
-
What about the future?
• Moore's law: towards fundamental limits
• Execution time:
– F: significant changes are doubtful
– IPC:
   • Limits of ILP in cores
   • New PIM architectures? (Data access issues: "Memory Wall")
– IC continues to decrease
   • More work per instruction
      – SIMD width (256, 512, 1024… bits) and data size (16-bit FP)
      – New 2D instructions: Tensor cores (Nvidia), Matrix Multiplication Unit (Google TPU)…
• More cores
– From multi-cores to many-cores. Exponential increase in the number of cores?
• Power dissipation
– As long as CMOS is used…
-
Concluding remarks
• The main trends (not the details) of CPU evolution can be explained by
– Moore's law
– $T_{exe} = IC \times (CPI_{CPU} + CPI_{Mem}) \times T_c = \frac{IC}{IPC \times F}$
– $P_d = V_{dd} \times I_{leakage} + \alpha \times \sum_i C_i \times V_{dd}^2 \times F$
• Valid for software-programmable processors as long as CMOS technology is used.
• Mixed HW-SW architectures (FPGA) are more complex to model.
• The evolution of CPU architectures is driven by new specific applications (AI, IoT, …).
• Business… (as usual). Ex: "proprietary" versus "open-source" ISAs.