Download - 45-year CPU Evolution: 1 Law -2 Equationswp3workshop.website/pdfs/WP3_etiemble_presentation.pdf · WP3 -2018 Daniel Etiemble 7 1 4 16 64 256 PENTIUM PRO PENTIUM4 SANDY BRIDGE HASWELL

45-year CPU Evolution: 1 Law -2 Equations

Daniel Etiemble

LRI – Université Paris Sud

4004 PowerPC 601 Pentium 4 Prescott

8086

Xeon X7560 Nvidia PascalPower9

19711978 1992

2004

2010 2017 2016

• Are there some fundamental rules?

– Moore’s Law

– Program execution time

– CMOS power dissipation

Exponential Evolution

WP3 - 2018 Daniel Etiemble 2

��= �� ∗ (�� + ��) ∗ �� = ��

�� ∗�

Hennessy-Patterson

Instruction count Avg Cycles/InstructionComputing – Waiting data

Cycle time

�� = �� ∗ �� + � ∗ � �� ∗ �� ∗ �

Moore’s law

3WP3 - 2018 Daniel Etiemble

Nb transistors/chip doubles everyN months (12/18/24)

N increases

2010 2020 2030

Technological nodes

• From one node to the next one – (rough approximation)

– Gate delay /1.4 => Higher clock frequencies

– Increase of transistor count per area unit.

• Consequences– From multi-chips to one chip

– Larger on-chip memories

– More functionalities per chip


3000

15001000 800

600350

250180

13090

6545

3222

1410

1

10

100

1000

10000

1975198219851989199419951998199920002002200620082010201220142017

CMOS technological nodes (nm)

0% 20% 40% 60%

DRAM capacity

1/DRAM latency

DRAM bandwidth

CPU performance(after 1986)

CPU performance (before 1987)

CPU clock frequency

But MISMATCHES

Cache hierarchies

Evolution of cache hierarchies


Memory

L1-Dcache

L1-I cache

UnifiedL2

cache

CPU

Memory

L1-Dcache L2

CacheL1-I

cache

CPU

MemoryCacheCPU

386

Pentium

Pentium 4

Memory

L2 cache L3

Cache

L1-IDcache

L2 cache

L1-IDcache

Power6

8 cores Memory

L2 cache

L4Cache

L1-IDCPU

L2 cache

L1-IDCPU

Power 8

L3 cache

L3 cache

2 cores

�� Latency & Bandwidth

NUCA

Improving Performance ��= ��

�� ∗�


Increase F (Technological nodes)- Gate delay /1.4 => Higher clock frequencies- Increased Performance

Increase F- Increased Power and Power density

�� = �� ∗ �� + � ∗ � �� ∗ �� ∗ �

HEAT WALL

2017 : F in the 3-4 GHz rangeExcept water cooled IBM z14 CPU (5.2 GHz)

Frequency


�� ∗�

• Scalar CPU (IPC 1)– In-order CPUs

– Out-of-order CPU

• VLIW CPU (IPC >1)

• Intrinsic limits of ILP in a sequential program & “ HW diminishing return”.– Larger buffer to extract

µops to launch, but launching width remains constant !


1

4

16

64

256

PENTIUM PRO PENTIUM4 SANDY BRIDGE HASWELL

ROB µop/cycle

Reservation stations Int. Physical Registers

ROB Renaming Register renaming

1995 2000 2011 2013

3 to 4 µops/clock

ILP in mono-processor CPU

Fetch Decode

Instruction window

FU

FU

FU

ReorderBuffer Retirement

Out-of-order


�� ∗�


Data Parallelism in mono-processor CPU

registers

Instruction Decoder and Warp Scheduler

SIMT1 instruction for severalthreads

registers registers registers registers registers

thread

GPU

X0X1X2X3

Y0Y1Y2Y3

op op op op

X0 op Y0X1 op Y1X2 op Y2X3 op Y3

source 1/dest.

source 2

source 1/dest.

CPUSIMD1 instruction with several data

SSE2/3/4 – Neon – AltivecAVX – AVX2 – AVX 512…

CPU + GPU

• Two different programming models

• Two chips or one chip?

– CPU + GPU or APU

Séminaire CNAM – 15/12/2015 Daniel Etiemble 9

CPU

CPU Memory

GPU Memory

GPU

PCI express

2014


�� ∗�


Data or/and Thread Parallelism in Parallel Architectures

Switching from sequential to parallel programming- Limited to servers and super-computers during the « free lunch » period

Multiprocessors Multi-computers

OpenMPPthreads

MPI

Dispatch IC among several processors (or cores)- Except for simple cases or embarrasingly parallel applications,

the dispatch depends on the architecture, the programming model, Amdahl law, etc.

- In some architectures, communication times must be included.

Free lunch… (by Intel)


F

IPC

IC (SIMD)

From mono-processors to multi-cores

• Performance –Power trade-off

• Intrinsic limitations of mono-processors, even multithreaded

• Multi-processors to multi-cores

• Clusters of multi-cores.


Multithreaded CPUs


��= ��

�� ∗�

- Sequential programs- Several programs (multiprogramming): IC = ∑ICi- Several threads (TLP) : IC = ∑ICi

Fine grain multithreading- Switching in one clock from one thread to next one on

pipeline hazards- Sun Niagara, Oracle serversSimultaneous multithreading- Issuing instructions from different threads at each clock- Intel Hyperthreading (2), IBM Power (2 to 8)

Multithreading reduces CPIMem

Multi-cores with multithreaded cores

Reducing static power dissipation


�� = �� ∗ ��

TECHNOLOGY- Ex: Intel Tri-gate

CIRCUITERY- Ex: Virtual Vss65-nm Xeon CPU L3 cache

Reducing dynamic power dissipation


�� = � ∗ � �� ∗ �� ∗ �

Power domains

Clock domains

Vdd

F

�

Several Operating Modes per blocks:Ex: 5th generationIntel Cores


What about future?

• Moore’s law: towards fundamental limits• Execution time:

– F: significant changes are doubtful– IPC:

• Limits of ILP in cores• New PIM architectures ? (Data access issues: “Memory Wall”)

– IC continues to decrease• more job per instruction

– SIMD width (256-512-1024…) and data size (16-bit FP)– New 2D instructions: Tensor cores (Nvidia), Matrix Multiplication Unit

(Google TPU)…

• More cores– From multi-cores to many-cores. Exponential increase of core number???

• Power dissipation– As long as CMOS will be used…

Concluding remarks

• Main trends (not details) of CPU evolution can be explained by – Moore’s law

– ��= �� ∗ (�� + ��) ∗ �� = ��

�� ∗�

– �� = �� ∗ �� + � ∗ ∑ �� ∗ �� ∗ �

• Valid for software programmable processors as long as CMOS technology will be used.

• Mixed HW-SW architectures (FPGA) are more complex to modelize.

• Evolution of CPU architectures is driven by new specific applications (AI, IoT, …).

• Business… (as usual). Ex: “proprietary” versus “open-source” ISAs.


Download - 45-year CPU Evolution: 1 Law -2 Equationswp3workshop.website/pdfs/WP3_etiemble_presentation.pdf · WP3 -2018 Daniel Etiemble 7 1 4 16 64 256 PENTIUM PRO PENTIUM4 SANDY BRIDGE HASWELL

Top Related