45-year cpu evolution: 1 law -2 equationswp3workshop.website/pdfs/wp3_etiemble_presentation.pdf ·...

17
45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI – Université Paris Sud 4004 PowerPC 601 Pentium 4 Prescott 8086 Xeon X7560 Nvidia Pascal Power9 1971 1978 1992 2004 2010 2017 2016

Upload: others

Post on 12-Feb-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • 45-year CPU Evolution: 1 Law -2 Equations

    Daniel Etiemble

    LRI – Université Paris Sud

    4004 PowerPC 601 Pentium 4 Prescott

    8086

    Xeon X7560 Nvidia PascalPower9

    19711978 1992

    2004

    2010 2017 2016

  • • Are there some fundamental rules?

    – Moore’s Law

    – Program execution time

    – CMOS power dissipation

    Exponential Evolution

    WP3 - 2018 Daniel Etiemble 2

    ���= �� ∗ (������ + ������) ∗ �� = ��

    ��� ∗�

    Hennessy-Patterson

    Instruction count Avg Cycles/InstructionComputing – Waiting data

    Cycle time

    �� = ��� ∗ �������� + � ∗ � �� ∗ ���� ∗ �

  • Moore’s law

    3WP3 - 2018 Daniel Etiemble

    Nb transistors/chip doubles everyN months (12/18/24)

    N increases

    2010 2020 2030

  • Technological nodes

    • From one node to the next one – (rough approximation)

    – Gate delay /1.4 => Higher clock frequencies

    – Increase of transistor count per area unit.

    • Consequences– From multi-chips to one chip

    – Larger on-chip memories

    – More functionalities per chip

    WP3 - 2018 Daniel Etiemble 4

    3000

    15001000 800

    600350

    250180

    13090

    6545

    3222

    1410

    1

    10

    100

    1000

    10000

    1975198219851989199419951998199920002002200620082010201220142017

    CMOS technological nodes (nm)

    0% 20% 40% 60%

    DRAM capacity

    1/DRAM latency

    DRAM bandwidth

    CPU performance(after 1986)

    CPU performance (before 1987)

    CPU clock frequency

    But MISMATCHES

    Cache hierarchies

  • Evolution of cache hierarchies

    WP3 - 2018 Daniel Etiemble 5

    Memory

    L1-Dcache

    L1-I cache

    UnifiedL2

    cache

    CPU

    Memory

    L1-Dcache L2

    CacheL1-I

    cache

    CPU

    MemoryCacheCPU

    386

    Pentium

    Pentium 4

    Memory

    L2 cache L3

    Cache

    L1-IDcache

    L2 cache

    L1-IDcache

    Power6

    8 cores Memory

    L2 cache

    L4Cache

    L1-IDCPU

    L2 cache

    L1-IDCPU

    Power 8

    L3 cache

    L3 cache

    2 cores

    �������� ������Latency & Bandwidth

    NUCA

  • Improving Performance ���= ��

    ��� ∗�

    WP3 - 2018 Daniel Etiemble 6

    Increase F (Technological nodes)- Gate delay /1.4 => Higher clock frequencies- Increased Performance

    Increase F- Increased Power and Power density

    �� = ��� ∗ �������� + � ∗ � �� ∗ ���� ∗ �

    HEAT WALL

    2017 : F in the 3-4 GHz rangeExcept water cooled IBM z14 CPU (5.2 GHz)

    Frequency

  • Improving Performance ���= ��

    ��� ∗�

    • Scalar CPU (IPC 1)– In-order CPUs

    – Out-of-order CPU

    • VLIW CPU (IPC >1)

    • Intrinsic limits of ILP in a sequential program & “ HW diminishing return”.– Larger buffer to extract

    µops to launch, but launching width remains constant !

    WP3 - 2018 Daniel Etiemble 7

    1

    4

    16

    64

    256

    PENTIUM PRO PENTIUM4 SANDY BRIDGE HASWELL

    ROB µop/cycle

    Reservation stations Int. Physical Registers

    ROB Renaming Register renaming

    1995 2000 2011 2013

    3 to 4 µops/clock

    ILP in mono-processor CPU

    Fetch Decode

    Instruction window

    FU

    FU

    FU

    ReorderBuffer Retirement

    Out-of-order

  • Improving Performance ���= ��

    ��� ∗�

    WP3 - 2018 Daniel Etiemble 8

    Data Parallelism in mono-processor CPU

    registers

    Instruction Decoder and Warp Scheduler

    SIMT1 instruction for severalthreads

    registers registers registers registers registers

    thread

    GPU

    X0X1X2X3

    Y0Y1Y2Y3

    op op op op

    X0 op Y0X1 op Y1X2 op Y2X3 op Y3

    source 1/dest.

    source 2

    source 1/dest.

    CPUSIMD1 instruction with several data

    SSE2/3/4 – Neon – AltivecAVX – AVX2 – AVX 512…

  • CPU + GPU

    • Two different programming models

    • Two chips or one chip?

    – CPU + GPU or APU

    Séminaire CNAM – 15/12/2015 Daniel Etiemble 9

    CPU

    CPU Memory

    GPU Memory

    GPU

    PCI express

    2014

  • Improving Performance ���= ��

    ��� ∗�

    WP3 - 2018 Daniel Etiemble 10

    Data or/and Thread Parallelism in Parallel Architectures

    Switching from sequential to parallel programming- Limited to servers and super-computers during the « free lunch » period

    Multiprocessors Multi-computers

    OpenMPPthreads

    MPI

    Dispatch IC among several processors (or cores)- Except for simple cases or embarrasingly parallel applications,

    the dispatch depends on the architecture, the programming model, Amdahl law, etc.

    - In some architectures, communication times must be included.

  • Free lunch… (by Intel)

    WP3 - 2018 Daniel Etiemble 11

    F

    IPC

    IC (SIMD)

  • From mono-processors to multi-cores

    • Performance –Power trade-off

    • Intrinsic limitations of mono-processors, even multithreaded

    • Multi-processors to multi-cores

    • Clusters of multi-cores.

    WP3 - 2018 Daniel Etiemble 12

  • Multithreaded CPUs

    WP3 - 2018 Daniel Etiemble 13

    ���= ��

    ��� ∗�

    - Sequential programs- Several programs (multiprogramming): IC = ∑ICi- Several threads (TLP) : IC = ∑ICi

    Fine grain multithreading- Switching in one clock from one thread to next one on

    pipeline hazards- Sun Niagara, Oracle serversSimultaneous multithreading- Issuing instructions from different threads at each clock- Intel Hyperthreading (2), IBM Power (2 to 8)

    Multithreading reduces CPIMem

    Multi-cores with multithreaded cores

  • Reducing static power dissipation

    WP3 - 2018 Daniel Etiemble 14

    �� ������ = ��� ∗ ��������

    TECHNOLOGY- Ex: Intel Tri-gate

    CIRCUITERY- Ex: Virtual Vss65-nm Xeon CPU L3 cache

  • Reducing dynamic power dissipation

    WP3 - 2018 Daniel Etiemble 15

    �� ������� = � ∗ � �� ∗ ���� ∗ �

    Power domains

    Clock domains

    Vdd

    F

    Several Operating Modes per blocks:Ex: 5th generationIntel Cores

  • WP3 - 2018 Daniel Etiemble 16

    What about future?

    • Moore’s law: towards fundamental limits• Execution time:

    – F: significant changes are doubtful– IPC:

    • Limits of ILP in cores• New PIM architectures ? (Data access issues: “Memory Wall”)

    – IC continues to decrease• more job per instruction

    – SIMD width (256-512-1024…) and data size (16-bit FP)– New 2D instructions: Tensor cores (Nvidia), Matrix Multiplication Unit

    (Google TPU)…

    • More cores– From multi-cores to many-cores. Exponential increase of core number???

    • Power dissipation– As long as CMOS will be used…

  • Concluding remarks

    • Main trends (not details) of CPU evolution can be explained by – Moore’s law

    – ���= �� ∗ (������ + ������) ∗ �� = ��

    ��� ∗�

    – �� = ��� ∗ �������� + � ∗ ∑ �� ∗ ���� ∗ �

    • Valid for software programmable processors as long as CMOS technology will be used.

    • Mixed HW-SW architectures (FPGA) are more complex to modelize.

    • Evolution of CPU architectures is driven by new specific applications (AI, IoT, …).

    • Business… (as usual). Ex: “proprietary” versus “open-source” ISAs.

    WP3 - 2018 Daniel Etiemble 17