Multicore Architectures (Michael Gerndt)


Slide 1: Multicore Architectures, Michael Gerndt

Slide 2: Development of Microprocessors
• Transistor capacity doubles every 18 months (© Intel)

Slide 3: Development of Microprocessors
• Moore's Law: estimated to hold for at least the next 10 years
• But: only the transistor count grows; power limits how it can be used
• How to use the transistor resources?
  - a better execution core: enhanced pipelining, superscalarity, better vector processing (SIMD, like MMX/SSE); problem: the gap to memory speed
  - larger caches: improve memory access speed
  - more execution cores; problem: the gap to memory speed

Slide 4: Development of Microprocessors
• Objective for manufacturers: as much profit as possible, i.e. sell processors
• Customers only buy when their applications run faster, so CPU power must increase
• How to increase CPU power:
  - higher clock rate
  - more parallelism: Instruction Level Parallelism (ILP), Thread Level Parallelism (TLP)

Slide 5: Development of Microprocessors
• Higher clock rates
  - increase power consumption in proportion to f and U²; a higher frequency needs a higher voltage (a standard formulation of this relation follows after slide 7)
  - small structures: energy loss through leakage
  - increase heat output and cooling requirements
  - limit the chip size (speed of light) at a fixed technology (e.g. 60 nm)
  - leave fewer transistor levels per pipeline stage, so more, simplified pipeline stages are needed (P4: >30 stages)
  - raise the penalty of pipeline stalls (on conflicts, e.g. branch misprediction)

Slide 6: Development of Microprocessors
• More parallelism
  - increased bit width (now: 64-bit architectures)
  - SIMD (a short intrinsics sketch follows after slide 7)
  - Instruction Level Parallelism (ILP): exploits parallelism found in an instruction stream; limited by data/control dependencies; can be increased by speculation
  - average ILP in typical programs: 6-7; modern superscalar processors cannot do much better

Slide 7: Development of Microprocessors
• More parallelism: Thread Level Parallelism (TLP)
  - hardware multithreading (e.g. SMT: Hyper-Threading): better exploitation of the superscalar execution units
  - multiple cores: legacy software must be parallelized (see the OpenMP sketch below); a challenge for the whole software industry; Intel moved into the tools business
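Slide 5's claim about frequency and voltage is the standard CMOS dynamic-power relation; a minimal statement of it, with the usual symbols (these are not defined on the slides):

```latex
% Dynamic (switching) power of CMOS logic:
%   \alpha = activity factor, C = switched capacitance,
%   U = supply voltage, f = clock frequency
P_{\mathrm{dyn}} \approx \alpha \, C \, U^{2} \, f
% Since a higher f generally requires a higher U, power grows much
% faster than linearly in f (roughly f^3 if U must scale with f).
```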
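To make slide 6's SIMD bullet concrete, a minimal sketch in C using SSE intrinsics; the function and array names are illustrative, not from the slides:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Adds two float arrays four elements per iteration: one 128-bit SSE
 * instruction performs four additions at once. Assumes n % 4 == 0. */
void add4(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb)); /* 4 adds, 1 store */
    }
}
```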
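And for slide 7's point that legacy software must be parallelized before extra cores pay off, a minimal OpenMP sketch (a hypothetical loop; compile with e.g. gcc -fopenmp):

```c
/* One pragma turns a serial loop into a multithreaded one: OpenMP
 * distributes the iterations across the available cores (TLP). */
void scale(double *v, double s, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        v[i] *= s;
}
```

Without the pragma (or without -fopenmp) the same code runs serially on one core, which is exactly the legacy-software situation the slide describes.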
Slide 8: Multicore Architectures
• SMPs on a single chip: Chip Multi-Processors (CMP)
• Advantages
  - efficient exploitation of the available transistor budget
  - improved throughput, and improved speed of parallelized applications
  - tight coupling of cores: better communication between cores than in an SMP; shared caches
  - low power consumption: low clock rates; idle cores can be suspended
• Disadvantages
  - only improves the speed of parallelized applications
  - increased gap to memory speed

Slide 9: Multicore Architectures
• Design decisions
  - homogeneous vs. heterogeneous: specialized accelerator cores for SIMD, GPU operations, cryptography, DSP functions (e.g. FFT), FPGAs (programmable circuits)
  - access to memory: own memory area (distributed memory) or via the cache hierarchy (shared memory)
  - connection of the cores: internal bus / crossbar
  - cache architecture

Slide 10: Multicore Architectures: Examples
[Diagrams: a homogeneous design, cores with L1/L2 and shared L3 connected by a crossbar to two memory modules and I/O; and a heterogeneous design, one 2x SMT core with L1/L2 plus cores with local stores, a memory module, and I/O on a ring bus]

Slide 11: Shared Cache Design
[Diagram: traditional design, multiple single-core processors (core, L1, L2, switch) with the shared cache off-chip, vs. a multicore architecture with shared caches on-chip]

Slide 12: Shared Cache Design
[Diagram: multicore architecture with a shared on-chip cache below the per-core L1s]

Slide 13: Shared Caches: Advantages
• No coherence protocol at the shared cache level
• Lower communication latency
• For processors with overlapping working sets: one processor may prefetch data for the other; a smaller cache size suffices
• Better usage of loaded cache lines before eviction (spatial locality): less congestion on the limited memory connection
• Dynamic sharing: if one processor needs less space, the other can use more
• Avoidance of false sharing (illustrated in the sketch after slide 21)

Slide 14: Shared Caches: Disadvantages
• Multiple CPUs impose higher requirements: higher bandwidth; the cache should be larger, and larger means higher latency
• Higher hit latency due to the switch logic above the cache
• More complex design
• One CPU can evict the data of another CPU

Slide 15: Multicore Processors
• Sun UltraSPARC IV / IV+: dual core, 2x multithreaded per core
• UltraSPARC T1 (Niagara): 8 cores, 4x multithreaded per core, one FPU shared by all cores, low power
• UltraSPARC T2 (Niagara 2)

Slide 16: Intel Itanium 2 Dual Core (Montecito)
• Two Itanium 2 cores
• Multithreading (2 threads): simultaneous multithreading for the memory hierarchy resources, temporal multithreading for the core resources; besides the end of a time slice, an event, typically an L3 cache miss, can trigger a thread switch
• Caches: L1D 16 KB, L1I 16 KB; L2D 256 KB, L2I 1 MB; L3 9 MB; all caches private to the cores
• 1.7 billion transistors

Slide 17: Itanium 2 Dual Core
[Figure only]

Slide 18: Intel Core Duo
• 2 mobile-optimized execution cores, no multithreading
• Cache hierarchy: private 32 KB L1I and L1D; shared 2 MB L2 cache, providing efficient data sharing between both cores
• Power reduction: some sleep states are entered individually by each core; the Deeper and Enhanced Deeper Sleep states apply only to the whole die
• Dynamic Cache Sizing feature: flushes the entire cache, enabling Enhanced Deeper Sleep with a lower voltage that does not guarantee cache integrity
• 151 million transistors

Slide 19: IBM Cell
• IBM, Sony, Toshiba; PlayStation 3 (Q1 2006)
• 256 GFlops; only ~30 W at 3 GHz; the whole PS3 costs only $300-400
• http://www-128.ibm.com/developerworks/power/library/pa-cellperf

Slide 20: Cell: Architecture
• 9 parallel processors, specialized for different tasks: 1 large PPE and 8 SPEs (Synergistic Processing Elements)

Slide 21: Cell: SPE
• Synergistic Processing Element: 128 registers, 128-bit SIMD, single-threaded
• 256 KB local memory, not a cache; DMA units execute the memory transfers (see the sketch below)
• Simple ISA with less functionality to save space; the limitations can become a problem if memory access is too slow
• 25.6 GFlops single precision for multiply-add operations
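The SPE-side DMA mentioned on slide 21 is usually coded against the Cell SDK's spu_mfcio.h interface; a sketch of the common get-and-wait pattern (buffer name, chunk size, and tag choice are illustrative; builds with the SDK's spu-gcc):

```c
#include <spu_mfcio.h>   /* MFC (DMA) interface of the Cell SDK */

#define CHUNK 16384      /* transfer size in bytes, multiple of 16 */

/* DMA target in the 256 KB local store, 128-byte aligned */
static char buf[CHUNK] __attribute__((aligned(128)));

/* Pull one chunk from main memory (effective address ea) into the
 * local store, then block until the transfer has completed. */
void fetch_chunk(unsigned long long ea)
{
    unsigned int tag = 0;                /* DMA tag group (0..31) */
    mfc_get(buf, ea, CHUNK, tag, 0, 0);  /* start DMA: memory -> LS */
    mfc_write_tag_mask(1 << tag);        /* select this tag group  */
    mfc_read_tag_status_all();           /* wait for completion    */
}
```

Real code double-buffers: it starts the DMA for the next chunk before processing the current one, hiding the memory latency the slide warns about.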
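Back to the last bullet of slide 13: false sharing arises when logically independent variables share one cache line, so the line bounces between private caches; a shared cache avoids this by construction. A minimal OpenMP sketch of the classic padding workaround (the 64-byte line size is an assumption):

```c
#include <stdio.h>

/* Padding each counter to 64 bytes keeps the two counters in separate
 * cache lines; without the pad, both longs would share one line and
 * every increment would invalidate the other core's copy. */
struct counter { long count; char pad[64 - sizeof(long)]; };

int main(void)
{
    struct counter c[2] = {{0}, {0}};

    #pragma omp parallel for num_threads(2)
    for (int t = 0; t < 2; t++)              /* one counter per thread */
        for (long i = 0; i < 100000000L; i++)
            c[t].count++;

    printf("%ld %ld\n", c[0].count, c[1].count);
    return 0;
}
```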
Slide 22: Intel Westmere-EX
• The processor of the fat node of SuperMUC @ LRZ
• 2.4 GHz; 9.6 Gflop/s per core (2.4 GHz x 4 double-precision flops per cycle with 128-bit SSE), i.e. 96 Gflop/s per socket
• 10 hyperthreaded cores, i.e. two logical cores each
• Caches: 32 KB L1 private, 256 KB L2 private, 30 MB L3 shared
• 2.9 billion transistors
• Xeon E7-4870 (2.4 GHz, 10 cores, 30 MB L3)

Slide 23
[Figure only]

Slide 24: NUMA
• On-chip NUMA: the L3 cache is organized in 10 slices, interconnected by a bidirectional ring bus
  - 10-way physical address hashing to avoid hot spots; five parallel cache requests can be handled per clock cycle
  - the mapping algorithm is not known; no migration support
• Off-chip NUMA: glueless combination of up to 8 sockets into an SMP
  - 4 QuickPath Interconnect (QPI) interfaces
  - 2 on-chip memory controllers (a first-touch placement sketch follows the summary)

Slide 25
[Figure only]

Slide 26
[Figure only]

Slide 27: Cache Coherency
• Cbox: connects a core to the ring bus and one memory bank
  - responsible for processor reads/writes/writebacks and external snoops, and for returning cached data to the core and to the QuickPath agents
  - the distribution of physical addresses is determined by a hash function
• Sbox: caching agent, each one associated with 5 Cboxes

Slide 28: Cache Coherency
• Bbox: home agent
  - responsible for the cache coherency of the cache lines in its memory; keeps track of the Cbox replies to coherence messages
• Directory Assisted Snoopy (DAS): keeps state per cache line (I: idle, no remote sharers; R: may be present on a remote socket; E/D: owned by the I/O hub)
  - a line in the I state can be forwarded without waiting for snoop replies

Slide 29
[Figure only]

Slide 30
[Figure only]

Slide 31
[Figure only]

Slide 32: Summary
• High frequency -> high power consumption
• Trend towards multiple cores on a chip
• Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, ...
• Problem: memory latency and bandwidth
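To close, one sketch for slide 24's off-chip NUMA: Linux typically places a page on the NUMA node of the thread that touches it first, so initializing data with the same thread layout as the compute phase keeps each socket on its local memory controllers instead of crossing QPI. Both the first-touch policy and the helper below are illustrative assumptions, not content from the slides:

```c
#include <stdlib.h>

/* First-touch placement: every thread initializes (and thereby places)
 * the pages it will later compute on, so the compute phase reads and
 * writes local memory rather than a remote socket's. */
double *alloc_local(long n)
{
    double *v = malloc(n * sizeof *v);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        v[i] = 0.0;   /* the first write faults the page in locally */

    return v;
}
```

The compute loop must then use the same schedule(static) distribution so each thread revisits the pages it placed.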