high-performance processors’ design...
![Page 1: High-Performance Processors’ Design Choicesdocencia.ac.upc.edu/master/MIRI/PD/docs/03-HPCProc… · · 2013-10-10– SISD: Single instruction stream, single data stream • uniprocessor](https://reader031.vdocuments.mx/reader031/viewer/2022022423/5a9e71127f8b9a7f178b5ae0/html5/thumbnails/1.jpg)
High-Performance Processors’ Design Choices
Ramon Canal
PD, Fall 2013
1 Motivation
2 Multiprocessors
3 Multithreading
4 VLIW
Outline
• Motivation
• Multiprocessors
  – SISD, SIMD, MIMD, and MISD
  – Memory organization
  – Communication mechanisms
• Multithreading
• VLIW
Motivation
Instruction-Level Parallelism (ILP): what we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo’s algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches
Bottom line: there just aren’t enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory
So, what do we do?
Key idea: increase the number of running processes
– multiple processes at a given “point” in time
  • i.e., at the granularity of one (or a few) clock cycles
  • not sufficient to have multiple processes at the OS level!
Two approaches:
– multiple CPUs: each executing a distinct process
  • “Multiprocessors” or “Parallel Architectures”
– single CPU: executing multiple processes (“threads”)
  • “Multi-threading” or “Thread-level parallelism”
Taxonomy of Parallel Architectures
Flynn’s Classification:
– SISD: Single instruction stream, single data stream
  • uniprocessor
– SIMD: Single instruction stream, multiple data streams
  • same instruction executed by multiple processors
  • each has its own data memory
  • Ex: multimedia processors, vector architectures
– MISD: Multiple instruction streams, single data stream
  • successive functional units operate on the same stream of data
  • rarely found in general-purpose commercial designs
  • special-purpose stream processors (digital filters etc.)
– MIMD: Multiple instruction streams, multiple data streams
  • each processor has its own instruction and data streams
  • most popular form of parallel processing
    – single-user: high performance for one application
    – multiprogrammed: running many tasks simultaneously (e.g., servers)
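
As a rough illustration of the SISD/SIMD distinction in the taxonomy above, the following Python sketch counts how many instruction steps a vector addition needs when each step handles one element versus several lanes at once. The function names and the 4-lane width are assumptions made for this example; real SIMD hardware executes the lanes in parallel, which this sequential model only counts.

```python
def sisd_add(a, b):
    """One instruction stream, one data element per instruction."""
    result, steps = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        steps += 1  # one add instruction per element
    return result, steps


def simd_add(a, b, lanes=4):
    """One instruction stream; each instruction covers `lanes` elements.

    `lanes` is a hypothetical vector width chosen for illustration.
    """
    result, steps = [], 0
    for i in range(0, len(a), lanes):
        # One modelled vector instruction operates on a whole slice.
        result.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return result, steps


if __name__ == "__main__":
    a, b = list(range(8)), list(range(8, 16))
    print("SISD steps:", sisd_add(a, b)[1])
    print("SIMD steps:", simd_add(a, b)[1])
```

Both functions return the same sums; only the number of instruction steps differs.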
Multiprocessor: Memory Organization
Centralized, shared-memory multiprocessor:
– usually few processors
– share single memory & bus
– use large caches
Multiprocessor: Memory Organization
Distributed-memory multiprocessor:
– can support large processor counts
  • cost-effective way to scale memory bandwidth
  • works well if most accesses are to the local memory node
– requires an interconnection network
  • communication between processors becomes more complicated and slower
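
The remark that distributed memory works well when most accesses are local can be made concrete with a simple weighted average. The sketch below is only a back-of-the-envelope model; the function name and the 100 ns / 400 ns latencies are hypothetical values chosen for illustration, not figures from the slides.

```python
def average_access_time(local_fraction, t_local_ns, t_remote_ns):
    """Mean memory access time when a fraction of accesses hit the local node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * t_local_ns + (1.0 - local_fraction) * t_remote_ns


if __name__ == "__main__":
    # Hypothetical latencies: 100 ns local, 400 ns across the interconnect.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(average_access_time(fraction, 100, 400), 1))
```

As the local fraction approaches 1, the average approaches the local latency, which is the case the slide describes as working well.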
Communication Mechanisms
• Shared-Memory Communication
  – around for a long time, so well understood and standardized
    • memory-mapped
  – ease of programming when communication patterns are complex or dynamically varying
  – better use of bandwidth when items are small
  – Problem: cache coherence is harder
    • use “snoopy” and other protocols
• Message-Passing Communication (e.g., Intel’s Knight… family)
  – simpler hardware because keeping caches coherent is easier
  – communication is explicit, simpler to understand
    • focuses programmer attention on communication
  – synchronization: naturally associated with communication
    • fewer errors due to incorrect synchronization
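
The two communication styles above can be contrasted with a small software analogy. This is a sketch using Python threads, not a model of any hardware: the function names, the worker count, and the use of `threading.Lock` and `queue.Queue` are choices made for this example. In the shared-memory version the programmer must add the synchronization; in the message-passing version the receive itself synchronizes.

```python
import queue
import threading


def shared_memory_sum(values, n_workers=4):
    """Workers communicate implicitly by updating one shared variable."""
    total = {"sum": 0}
    lock = threading.Lock()  # synchronization must be added explicitly

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # without the lock, concurrent updates could be lost
            total["sum"] += partial

    threads = [threading.Thread(target=worker, args=(values[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["sum"]


def message_passing_sum(values, n_workers=4):
    """Workers communicate explicitly by sending their result as a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the send doubles as synchronization

    threads = [threading.Thread(target=worker, args=(values[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    # One blocking receive per worker; no shared mutable state is touched.
    result = sum(mailbox.get() for _ in range(n_workers))
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    data = list(range(1000))
    print(shared_memory_sum(data), message_passing_sum(data))
```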
Multiprocessor: Hybrid Organization
• Use distributed-memory organization at top level
• Each node itself may be a shared-memory multiprocessor (2-8 processors)
• What about Big Data? Is it a “game changer”?
  – Next slides based on the following works:
    • M. Ferdman et al., “Clearing the Clouds”, ASPLOS’12
    • P. Lotfi-Kamran et al., “Scale-Out Processors”, ISCA’12
    • B. Grot et al., “Optimizing Datacenter TCO with Scale-Out Processors”, IEEE MICRO 2012
  – Next couple of slides © Prof. Babak Falsafi (EPFL)
Multiprocessors and Big Data
Scale-out Processors
• Small LLC, just large enough to capture instructions
• More cores for higher throughput
• “Pods” for small distance to memory
Performance
• At iso server power (20 MW)
Summary: Multiprocessors
• Need to tailor chip design to applications
  – Big Data applications are too big for data caches. The best solution is to eliminate them.
  – Big Data applications need coarse-grain parallelism (i.e., at the request level)
  – Single-thread performance is still important for other applications (e.g., computation-intensive ones)
Multithreading
Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code
Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
  • switch between threads on each instruction!
– coarse-grain multithreading
  • switch to a different thread only if the current thread has a costly stall
    – e.g., switch only on a level-2 cache miss
Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
  – One way is to interleave execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
  T1: LW r1, 0(r2)
  T2: ADD r7, r1, r4
  T3: XORI r5, r4, #12
  T4: SW 0(r7), r5
  T1: LW r5, 12(r1)
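
The interleaving argument can be checked with a small model. The sketch below assumes a classic IF/ID/EX/MEM/WB pipeline in which registers are read in ID and written in WB, and it conservatively requires the read to happen in a strictly later cycle than the write; these stage names and that rule are assumptions of the example, not details given on the slide. Under these assumptions, a fixed round-robin over four threads spaces consecutive instructions of one thread far enough apart that the second T1 load sees the r1 written by the first.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed 5-stage, non-bypassed pipe


def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction fetched at `issue_cycle` is in `stage`."""
    return issue_cycle + STAGES.index(stage)


def interleave(n_threads, instrs_per_thread):
    """Fixed round-robin schedule: (issue_cycle, thread_id, instr_index)."""
    schedule, cycle = [], 0
    for i in range(instrs_per_thread):
        for t in range(n_threads):
            schedule.append((cycle, t, i))
            cycle += 1
    return schedule


def is_hazard_free(n_threads, instrs_per_thread=3):
    """True if each instruction reads registers strictly after the previous
    instruction of the same thread has written back (worst case: every
    instruction depends on its predecessor in the same thread)."""
    last_issue = {}
    for cycle, thread, _ in interleave(n_threads, instrs_per_thread):
        if thread in last_issue:
            producer_wb = stage_cycle(last_issue[thread], "WB")
            consumer_read = stage_cycle(cycle, "ID")
            if consumer_read <= producer_wb:
                return False
        last_issue[thread] = cycle
    return True


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "threads ->", "no hazards" if is_hazard_free(n) else "hazard")
```

With a register file that writes in the first half of a cycle and reads in the second, three threads would also suffice; the strict rule used here is the more conservative choice.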
Simple Multithreaded Pipeline
• The thread select has to be carried down the pipeline to ensure the correct state bits are read/written at each pipe stage
Multithreading
Fine-grain multithreading
– switch between threads on each instruction!
– multiple threads executed in interleaved manner
– interleaving is usually round-robin
– CPU must be capable of switching threads on every cycle!
  • fast, frequent switches
– main disadvantage:
  • slows down the execution of individual threads
  • that is, latency is traded off for better throughput
CDC 6600 Peripheral Processors (Cray, 1965)
• First multithreaded hardware
• 10 “virtual” I/O processors
• fixed interleave on simple pipeline
• pipeline has 100ns cycle time
• each processor executes one instruction every 1000ns
• accumulator-based instruction set to reduce processor state
Denelcor HEP (Burton Smith, 1982)
• First commercial machine to use hardware threading in main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – Up to 8 processors
  – precursor to Tera MTA (Multithreaded Architecture)
Tera MTA (Cray, 1997)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per processor
• 50W/processor @ 260MHz
Tera MTA (Cray)
• Each processor supports 128 active hardware threads
  – 128 SSWs, 1024 target registers, 4096 general-purpose registers
• Every cycle, one instruction from one active thread is launched into the pipeline
• Instruction pipeline is 21 cycles long
• At best, a single thread can issue one instruction every 21 cycles
  – Clock rate is 260MHz, effective single thread issue rate is 260/21 = 12.4MHz
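
The arithmetic on this slide can be reproduced in a few lines. The sketch below assumes, as the slide implies, that a thread has at most one instruction in the 21-cycle pipeline at a time and that every active thread always has an instruction ready; the function names are made up for this example.

```python
def single_thread_issue_rate_mhz(clock_mhz, pipeline_depth):
    """Best-case issue rate of one thread: one instruction per pipeline pass."""
    return clock_mhz / pipeline_depth


def pipeline_utilization(active_threads, pipeline_depth):
    """Fraction of issue cycles that can be filled, assuming each thread
    contributes at most one in-flight instruction and is always ready."""
    return min(1.0, active_threads / pipeline_depth)


if __name__ == "__main__":
    print(round(single_thread_issue_rate_mhz(260, 21), 1), "MHz per thread")
    for threads in (1, 7, 21, 128):
        print(threads, "threads ->", round(pipeline_utilization(threads, 21), 2))
```

Under these assumptions at least 21 ready threads are needed to launch an instruction every cycle, which is why the machine provides far more hardware thread contexts than that.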
Multithreading
Coarse-grain multithreading
– switch only if current thread has a costly stall
  • e.g., level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
  • a thread is switched “off” only when it has a costly stall
– main disadvantage:
  • limited in ability to overcome throughput losses
    – shorter stalls are ignored, and there may be plenty of those
  • issues instructions from a single thread
    – every switch involves emptying and restarting the instruction pipeline
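
The trade-off described above can be sketched as a simple cost model. It assumes that another thread is always ready to run and that a switch hides the whole stall at the price of a fixed flush penalty; the function names, the switch threshold, and the example stall lengths are hypothetical.

```python
def lost_cycles(stall_cycles, switch_threshold, flush_penalty):
    """Cycles without useful work caused by a single stall."""
    if stall_cycles >= switch_threshold:
        # Costly stall: switch threads and pay only the pipeline flush/refill.
        return flush_penalty
    # Short stall: the policy ignores it, so the pipeline simply waits.
    return stall_cycles


def total_lost_cycles(stalls, switch_threshold, flush_penalty):
    """Sum of lost cycles over a sequence of stall lengths."""
    return sum(lost_cycles(s, switch_threshold, flush_penalty) for s in stalls)


if __name__ == "__main__":
    stalls = [2, 3, 2, 200, 3, 2]  # hypothetical: one long miss, several short stalls
    print("no multithreading:", total_lost_cycles(stalls, float("inf"), 0))
    print("coarse-grain     :", total_lost_cycles(stalls, 50, 4))
```

In this model the long stall is hidden almost entirely, while the short stalls are paid in full, which is the throughput limitation the slide points out.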
IBM PowerPC RS64-III (Pulsar)
• Commercial coarse-grain multithreading CPU
• Based on PowerPC with quad-issue in-order five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On L2 cache miss, pipeline is flushed and execution switches to second thread
  – short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  – flush pipeline to simplify exception handling
Simultaneous Multithreading (SMT)
Key idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors:
  • multiple functional units
    – modern processors typically have more functional units available than a single thread can utilize
  • register renaming and dynamic scheduling
    – multiple instructions from independent threads can co-exist and co-execute!
Multithreading: Illustration
(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)
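
The four cases (a) to (d) can be compared with a toy issue-slot model. In the sketch below, `ready[t][c]` is a hypothetical table giving how many instructions thread `t` could issue in cycle `c`; the table is indexed by time rather than by each thread's progress, and the coarse-grain switch costs exactly one cycle, so this is only an approximation of the figure, not a pipeline simulator.

```python
def superscalar(ready, width):
    """(a) Only thread 0 runs; unused slots stay empty."""
    return sum(min(r, width) for r in ready[0])


def coarse_grain(ready, width):
    """(b) Stay on one thread until it stalls, then switch (that cycle is lost)."""
    n, current, used = len(ready), 0, 0
    for c in range(len(ready[0])):
        if ready[current][c] == 0:
            current = (current + 1) % n
            continue
        used += min(ready[current][c], width)
    return used


def fine_grain(ready, width):
    """(c) Round-robin: a different thread owns all slots in each cycle."""
    n = len(ready)
    return sum(min(ready[c % n][c], width) for c in range(len(ready[0])))


def smt(ready, width):
    """(d) Slots in one cycle may be filled from several threads."""
    cycles = len(ready[0])
    return sum(min(sum(thread[c] for thread in ready), width) for c in range(cycles))


if __name__ == "__main__":
    ready = [[2, 1, 0, 0, 0, 0, 2, 1],   # thread 0 has a long stall
             [1, 2, 1, 2, 1, 2, 1, 2]]   # thread 1 has low ILP
    for name, fn in [("superscalar", superscalar), ("coarse-grain", coarse_grain),
                     ("fine-grain", fine_grain), ("SMT", smt)]:
        print(name, fn(ready, 4), "of", 4 * len(ready[0]), "slots used")
```

For this particular table SMT fills the most slots, because it is the only scheme that can combine instructions from both threads within a single cycle; the relative order of the other schemes depends on the data.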
From Superscalar to SMT
• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads
Simultaneous Multithreaded Processor
• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  – [Tullsen, Eggers, Levy, University of Washington, 1995]
• OOO instruction window already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine
• First examples:
  – Alpha 21464 (DEC/Compaq)
  – Pentium IV (Intel)
  – Power 5 (IBM)
  – Ultrasparc IV (Sun)
SMT: Design Challenges
• Dealing with a large register file
  – needed to hold multiple contexts
• Maintaining low overhead on clock cycle
  – fast instruction issue: choosing what to issue
  – instruction commit: choosing what to commit
  – keeping cache conflicts within acceptable bounds
• Power hungry!
Intel Pentium-4 Processor
• Hyperthreading = SMT
• Dual physical processors, each 2-way SMT
• Logical processors share nearly all resources of the physical processor
  – caches, execution units, branch predictors
• Die area overhead of hyperthreading ~5%
• When one logical processor is stalled, the other can make progress
  – no logical processor can use all entries in queues when two threads are active
• A processor running only one active software thread runs at the same speed with or without hyperthreading
Pentium 4 Micro-architecture
[Figure: block diagram with the labels 400 MHz System Bus; Rapid Execution Engine; Execution Trace Cache; Hyper Pipelined Technology; Advanced Transfer Cache; Advanced Dynamic Execution; Streaming SIMD Extensions 2; Enhanced Floating Point / Multi-Media]
Pentium 4 Micro-architecture
[Figure: block diagram highlighting Hyper Pipelined Technology and Advanced Dynamic Execution]
What hardware complexity do OoO and SMT incur?
Sun/Oracle Ultrasparc T5 (2013)
• 16 cores
• 3.6 GHz
• 8 threads/core (128 threads/chip)
• X Core:
  – 2-way OoO
  – 16 KB I$
  – 16 KB D$
  – 128 KB L2
  – 8 MB L3
• 28nm
IBM Power 7
VLIW
• Very Long Instruction Word:
  – Compiler packs a fixed number of operations into a single VLIW “instruction”.
  – The operations within a VLIW instruction are issued and executed in parallel.
  – Examples:
    • High-end signal processors (TMS320C6201)
    • Intel’s Itanium
    • Transmeta Crusoe, Efficeon
VLIW
• VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.
• All operations specified within a VLIW instruction must be independent of one another.
• Some of the key characteristics of a (V)LIW processor:
  – (very) long instruction word (up to 1024 bits per instruction),
  – each instruction consists of multiple independent parallel operations,
  – each operation requires a statically known number of cycles to complete,
  – a central controller that issues a long instruction word every cycle,
  – multiple FUs connected through a global shared register file.
VLIW and Superscalar
• sequential stream of long instruction words
• instructions scheduled statically by the compiler
• number of simultaneously issued instructions is fixed at compile time
• instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events, e.g. cache misses, with the same flexibility as superscalars.
• The number of instructions in a VLIW instruction word is usually fixed.
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met. This increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed.
• VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
• VLIW processors take advantage of spatial parallelism.
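
The no-op padding mentioned above can be illustrated with a minimal static scheduler. This is a sketch under simplifying assumptions, not a real VLIW compiler pass: every operation is assumed to take one cycle, any slot can hold any operation, only true data dependences are tracked, and the names `pack_vliw` and `nop_fraction` are invented for the example.

```python
def pack_vliw(ops, deps, width):
    """Greedily place operations into fixed-width instruction words.

    ops   : operation names in program order
    deps  : dict mapping an operation to the earlier operations it depends on
    width : number of operation slots per long instruction word
    """
    bundles, bundle_of = [], {}
    for op in ops:
        # An operation must come after every word that produces its inputs.
        earliest = 0
        for producer in deps.get(op, ()):
            earliest = max(earliest, bundle_of[producer] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1  # that word is full, try the next one
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(op)
        bundle_of[op] = b
    # Pad every word to the full width with explicit no-ops.
    return [b + ["nop"] * (width - len(b)) for b in bundles]


def nop_fraction(bundles):
    """Share of slots wasted on padding."""
    slots = sum(len(b) for b in bundles)
    return sum(b.count("nop") for b in bundles) / slots


if __name__ == "__main__":
    ops = ["a", "b", "c", "d", "e"]
    deps = {"c": {"a", "b"}, "d": {"c"}}  # c needs a and b; d needs c
    for word in pack_vliw(ops, deps, width=3):
        print(word)
```

Because `c` and `d` form a dependence chain, two of the three words carry a single useful operation and the rest of their slots are padding, which is the code-size cost the slide refers to.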
VLIW and Superscalar
• Superscalar RISC solution
– Based on sequential execution semantics
– Compiler’s role is limited by the instruction set architecture
– Superscalar hardware identifies and exploits parallelism
• VLIW solution
– Based on parallel execution semantics
– VLIW ISA enhancements support static parallelization
– Compiler takes greater responsibility for exploiting parallelism
– Compiler / hardware collaboration often resembles superscalar
VLIW and Superscalar
• Advantages of pursuing VLIW architectures
– Make wide issue & deep latency less expensive in hardware
– Allow processor parallelism to scale with additional VLSI density
• Architect the processor to do well with in-order execution
– Enhance the ISA to allow static parallelization
– Use compiler technology to parallelize program
• Loop Unrolling, Software Pipelining, ...
– However, a purely static VLIW is not appropriate for general-purpose use
Examples
• Intel Itanium
• Transmeta Crusoe
• Almost all DSPs
– Texas Instruments
– ST Microelectronics
Intel Itanium, Itanium 2
IA-64 Encoding
Source: Intel/HP IA-64 Application ISA Guide 1.0
IA-64 Templates
Source: Intel/HP IA-64 Application ISA Guide 1.0
Intel's IA-64 ISA
• Intel 64-bit Architecture (IA-64) register model:
  – 128 64-bit general-purpose registers GR0-GR127 to hold values for integer and multimedia computations
    • each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid
  – 128 82-bit floating-point registers FR0-FR127
    • registers f0 and f1 are read-only with values +0.0 and +1.0
  – 64 1-bit predicate registers PR0-PR63
    • the first register p0 is read-only and always reads 1 (true)
  – 8 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
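
The register model above can be written down as a small data structure. The class below is only an illustrative sketch: the class and method names are invented, Python floats stand in for the 82-bit floating-point format, and raising `ValueError` on a write to a read-only register is a modelling choice, not a description of what the hardware does. The read-only GR0 (always zero) is a property of the real ISA that is not stated on the slide.

```python
class IA64Registers:
    """Toy model of the IA-64 application register files listed above."""

    MASK64 = (1 << 64) - 1

    def __init__(self):
        self.gr = [0] * 128       # GR0-GR127, 64 bits each
        self.nat = [False] * 128  # one NaT bit per general register
        self.fr = [0.0] * 128     # FR0-FR127 (82-bit in hardware)
        self.fr[1] = 1.0          # f0 = +0.0 and f1 = +1.0 are read-only
        self.pr = [False] * 64    # PR0-PR63, 1 bit each
        self.pr[0] = True         # p0 always reads 1 (true)
        self.br = [0] * 8         # BR0-BR7, indirect branch targets

    def write_gr(self, index, value, nat=False):
        if index == 0:
            raise ValueError("GR0 is read-only (reads as 0 in the real ISA)")
        self.gr[index] = value & self.MASK64
        self.nat[index] = nat     # NaT set means the value is not valid

    def write_fr(self, index, value):
        if index in (0, 1):
            raise ValueError("f0 and f1 are read-only")
        self.fr[index] = float(value)

    def write_pr(self, index, value):
        if index == 0:
            raise ValueError("p0 is read-only")
        self.pr[index] = bool(value)

    def write_br(self, index, target):
        self.br[index] = target & self.MASK64


if __name__ == "__main__":
    regs = IA64Registers()
    regs.write_gr(5, 42)
    regs.write_pr(3, True)
    print(regs.gr[5], regs.nat[5], regs.pr[0], regs.pr[3], regs.fr[1])
```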
Transmeta Crusoe and Efficeon
Overview
• HW/SW system for executing x86 code
  – VLIW processor
  – Code Morphing Software
• Underlying ISA and details invisible
  – convenient level of indirection
  – upgrades, fixes, freedom for changes
    • as long as new CMS is implemented
  – anything else?
VLIW CPU
• Simple
  – in-order, very few interlocks
  – TM5400: 7 million transistors, 7-stage pipeline
  – low power, easier (and cheaper) to design
• TM5800
  – <=1GHz, 64KB L1, 512KB L2
  – 0.5-15W @ 300-1000MHz, 0.8-1.3V running a typical multimedia application
Crusoe vs. PIII mobile (temperature)
VLIW CPU
• RISC-like ISA
  – molecule (long instruction)
    • 2 or 4 atoms (RISC-like instructions)
    • slot distribution?
• 64 GPRs and 32 FPRs
  – dedicated registers for x86 architectural registers
[Figure: a 128-bit molecule with atom slots FADD, ADD, LD, BRCC; functional units: floating-point unit, INT unit 1, INT unit 2, load/store unit, branch unit]
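
The molecule/atom structure can be sketched as simple bit packing. The code below assumes equal-width 32-bit atoms (so that four atoms fill the 128-bit molecule shown in the figure) and places slot 0 in the least significant bits; the slot order, the function names, and the example atom values are assumptions of this sketch, not the actual Crusoe encoding.

```python
ATOM_BITS = 32  # assumed: a 128-bit molecule divided into four equal atoms


def pack_molecule(atoms):
    """Pack 2 or 4 atom encodings into one integer, slot 0 in the low bits."""
    if len(atoms) not in (2, 4):
        raise ValueError("a molecule holds 2 or 4 atoms")
    word = 0
    for slot, atom in enumerate(atoms):
        if not 0 <= atom < (1 << ATOM_BITS):
            raise ValueError("atom does not fit in %d bits" % ATOM_BITS)
        word |= atom << (slot * ATOM_BITS)
    return word


def unpack_molecule(word, n_atoms):
    """Recover the atom encodings from a packed molecule."""
    mask = (1 << ATOM_BITS) - 1
    return [(word >> (slot * ATOM_BITS)) & mask for slot in range(n_atoms)]


if __name__ == "__main__":
    # Hypothetical encodings standing in for FADD, ADD, LD and BRCC atoms.
    atoms = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
    molecule = pack_molecule(atoms)
    print(hex(molecule), molecule.bit_length() <= 128)
    print([hex(a) for a in unpack_molecule(molecule, 4)])
```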
Conclusions
• VLIW
  – Reduces hardware complexity at the cost of increasing compiler complexity
  – Good for DSPs
  – Not so good for GPPs (so far?)
Conclusions
• Multiprocessors
  – Conventional superscalars are reaching ILP’s limits → exploit TLP or PLP
  – Already known technology
• Multithreading
  – Good for extensive use of superscalar cores
  – More efficient than MP but more complex too