
Simultaneous Multithreading: Maximising On-Chip Parallelism

Dean Tullsen, Susan Eggers, Henry Levy
Department of Computer Science, University of Washington, Seattle
Proceedings of ISCA '95, Italy

Presented by: Amit Gaur

Overview

• Instruction Level Parallelism vs. Thread Level Parallelism

• Motivation

• Simulation Environment and Workload

• Simultaneous Multithreading Models

• Performance Analysis

• Extensions in Design

• Single Chip Multiprocessing

• Summary

• Current Implementations

• Retrospective

Instruction Level Parallelism

• Superscalar processors

• Shortcomings (a small dependence-chain example follows):
  a) instruction dependencies
  b) long latencies within a single thread
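
To make the dependence shortcoming concrete, here is a minimal example (assumed for illustration, not taken from the paper): each step needs the previous result, so a wide-issue machine cannot fill its slots from this one thread.

```python
# A serial dependence chain: every multiply-add needs the x produced by the
# previous iteration, so even an 8-issue superscalar finds almost no
# instruction-level parallelism in this single thread.
x = 1.0
for _ in range(1000):
    x = x * 1.000001 + 3.0   # depends on the previous iteration's x
print(x)
```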

Thread Level Parallelism

• Traditional multithreaded architectures

• Exploit parallelism at the application level

• Multiple threads: inherent parallelism

• Attack vertical waste: memory and functional-unit latencies

• Examples: server applications, online transaction processing, web services

Need for Simultaneous Multithreading

• Attack vertical as well as horizontal waste

• Fetch instructions from multiple threads each cycle

• Exploit all parallelism: full utilization of execution resources

• Decrease in wasted issue slots

• Comparison with a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessors

Simulation Environment

• Emulation-based, instruction-level simulation

• Modeled on the Alpha AXP 21164, extended for wide superscalar and multithreaded execution

• Support for increased single-stream parallelism: more flexible instruction issue, improved branch prediction, and larger, higher-bandwidth caches

• Code generated using the Multiflow trace-scheduling compiler (static scheduling)

Simulation Environment (Continued)

• 10 functional units (4 integer, 2 floating point, 3 load/store, 1 branch)

• All units pipelined

• In-order issue of dependence-free instructions, with an 8-instruction per-thread scheduling window

• L1 and L2 caches are on-chip

• 2048-entry, 2-bit branch prediction history table (sketched below)

• Support for up to 8 hardware contexts
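
As a concrete illustration of the branch-prediction hardware listed above, here is a minimal sketch of a 2048-entry table of 2-bit saturating counters. Indexing by low-order PC bits and the "weakly not-taken" initial state are assumptions made for the sketch, not details taken from the simulator.

```python
# Sketch of a 2048-entry, 2-bit saturating-counter branch prediction table.
TABLE_SIZE = 2048
counters = [1] * TABLE_SIZE   # 0-1 predict not-taken, 2-3 predict taken

def predict(pc: int) -> bool:
    """Predict taken when this branch's counter is in its upper half."""
    return counters[pc % TABLE_SIZE] >= 2

def update(pc: int, taken: bool) -> None:
    """Nudge the saturating counter toward the observed outcome."""
    i = pc % TABLE_SIZE
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)
```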

Workload Specifications

• SPEC92 benchmark suite simulated

• To obtain TLP, a distinct program is allocated to each thread: a parallel workload based on multiprogramming

• For each benchmark, the executable with the lowest single-thread execution time is used

Limitations of Superscalar Processors

Superscalar Performance Degradation

• Wasted issue slots stem from a number of overlapping causes

• Completely eliminating any single cause would not yield a large performance increase

• 61% of the waste is vertical and 39% is horizontal (see the sketch below)

• Tackle both with simultaneous multithreading
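
A back-of-the-envelope sketch of how vertical and horizontal waste are tallied from a per-cycle issue trace. The trace below is invented for illustration; the 61%/39% split quoted above is the paper's measured breakdown for the wide superscalar.

```python
# Classify wasted issue slots on an 8-wide machine: a cycle that issues
# nothing contributes vertical waste; a partially filled cycle contributes
# horizontal waste.
ISSUE_WIDTH = 8

def waste_breakdown(issued_per_cycle):
    vertical = horizontal = 0
    for n in issued_per_cycle:
        if n == 0:
            vertical += ISSUE_WIDTH        # completely empty cycle
        else:
            horizontal += ISSUE_WIDTH - n  # leftover slots in a busy cycle
    total = vertical + horizontal
    return vertical / total, horizontal / total

v, h = waste_breakdown([3, 0, 1, 0, 2, 4])   # hypothetical 6-cycle trace
print(f"vertical waste {v:.0%}, horizontal waste {h:.0%}")
```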

Simultaneous Multithreading Models

• Fine-Grain Multithreading: only one thread issues instructions in each cycle

• SM: Full Simultaneous Issue: all eight threads compete for every issue slot, every cycle => maximum flexibility

• SM: Single Issue, SM: Dual Issue, SM: Four Issue: limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle

• SM: Limited Connection: each hardware context is connected to exactly one functional unit of each type => the least dynamic of the models. (A simplified sketch of these issue policies follows.)
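
The difference between the models can be summarized in a few lines of Python. This is a simplified sketch under assumed data structures (a per-thread list of ready instructions), not the paper's issue logic; the Limited Connection model would additionally restrict which functional unit each context's instructions may use.

```python
ISSUE_SLOTS = 8   # 8-issue machine, as in the simulated processor

def issue_fine_grain(ready, cycle):
    # Fine-grain multithreading: one thread owns all the issue slots each cycle.
    t = cycle % len(ready)
    return ready[t][:ISSUE_SLOTS]

def issue_sm(ready, per_thread_limit=None):
    # SM models: threads compete for the slots each cycle. A per-thread cap of
    # 1, 2 or 4 gives the Single/Dual/Four Issue models; no cap approximates
    # Full Simultaneous Issue.
    issued = []
    for window in ready:
        candidates = window if per_thread_limit is None else window[:per_thread_limit]
        for insn in candidates:
            if len(issued) == ISSUE_SLOTS:
                return issued
            issued.append(insn)
    return issued

# Example: 8 threads, each with a few ready instructions this cycle.
ready = [[f"t{t}i{i}" for i in range(3)] for t in range(8)]
print(issue_fine_grain(ready, cycle=5))      # only thread 5 issues
print(issue_sm(ready, per_thread_limit=1))   # SM: Single Issue fills 8 slots from 8 threads
```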

Hardware Complexities of Models

Design Challenges in SMT Processors

• Issue-slot usage is limited by imbalances between resource needs and resource availability

• Also limited by the number of active threads, buffer sizes, and the instruction mix available from multiple threads

• Hardware complexity: superscalar issue must be implemented alongside thread-level parallelism

• Giving priority to particular threads can reduce throughput, since the pipeline is then less likely to contain an instruction mix from different threads

• Mixing many threads also compromises the performance of individual threads

• Tradeoff: a small number of active threads, and an even smaller number of preferred threads

From Superscalar to SMT

• SMT is an out-of-order superscalar extended with hardware to support multiple threads

• Multiple-thread support (sketched below):
  a) per-thread program counters
  b) per-thread return stacks
  c) per-thread bookkeeping for instruction retirement, traps, and instruction dispatch from the prefetch queue
  d) thread identifiers, e.g. on BTB and TLB entries

• Should SMT processors speculate? Determine the role of instruction speculation in SMT.
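
A sketch of the per-thread state listed in (a)-(d). The field names are illustrative only, not the actual structure names in the SMT design.

```python
from dataclasses import dataclass, field

@dataclass
class HardwareContext:
    """Per-thread state an SMT processor replicates; names are illustrative."""
    thread_id: int                                        # tags shared BTB/TLB entries
    pc: int = 0                                           # per-thread program counter
    return_stack: list = field(default_factory=list)      # per-thread return-address stack
    retirement: list = field(default_factory=list)        # per-thread retirement/trap bookkeeping
    prefetch_queue: list = field(default_factory=list)    # instructions awaiting dispatch

contexts = [HardwareContext(thread_id=t) for t in range(8)]   # up to 8 hardware contexts
```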

Instruction Speculation

• Speculation executes ‘probable’ instructions to hide branch latencies

• The processor fetches down the path chosen by a hardware branch predictor

• Correct prediction - Keep going

• Incorrect prediction - Rollback

• SMT has 2 ways to deal with branch delay stalls

a) Speculation

b) Fetch/Issue from other threads

• SMT and speculation: speculation can be wasteful on SMT, since one thread's speculative instructions can compete with, and displace, another thread's non-speculative instructions (an illustrative fetch-choice sketch follows)
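
The tradeoff on this slide can be phrased as a fetch-choice heuristic: when a thread hits an unresolved branch, the front end can either fetch down the predicted path (speculate) or give the fetch slot to a thread that is not blocked. The policy below, which simply prefers non-speculative threads, is an invented illustration, not the paper's fetch mechanism.

```python
def choose_fetch_thread(threads):
    """threads: list of {'id': int, 'stalled_on_branch': bool}."""
    runnable = [t for t in threads if not t["stalled_on_branch"]]
    if runnable:
        return runnable[0]["id"]   # fetch non-speculative work from another thread
    return threads[0]["id"]        # every thread is blocked: speculate past the branch

print(choose_fetch_thread([{"id": 0, "stalled_on_branch": True},
                           {"id": 1, "stalled_on_branch": False}]))   # -> 1
```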

Performance Evaluation of SMT

Performance Evaluation (Continued)

• Fine-grain MT: maximum speedup is 2.1; no further gain from vertical-waste reduction beyond 4 threads

• SMT models: speedups range from 3.5 to 4.2, with the issue rate reaching 6.3 IPC (see the arithmetic below)

• The four-issue model gets nearly the same performance as full simultaneous issue; dual issue is at 94% of full issue with 8 threads

• As the ratio of threads to issue slots increases, the performance of these models improves

• Tradeoff between the number of hardware contexts and hardware complexity

• Competition for shared resources has an adverse effect: the lowest-priority thread runs slowest

• More strain on the caches due to reduced locality: I-cache and D-cache misses increase

• Overall instruction throughput still increases
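
The speedup figures above are instruction throughput relative to the single-threaded wide superscalar. The baseline IPC below (about 1.5) is simply implied by the numbers on this slide (6.3 IPC at a 4.2x speedup); it is not quoted separately here.

```python
# Speedup = SMT throughput / single-threaded superscalar throughput.
smt_ipc = 6.3                      # full simultaneous issue at 8 threads
speedup = 4.2                      # reported speedup for the same point
baseline_ipc = smt_ipc / speedup   # implied single-thread baseline, about 1.5 IPC

print(f"baseline = {baseline_ipc:.2f} IPC, speedup = {smt_ipc / baseline_ipc:.1f}x")
```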

Extensions: Alternative Cache Designs for SMT

• Comparison of private per-thread L1 caches with shared caches, for both instructions and data

• Shared caches are optimized for a small number of threads (each thread can use the full capacity)

• A shared D-cache outperforms private D-caches in all configurations

• Private I-caches perform better at higher thread counts

Speculation in SMT

SMT vs. Single chip Multiprocessing

• Similarities: use of multiple register sets, multiple functional units, and the need for high issue bandwidth on a single chip

• Differences: the multiprocessor statically partitions its resources, while the SM processor allows resource allocation to change every cycle (see the sketch below)

• The same memory configuration is used when testing both:
  a) 8 KB private I-cache and D-cache
  b) 256 KB, 4-way set-associative L2 cache
  c) 2 MB direct-mapped L3 cache

• The tests were deliberately biased in favor of the MP
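
To make the static-vs-dynamic distinction concrete, here is a toy sketch (demand numbers invented) of 8 issue slots split 4+4 between two MP cores versus pooled on an SMT processor.

```python
# Toy model: per-cycle issue with statically split slots (MP) vs. a shared
# pool (SMT). `demand` is the number of ready instructions per core/thread.

def mp_issue(demand, slots_per_core=4):
    # Each core is limited to its own slots, even if its neighbor is idle.
    return sum(min(d, slots_per_core) for d in demand)

def smt_issue(demand, total_slots=8):
    # Slots are reallocated among the threads every cycle.
    return min(sum(demand), total_slots)

demand = [7, 1]   # one busy thread, one nearly idle
print("MP issues", mp_issue(demand), "instructions; SMT issues", smt_issue(demand))
# -> MP issues 5 instructions; SMT issues 8
```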

Test Results

Test Results (Continued)

• Tests A, B, C: a high ratio of functional units and threads to issue bandwidth gives greater opportunity to utilize the issue bandwidth

• Test D repeats Test A, but the SMT processor has 10 FUs; it still outperforms the multiprocessor

• Tests E and F: the MP is allowed greater issue bandwidth; even then, the SMT processor performs better

• Test G: both have 8 FUs and can issue 8 instructions per cycle, but the SMT processor has 8 contexts while the multiprocessor has 2 processors (2 register sets); the SMT processor delivers 2.5 times the performance

Summary

• Simultaneous multithreading combines the facilities of superscalar and multithreaded architectures

• It can boost resource utilization by dynamically scheduling functional units among multiple threads

• Several SMT models were compared against a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures

• The simulation results show that:
  a) a properly configured simultaneous multithreaded architecture can achieve 4 times the instruction throughput of a single-threaded wide superscalar with the same issue width
  b) simultaneous multithreading outperforms fine-grain multithreading by a factor of 2
  c) a simultaneous multithreading processor outperforms a multiple-issue multiprocessor given the same hardware resources

Commercial Machines

• MemoryLogix: an SMT processor for mobile devices

• Sun Microsystems has announced a CMP with four SMT processors

• Hyper-Threading Technology (Intel® Xeon® architecture)

• Clearwater Networks, a Los Gatos-based startup, was building an 8-context SMT network processor

• Compaq Computer Corp. designed a 4-context SMT processor, the Alpha 21464 (EV-8)

In Retrospect

• The design of the SMT architecture was influenced by earlier projects such as the Tera, MIT Alewife, and the M-Machine

• SMT differed from those projects in that it addressed a more complete and clearly defined goal than the previous designs

• The idea was to exploit thread-level parallelism to make up for the lack of instruction-level parallelism

• The aim was to target mainstream processor designs such as the Alpha 21164