
Page 1: CS 7810 Lecture 21

Threaded Multiple Path Execution

S. Wallace, B. Calder, D. Tullsen
Proceedings of ISCA-25, June 1998

Page 2: Leveraging SMT

• Recall branch fan-out from “Limits of ILP”

• Future processors will likely have no shortage of idle thread contexts

• Spawned threads are parallel, but have dependences with earlier instructions: registers, uncommitted stores, data cache values

• SMT may be an ideal candidate as threads share the same set of resources

Page 3: SMT vs. CMP

• A multi-threaded workload on an SMT is already tolerant of branch mispredicts – TME makes most sense when there is a shortage of threads

• Power overheads are enormous – on an SMT, we may not have the option to execute speculative threads on low-power pipelines

• What about energy?

• Is CMP a better candidate?

Page 4: Renaming Overview

Example: r1 initially maps to p1. The first write renames r1 to p5, so the subsequent read uses p5; the write after the branch renames r1 to p3. A checkpoint at the branch preserves the r1 → p5 mapping.

  logical:   r1 ← …    … ← r1    br    r1 ← …
  physical:  p5 ← …    … ← p5    br    p3 ← …

• Every branch causes a checkpoint of mappings, so we can recover quickly on a mispredict

• Each thread in the SMT can have 8 checkpoints
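The checkpointing idea above can be sketched in a few lines. This is an illustrative software model, not the paper's hardware: class and method names are my own, and the per-thread limit of 8 checkpoints is only noted in a comment.

```python
# Sketch of a rename map with per-branch checkpoints, so a mispredict
# restores all mappings in one step (names/structure are assumptions).
class RenameMap:
    def __init__(self, num_logical=32):
        # start with the identity mapping: r_i -> p_i
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.checkpoints = []          # stack of saved map copies (8 per thread in the paper)

    def rename_dest(self, logical, new_physical):
        # a write to a logical register gets a fresh physical register
        self.map[logical] = new_physical

    def checkpoint(self):
        # at a branch, snapshot the whole table
        self.checkpoints.append(dict(self.map))
        return len(self.checkpoints) - 1

    def recover(self, ckpt_id):
        # on a mispredict, restore and discard younger checkpoints
        self.map = self.checkpoints[ckpt_id]
        del self.checkpoints[ckpt_id:]

rm = RenameMap()
rm.rename_dest("r1", "p5")
ck = rm.checkpoint()          # branch: snapshot mappings
rm.rename_dest("r1", "p3")    # wrong-path write to r1
rm.recover(ck)                # mispredict: r1 -> p5 again
```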

Page 5: Threaded Multi-Path Execution

Key elements in TME:

• Identifying low-confidence branches

• Efficient thread spawning

• Efficient recovery on branch resolution

• Fetch priorities for each thread on SMT

Page 6: Path Selection

• Only the primary path can spawn threads (prevents an exponential increase in threads)

• For each bpred entry, keep track of successive correct predictions (reset on a mispredict)
  – if the counter is below a threshold, the branch is low-confidence
  – note that a small counter size is more selective in picking low-confidence branches
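The resetting-counter confidence estimator above can be sketched as follows. The threshold value and the per-PC table layout are assumptions for illustration; the paper explores the actual parameters.

```python
# Sketch: count consecutive correct predictions per branch PC; below
# THRESHOLD the branch is "low confidence" and may trigger a spawn.
THRESHOLD = 4          # hypothetical value, not the paper's

class ConfidenceEstimator:
    def __init__(self):
        self.counters = {}          # branch PC -> consecutive-correct count

    def update(self, pc, was_correct):
        if was_correct:
            self.counters[pc] = self.counters.get(pc, 0) + 1
        else:
            self.counters[pc] = 0   # reset on mispredict

    def low_confidence(self, pc):
        return self.counters.get(pc, 0) < THRESHOLD

ce = ConfidenceEstimator()
for correct in [True, True, False, True]:
    ce.update(0x400, correct)
print(ce.low_confidence(0x400))   # counter is 1 -> low confidence
```

A smaller counter (lower threshold) flags fewer branches as low-confidence, which matches the slide's note that it is more selective.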

Page 7: Register Mappings

• In SMT, each thread can read any physical register

• Thread spawning requires a copy of the register mappings at that branch

• A copy involves a transfer of (32 x 9 bits)
  – the new thread cannot begin renaming until this copy is complete
  – the copy may also hold up the primary thread if map table read ports are scarce

• Every new mapping can be placed on a bus and idle threads can snoop and keep pace
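The snooping alternative above can be sketched as a broadcast bus. This is a loose software analogy (class names are my own): idle contexts mirror every new mapping, so a spawn needs no bulk 32-entry copy.

```python
# Sketch of map-bus snooping: every new mapping the primary thread
# creates is broadcast, and idle contexts apply it too, so their map
# tables already match the primary's when a thread is spawned.
class Context:
    def __init__(self, num_logical=32):
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}

class MapBus:
    def __init__(self, primary, idle_contexts):
        self.primary = primary
        self.snoopers = idle_contexts

    def broadcast(self, logical, physical):
        self.primary.map[logical] = physical
        for ctx in self.snoopers:       # idle contexts keep pace
            ctx.map[logical] = physical

primary, spare = Context(), Context()
bus = MapBus(primary, [spare])
bus.broadcast("r1", "p5")
# the spare context already mirrors the primary, so spawning it
# requires no bulk transfer of the mapping table
assert spare.map == primary.map
```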

Page 8: Spawning Algorithm

Page 9: Spawning Algorithm (continued)

• When threads are idle, they keep pace and spawn a thread as soon as a low-confidence branch is encountered

• When a thread context becomes free and a low-confidence checkpoint already exists, the new context synchronizes mappings with the primary context and executes the primary path, while the old primary context executes the alternate path after reinstating the checkpoint

• If a newly idle thread has a low-confidence checkpoint, it starts executing the alternate path
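The three spawning rules above can be restated as a single decision function. This is a loose sketch, not the paper's hardware: the state encoding and action names are assumptions made for illustration.

```python
# Sketch of the TME spawning decision, per the three rules above.
def spawn_action(idle_contexts, pending_low_conf_checkpoint,
                 at_low_conf_branch):
    if at_low_conf_branch and idle_contexts > 0:
        # an idle context has been keeping pace via snooping:
        # fork onto the alternate path immediately
        return "spawn-alternate-path-now"
    if pending_low_conf_checkpoint and idle_contexts > 0:
        # a newly freed context syncs mappings and takes over the
        # primary path; the old primary context reinstates the
        # checkpoint and runs the alternate path
        return "handoff-primary-run-alternate"
    return "no-spawn"

print(spawn_action(1, False, True))    # spawn-alternate-path-now
```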

Page 10: Introduced Complexity

• Book-keeping to manage checkpoint locations – every branch has to track the location of its checkpoint

• Who frees a register value?

• What about memory dependences?

• Loads can ignore stores that are not predecessors

• Maintain an array of bits to represent the path taken (each basic block corresponds to a bit in the array)

• Check for memory dependences only if the store’s path is a subset of the load’s path

[Figure: r1 maps to p5 before the fork; the two spawned paths map r1 to p7 and p8 respectively]
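With one bit per basic block, the path-subset test above reduces to a bitmask check; the block labels in the example below are assumptions chosen for illustration.

```python
# A store is a potential predecessor of a load only if every bit set
# in the store's path vector is also set in the load's path vector.
def must_check_dependence(store_path: int, load_path: int) -> bool:
    return (store_path & load_path) == store_path

# Suppose the load took blocks {A, B, C} = 0b0111; one store is on
# {A, B} = 0b0011 (a predecessor), another is on {A, D} = 0b1001
# (a different arm of the fork).
assert must_check_dependence(0b0011, 0b0111) is True   # check it
assert must_check_dependence(0b1001, 0b0111) is False  # ignore it
```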

Page 11: Processor Parameters

• Eight-wide processor with up to eight contexts; each context has eight checkpoints

• 32-entry issue queues, 4Kb gshare branch predictor, 7-cycle mispredict penalty, memory latency of 62 cycles

• ICOUNT 2.8: fetch from up to two threads per cycle – the first thread can bring in up to 8 instrs and the second thread fills in unused slots; the thread with the lowest front-end occupancy has priority
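The fetch policy above can be sketched as a simple selection function. This is a schematic model of ICOUNT 2.8 as described on the slide, not the actual fetch hardware; the function signature and the occupancy/availability inputs are assumptions.

```python
# Sketch of ICOUNT 2.8: pick the two threads with the fewest
# instructions in the front end; the first may fetch up to 8
# instructions and the second fills whatever slots remain.
def icount_2_8(front_end_occupancy, available=None):
    # front_end_occupancy: thread id -> instrs currently in the front end
    # available: optional thread id -> instrs fetchable this cycle
    order = sorted(front_end_occupancy, key=front_end_occupancy.get)
    picks, slots = [], 8
    for tid in order[:2]:
        want = 8 if available is None else available[tid]
        take = min(want, slots)
        if take > 0:
            picks.append((tid, take))
            slots -= take
    return picks

print(icount_2_8({0: 12, 1: 3, 2: 7}, {0: 8, 1: 5, 2: 8}))
# thread 1 (lowest occupancy) fetches 5; thread 2 fills the other 3
```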

• Focus on branch-limited programs: compress (20%), gcc (18%), go (30%), li (6%)

Page 12: Results: Spare Contexts

Page 13: Results: Bus Latency

Page 14: Results: Branch Confidence

Page 15: Results: Path Selection

Page 16: Results: Fetch Policy

Page 17: Results: Mispredict Penalty

Page 18: Conclusions

• Too much complexity/power overhead, too little benefit?

• Benefits may be higher for deeper pipelines, larger windows (this paper evaluates 8 windows of 48 instrs; would 2 x 192 yield better results?), and longer memory latencies

• There is room for improvement with better branch confidence metrics

• CMPs will incur greater cost during thread spawning, but may be more power-efficient
