
Page 1: CS 7810 Lecture 21

Threaded Multiple Path Execution

S. Wallace, B. Calder, D. Tullsen
Proceedings of ISCA-25, June 1998

Page 2: Leveraging SMT

• Recall branch fan-out from “Limits of ILP”

• Future processors will likely have no shortage of idle thread contexts

• Spawned threads are parallel, but have dependences with earlier instructions: registers, uncommitted stores, data cache values

• SMT may be an ideal candidate as threads share the same set of resources

Page 3: SMT vs. CMP

• A multi-threaded workload on an SMT is already tolerant of branch mispredicts – TME makes most sense when there is a shortage of threads

• Power overheads are enormous – on an SMT, we may not have the option to execute speculative threads on low-power pipelines

• What about energy?

• Is CMP a better candidate?

Page 4: Renaming Overview

Example: r1 initially maps to p1. The first write renames r1 to p5, so the subsequent read uses p5; the write after the branch renames r1 to p3. A checkpoint at the branch preserves the r1 → p5 mapping.

  logical:   r1 ← …    … ← r1    br    r1 ← …
  physical:  p5 ← …    … ← p5    br    p3 ← …

• Every branch causes a checkpoint of mappings, so we can recover quickly on a mispredict

• Each thread in the SMT can have 8 checkpoints
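The checkpointing idea above can be sketched in a few lines. This is an illustrative software model, not the paper's hardware: class and method names are my own, and the per-thread limit of 8 checkpoints is only noted in a comment.

```python
# Sketch of a rename map with per-branch checkpoints, so a mispredict
# restores all mappings in one step (names/structure are assumptions).
class RenameMap:
    def __init__(self, num_logical=32):
        # start with the identity mapping: r_i -> p_i
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.checkpoints = []          # stack of saved map copies (8 per thread in the paper)

    def rename_dest(self, logical, new_physical):
        # a write to a logical register gets a fresh physical register
        self.map[logical] = new_physical

    def checkpoint(self):
        # at a branch, snapshot the whole table
        self.checkpoints.append(dict(self.map))
        return len(self.checkpoints) - 1

    def recover(self, ckpt_id):
        # on a mispredict, restore and discard younger checkpoints
        self.map = self.checkpoints[ckpt_id]
        del self.checkpoints[ckpt_id:]

rm = RenameMap()
rm.rename_dest("r1", "p5")
ck = rm.checkpoint()          # branch: snapshot mappings
rm.rename_dest("r1", "p3")    # wrong-path write to r1
rm.recover(ck)                # mispredict: r1 -> p5 again
```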

Page 5: Threaded Multi-Path Execution

Key elements in TME:

• Identifying low-confidence branches

• Efficient thread spawning

• Efficient recovery on branch resolution

• Fetch priorities for each thread on SMT

Page 6: Path Selection

• Only the primary path can spawn threads (prevents an exponential increase in threads)

• For each bpred entry, keep track of successive correct predictions (reset on a mispredict)
  – if the counter is below a threshold, the branch is low-confidence
  – note that a small counter size is more selective in picking low-confidence branches
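The resetting-counter confidence estimator above can be sketched as follows. The threshold value and the per-PC table layout are assumptions for illustration; the paper explores the actual parameters.

```python
# Sketch: count consecutive correct predictions per branch PC; below
# THRESHOLD the branch is "low confidence" and may trigger a spawn.
THRESHOLD = 4          # hypothetical value, not the paper's

class ConfidenceEstimator:
    def __init__(self):
        self.counters = {}          # branch PC -> consecutive-correct count

    def update(self, pc, was_correct):
        if was_correct:
            self.counters[pc] = self.counters.get(pc, 0) + 1
        else:
            self.counters[pc] = 0   # reset on mispredict

    def low_confidence(self, pc):
        return self.counters.get(pc, 0) < THRESHOLD

ce = ConfidenceEstimator()
for correct in [True, True, False, True]:
    ce.update(0x400, correct)
print(ce.low_confidence(0x400))   # counter is 1 -> low confidence
```

A smaller counter (lower threshold) flags fewer branches as low-confidence, which matches the slide's note that it is more selective.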

Page 7: Register Mappings

• In SMT, each thread can read any physical register

• Thread spawning requires a copy of the register mappings at that branch

• A copy involves a transfer of (32 x 9 bits)
  – the new thread cannot begin renaming until this copy is complete
  – the copy may also hold up the primary thread if map table read ports are scarce

• Every new mapping can be placed on a bus and idle threads can snoop and keep pace
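The snooping alternative above can be sketched as a broadcast bus. This is a loose software analogy (class names are my own): idle contexts mirror every new mapping, so a spawn needs no bulk 32-entry copy.

```python
# Sketch of map-bus snooping: every new mapping the primary thread
# creates is broadcast, and idle contexts apply it too, so their map
# tables already match the primary's when a thread is spawned.
class Context:
    def __init__(self, num_logical=32):
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}

class MapBus:
    def __init__(self, primary, idle_contexts):
        self.primary = primary
        self.snoopers = idle_contexts

    def broadcast(self, logical, physical):
        self.primary.map[logical] = physical
        for ctx in self.snoopers:       # idle contexts keep pace
            ctx.map[logical] = physical

primary, spare = Context(), Context()
bus = MapBus(primary, [spare])
bus.broadcast("r1", "p5")
# the spare context already mirrors the primary, so spawning it
# requires no bulk transfer of the mapping table
assert spare.map == primary.map
```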

Page 8: Spawning Algorithm

Page 9: Spawning Algorithm (continued)

• When threads are idle, they keep pace and spawn a thread as soon as a low-confidence branch is encountered

• When a thread context becomes free and a low-confidence checkpoint already exists, the new context synchronizes mappings with the primary context and executes the primary path, while the old primary context executes the alternate path after reinstating the checkpoint

• If a newly idle thread has a low-confidence checkpoint, it starts executing the alternate path
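The three spawning rules above can be restated as a single decision function. This is a loose sketch, not the paper's hardware: the state encoding and action names are assumptions made for illustration.

```python
# Sketch of the TME spawning decision, per the three rules above.
def spawn_action(idle_contexts, pending_low_conf_checkpoint,
                 at_low_conf_branch):
    if at_low_conf_branch and idle_contexts > 0:
        # an idle context has been keeping pace via snooping:
        # fork onto the alternate path immediately
        return "spawn-alternate-path-now"
    if pending_low_conf_checkpoint and idle_contexts > 0:
        # a newly freed context syncs mappings and takes over the
        # primary path; the old primary context reinstates the
        # checkpoint and runs the alternate path
        return "handoff-primary-run-alternate"
    return "no-spawn"

print(spawn_action(1, False, True))    # spawn-alternate-path-now
```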

Page 10: Introduced Complexity

• Book-keeping to manage checkpoint locations – every branch has to track the location of its checkpoint

• Who frees a register value?

• What about memory dependences?

• Loads can ignore stores that are not predecessors

• Maintain an array of bits to represent the path taken (each basic block corresponds to a bit in the array)

• Check for memory dependences only if the store’s path is a subset of the load’s path

[Figure: r1 maps to p5 before the fork; the two spawned paths map r1 to p7 and p8 respectively]
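With one bit per basic block, the path-subset test above reduces to a bitmask check; the block labels in the example below are assumptions chosen for illustration.

```python
# A store is a potential predecessor of a load only if every bit set
# in the store's path vector is also set in the load's path vector.
def must_check_dependence(store_path: int, load_path: int) -> bool:
    return (store_path & load_path) == store_path

# Suppose the load took blocks {A, B, C} = 0b0111; one store is on
# {A, B} = 0b0011 (a predecessor), another is on {A, D} = 0b1001
# (a different arm of the fork).
assert must_check_dependence(0b0011, 0b0111) is True   # check it
assert must_check_dependence(0b1001, 0b0111) is False  # ignore it
```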

Page 11: Processor Parameters

• Eight-wide processor with up to eight contexts; each context has eight checkpoints

• 32-entry issue queues, 4Kb gshare branch predictor, 7-cycle mispredict penalty, memory latency of 62 cycles

• ICOUNT 2.8: fetch from up to two threads per cycle – the first thread can bring in up to 8 instrs and the second thread fills in unused slots; the thread with the lowest front-end occupancy has priority
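The fetch policy above can be sketched as a simple selection function. This is a schematic model of ICOUNT 2.8 as described on the slide, not the actual fetch hardware; the function signature and the occupancy/availability inputs are assumptions.

```python
# Sketch of ICOUNT 2.8: pick the two threads with the fewest
# instructions in the front end; the first may fetch up to 8
# instructions and the second fills whatever slots remain.
def icount_2_8(front_end_occupancy, available=None):
    # front_end_occupancy: thread id -> instrs currently in the front end
    # available: optional thread id -> instrs fetchable this cycle
    order = sorted(front_end_occupancy, key=front_end_occupancy.get)
    picks, slots = [], 8
    for tid in order[:2]:
        want = 8 if available is None else available[tid]
        take = min(want, slots)
        if take > 0:
            picks.append((tid, take))
            slots -= take
    return picks

print(icount_2_8({0: 12, 1: 3, 2: 7}, {0: 8, 1: 5, 2: 8}))
# thread 1 (lowest occupancy) fetches 5; thread 2 fills the other 3
```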

• Focus on branch-limited programs: compress (20%), gcc (18%), go (30%), li (6%)

Page 12: Results: Spare Contexts

Page 13: Results: Bus Latency

Page 14: Results: Branch Confidence

Page 15: Results: Path Selection

Page 16: Results: Fetch Policy

Page 17: Results: Mispredict Penalty

Page 18: Conclusions

• Too much complexity/power overhead, too little benefit?

• Benefits may be higher for deeper pipelines, larger windows (this paper evaluates 8 windows of 48 instrs; would 2 x 192 yield better results?), and longer memory latencies

• There is room for improvement with better branch confidence metrics

• CMPs will incur greater cost during thread spawning, but may be more power-efficient
