CS 7810 Lecture 18: The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization. J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998


CS 7810 Lecture 18

The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization

J.G. Steffan and T.C. Mowry

Proceedings of HPCA-4

February 1998

Multi-Threading

• CMPs advocate low complexity and static approaches to parallelism extraction

• Resolving memory dependences for integer codes is not easy!

Large window: 100 in-flight instrs

Compiler-generated threads: 4 windows of 25 instrs each

Probable Conflicts

[Figure: diagram labeled p and q]

Example: Compress

Example Execution


Compiler Optimizations

• Induction variables: in_count

• Reduction: out_count

• Parallel I/O: getchar() and putchar()

• Scalar forwarding: free_entries

• Ambiguous loads and stores: hash[…]
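
The first two bullets can be illustrated with a small sketch. It is not the code from compress or from the paper: the loop body, the epoch size, and the names `sequential`, `run_epoch`, `epochs` and `EPOCH_SIZE` are assumptions made for illustration. Each epoch derives its in_count from its epoch number rather than waiting for the previous epoch (induction variable), and keeps a private partial out_count that is summed at the end (reduction).

```python
# Hypothetical sketch: removing loop-carried dependences on an induction
# variable (in_count) and a reduction variable (out_count) across epochs.

EPOCH_SIZE = 4  # iterations per epoch (assumed value)


def sequential(data):
    """Original loop: in_count and out_count are carried between iterations."""
    in_count = 0
    out_count = 0
    for ch in data:
        in_count += 1
        if ch != " ":  # stand-in for "this iteration emits output"
            out_count += 1
    return in_count, out_count


def run_epoch(epoch_id, data):
    """One epoch; it needs no value produced by an earlier epoch."""
    start = epoch_id * EPOCH_SIZE  # induction: computed, not forwarded
    chunk = data[start:start + EPOCH_SIZE]
    local_in = start + len(chunk)  # in_count at the end of this epoch
    local_out = sum(1 for ch in chunk if ch != " ")  # private partial sum
    return local_in, local_out


def epochs(data):
    """Run all epochs independently, then combine their results."""
    n_epochs = (len(data) + EPOCH_SIZE - 1) // EPOCH_SIZE
    results = [run_epoch(e, data) for e in range(n_epochs)]
    in_count = results[-1][0] if results else 0
    out_count = sum(r[1] for r in results)  # reduction combined at the end
    return in_count, out_count
```

Because `run_epoch` reads only its own slice of the input, the calls could run on separate processors in any order; only the final summation is serial.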

Methodology

• Threads (epochs) were constructed by hand

• The procs are in-order and instrs are unit latency

Ambiguous Loads and Stores

Average Run Lengths

Forwarding Registers and Scalars

Average Run Lengths

Realistic Models

• 10-cycle forwarding latency

• Sharing at cache line granularity

• Recovery from misspeculation

• Results are not sensitive to forwarding latency or cache line size
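
Tracking sharing at cache line granularity means two accesses to different words can still look like a dependence if they fall in the same line. A minimal sketch, assuming a 32-byte line (the slide does not give the size) and the hypothetical helpers `line_of` and `conflicts`:

```python
LINE_SIZE = 32  # bytes per cache line (assumed value, not from the slides)


def line_of(addr, line_size=LINE_SIZE):
    """Index of the cache line that holds byte address addr."""
    return addr // line_size


def conflicts(store_addr, load_addr, line_size=LINE_SIZE):
    """True if hardware that tracks whole lines would see a dependence."""
    return line_of(store_addr, line_size) == line_of(load_addr, line_size)
```

With these definitions, `conflicts(0x100, 0x104)` is true even though the two addresses name different words, while `conflicts(0x100, 0x104, line_size=4)` is false.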

Hardware Support

• Cache coherence protocol for the L1 caches

• For each cache line, keep track of whether the line has been read/modified

• When the oldest thread writes to a cache line, an invalidate is sent to the other caches

• A younger thread that has speculatively loaded the line sets its violation flag; s/w recovery is initiated when that thread tries to commit

• Cache line evicts cause violations (not common)
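
The bullets above can be sketched as a small software model. This is an illustration only, not the protocol as implemented in the paper: the class `SpeculativeCache` and the functions `store` and `try_commit` are hypothetical names, epochs are numbered so that a smaller number means older, and evictions are not modeled.

```python
class SpeculativeCache:
    """Per-epoch L1 state: the lines this epoch has speculatively loaded."""

    def __init__(self, epoch_id):
        self.epoch_id = epoch_id   # smaller id = older epoch (assumption)
        self.spec_loaded = set()   # lines read before older epochs finished
        self.violation = False

    def load(self, line):
        self.spec_loaded.add(line)

    def on_invalidate(self, writer_id, line):
        # Only a younger epoch that already read the line has used stale data.
        if self.epoch_id > writer_id and line in self.spec_loaded:
            self.violation = True


def store(caches, writer_id, line):
    """A write by epoch writer_id sends an invalidate to the other caches."""
    for cache in caches:
        if cache.epoch_id != writer_id:
            cache.on_invalidate(writer_id, line)


def try_commit(cache):
    """At commit time, a flagged epoch runs recovery instead of committing."""
    return "recover" if cache.violation else "commit"
```

A load that happens after the older epoch's store sees the new value, so it does not set the flag; only the load-before-store ordering is a violation.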

Role of the Compiler

• Profiling to identify epochs large enough to offset thread management and communication cost, yet small enough to keep speculative state low

• Estimate probability of violation (static/dynamic)

• Optimizations (induction, reduction, parallel I/O)

• Scalar forwarding and rescheduling

• Insertion of register recovery code
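
Why rescheduling matters for scalar forwarding can be seen with a toy timing model. Everything here is an assumption for illustration: all epochs start together on separate processors, every instruction takes one cycle, each epoch waits once for a forwarded scalar at offset `wait_at` and signals its successor at offset `signal_at`, and `finish_times` is a hypothetical helper.

```python
def finish_times(n_epochs, epoch_len, wait_at, signal_at, fwd_latency):
    """Finish time of each epoch in a chain of forwarded scalar values."""
    assert 0 <= wait_at <= signal_at <= epoch_len
    finishes = []
    arrival = 0  # epoch 0 already has the value when it starts
    for _ in range(n_epochs):
        resume = max(wait_at, arrival)                # stall until value arrives
        signal_time = resume + (signal_at - wait_at)  # wait-to-signal distance
        finishes.append(signal_time + (epoch_len - signal_at))
        arrival = signal_time + fwd_latency           # when the successor sees it
    return finishes
```

In this model, four 100-instruction epochs with a 10-cycle forwarding latency finish at cycle 385 when the scalar is consumed at offset 10 and produced at offset 95, but at cycle 145 when the code is scheduled so the wait (offset 90) sits just before the signal (offset 95). The distance between wait and signal, plus the forwarding latency, is what each epoch adds to the critical path.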

Conclusions

• Hardware catches violations; compiler can parallelize aggressively

• Competitive implementation: large window with store sets prediction
