speculative parallelization in decoupled look-aheadparihar/talks/pact_talk.pdf · speculative...

29
Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY

Upload: trankiet

Post on 26-Jun-2018

230 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Speculative Parallelization in

Decoupled Look-ahead

Alok Garg, Raj Parihar, and Michael C. Huang

Dept. of Electrical & Computer Engineering

University of Rochester, Rochester, NY

Page 2: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Motivation

• Single-thread performance still important design goal

• Branch mispredictions and cache misses are still

important performance hurdles

• One effective approach: Decoupled Look-ahead

• Look-ahead thread can often become the new bottleneck

• Speculative parallelization aptly suited for its acceleration

• More parallelism opportunities: Look-ahead thread does not

contain all dependences

• Simple architectural support: Look-ahead not correctness critical

1

Page 3: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Outline

• Motivation

• Baseline decoupled look-ahead

• Speculative parallelization in look-ahead

• Experimental analysis

• Summary

2

Page 4: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Baseline Decoupled Look-ahead

• Binary parser is used to generate skeleton from original program

• The skeleton runs on a separate core and

• Maintains its own memory image in local L1, no write-back to shared L2

• Sends all branch outcomes through FIFO queue and also helps prefetching

3

*A. Garg and M. Huang. “A Performance-Correctness Explicitly Decoupled Architecture”. In Proc. Int’l Symp. On Microarch., November 2008

Page 5: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Practical Advantages of Decoupled

Look-ahead • Look-ahead thread is a single, self-

reliant agent

• No need for quick spawning and register

communication support

• Little management overhead on main thread

• Easier for run-time control to disable

• Natural throttling mechanism to prevent

run-away prefetching

• Look-ahead thread size comparable to

aggregation of short helper threads

4

Page 6: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Look-ahead: A New Bottleneck

5

• Comparing four

systems

• Baseline

• Decoupled look-ahead

• Ideal

• Look-ahead alone

• Application categories

• Bottlenecks removed

• Speed of look-ahead

not the problem

• Look-ahead is the

new bottleneck

Page 7: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Outline

• Motivation

• Baseline decoupled look-ahead

• Speculative parallelization in look-ahead

• Experimental analysis

• Summary

6

Page 8: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Unique Opportunities

• Skeleton code offers more

parallelism

• Certain dependencies removed

during slicing for skeleton

• Look-ahead is inherently error-

tolerant

• Can ignore dependence violations

• Little to no support needed, unlike in

conventional TLS

7

Assembly code from equake

Page 9: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Software Support

8

• Dependence analysis

• Profile guided, coarse-grain at

basic block level

• Spawn and Target points

• Basic blocks with consistent

dependence distance of more

than threshold of DMIN

• Spawned thread executes from

target point

• Loop level parallelism is also

exploited

Parallelism at basic block level

Available parallelism for 2 core/contexts system

Spawn

Target

Tim

e

Page 10: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Parallelism in Look-ahead Binary

9

Available theoretical parallelism for 2 core/contexts system; DMIN = 15 BB

Parallelism potential with stables and consistent target and spawn points

Page 11: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Hardware and Runtime Support

10

• Thread spawning and merging

• Not too different from regular thread

spawning except

• Spawned thread shares the same

register and memory state

• Spawning thread will terminate at

the target PC

• Value communication

• Register-based naturally through shared

registers in SMT

• Memory-based communication can be

supported at different levels

• Partial versioning in cache at line level

Spawning of a new look-ahead thread

Page 12: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Outline

• Motivation

• Baseline decoupled look-ahead

• Speculative parallelization in look-ahead

• Experimental analysis

• Summary

11

Page 13: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Experimental Setup

• Program/binary analysis tool: based on ALTO

• Simulator: based on heavily modified SimpleScalar

• SMT, look-ahead and speculative parallelization support

• True execution-driven simulation (faithfully value modeling)

12

Microarchitectural and cache configurations

Page 14: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Speedup of speculative parallel

decoupled look-ahead

13

• 14 applications in which look-ahead is bottleneck

• Speedup of look-ahead systems over single thread

• Decoupled look-ahead over single thread baseline: 1.61x

• Speculative look-ahead over single thread baseline: 1.81x

• Speculative look-ahead over decoupled look-ahead:1.13x

Page 15: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Speculative Parallelization in Look-

ahead vs Main thread

14

• Skeleton provides more opportunities for parallelization

• Speedup of speculative parallelization

• Speculative look-ahead over decoupled LA baseline: 1.13x

• Speculative main thread over single thread baseline: 1.07x

IPC

Core1

Core2

Baseline (ST)

Baseline TLS

Decoupled LA

Speculative LA

LA

1.07x

1.13x

LA0 LA1

1.81x

Page 16: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Flexibility in Hardware Design

15

• Comparison of regular (partial versioning) cache

support with two other alternatives

• No cache versioning support

• Dependence violation detection and squash

Page 17: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Other Details in the Paper

• The effect of spawning look-ahead thread in preserving the

lead of overall look-ahead system

• Technique to avoid spurious spawns in dispatch stage which

could be subjected to branch misprediction

• Technique to mitigate the damage of runaway spawns

• Impact of speculative parallelization on overall recoveries

• Modifications to branch queue in multiple look-ahead system

16

Page 18: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Outline

• Motivation

• Baseline decoupled look-ahead

• Speculative parallelization in look-ahead

• Experimental analysis

• Summary

17

Page 19: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Summary

• Decoupled look-ahead can significantly improve single-thread

performance but look-ahead thread often becomes new

bottleneck

• Look-ahead thread lends itself to TLS acceleration

• Skeleton construction removes dependences and increases parallelism

• Hardware design is flexible and can be a simple extension of SMT

• A straightforward implementation of TLS look-ahead

achieves an average 1.13x (up to 1.39x) speedup

18

Page 20: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Speculative Parallelization in

Decoupled Look-ahead

Alok Garg, Raj Parihar, and Michael C. Huang

Dept. of Electrical & Computer Engineering

University of Rochester, Rochester, NY

Backup Slides

http://www.ece.rochester.edu/projects/acal/

Page 21: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

References (Partial) • J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a

Cache Miss. In Proc. Int’l Conf. on Supercomputing, pages 68–75, July 1997.

• O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead Execution: An Alternative to Very Large

Instruction Windows for Out-of-order Processors. In Proc. Int’l Symp. on High-Perf. Comp. Arch.,

pages 129–140, February 2003.

• J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative

Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. Int’l Symp. On Comp. Arch.,

pages 14–25, June 2001.

• C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In Proc. Int’l Symp. on

Comp. Arch., pages 2–13, June 2001.

• A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. Int’l Symp. on High-Perf. Comp.

Arch., pages 37–48, January 2001.

• Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proc. Int’l Symp.

on Microarch., pages 269–280, December 2000.

• G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. Int’l Symp. on Comp. Arch.,

pages 414–425, June 1995.

20

Page 22: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

References (cont…) • H. Zhou. Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. In

Proc. Int’l Conf. on Parallel Arch. and Compilation Techniques, pages 231–242, September 2005.

• A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int’l

Symp. On Microarch., November 2008.

• C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. Int’l Symp. on Microarch.,

pages 85–96, Nov 2002.

• D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report 1342, Computer

Sciences Department, University of Wisconsin-Madison, June 1997.

• J. Renau, K. Strauss, L. Ceze,W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread-

Level Speculation on a CMP. IEEE Micro, 26(1):80–91, January/February 2006.

• P. Xekalakis, N. Ioannou, and M. Cintra. Combining Thread Level Speculation, Helper Threads, and

Runahead Execution. In Proc. Intl. Conf. on Supercomputing, pages 410–420, June 2009.

• M. Cintra, J. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization

in Shared-Memory Multiprocessors. In Proc. Int’l Symp. on Comp. Arch., pages 13–24, June 2000.

21

Page 23: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Backup Slides

For reference only

Page 24: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Register Renaming Support

23

Page 25: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Modified Banked Branch Queue

24

Page 26: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Speedup of all applications

25

Page 27: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Impact on Overall Recoveries

26

Page 28: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

“Partial” Recovery in Look-ahead

27

• When recovery happens in primary look-ahead

• Do we kill the secondary look-ahead or not?

• If we don’t kill we gain around 1% performance improvement

• After recovery in main, secondary thread survives often (1000 insts)

• Spawning of a secondary look-ahead thread helps preserving

the slip of overall look-ahead system

Page 29: Speculative Parallelization in Decoupled Look-aheadparihar/talks/pact_talk.pdf · Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang

Quality of Thread Spawning

• Successful and run-away spawns: killed after certain time

28

• Impact of lazy spawning policy

• Wait for few cycles when spawning opportunity arrives

• Avoids spurious spawns; Saves some wrong path execution