Compiler Optimization of scalar and memory resident values between speculative threads. Antonia Zhai et al.


TRANSCRIPT

Page 1: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Compiler Optimization of scalar and memory resident values between speculative threads

Antonia Zhai et al.

Page 2: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Improving Performance of a Single Application

Finding parallel threads is difficult.

[Figure: a 4-processor chip multiprocessor (processor/cache pairs within the chip boundary) running threads T1-T4, with a data dependence from Store 0x86 in one thread to Load 0x86 in a later thread]

Page 3: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Compiler Makes Conservative Decisions

Pros: examines the entire program.

Cons: makes conservative decisions for ambiguous dependences caused by:
- complex control flow
- pointer and indirect references
- runtime input

for (i = 0; i < N; i++) {
    ... = *q;
    *p = ...;
}

[Figure: four loop iterations accessing address pairs 0x80/0x86, 0x60/0x66, 0x90/0x96, and 0x50/0x56; whether *p and *q alias is known only at runtime]
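A minimal compilable version of this pattern (our own example, not from the slides): because p and q may alias, a static compiler must assume the store can feed the next iteration's load and serialize the loop.

    #include <stdio.h>

    /* Whether *p and *q alias depends on runtime input, so the
     * compiler must conservatively assume a loop-carried dependence. */
    void scale(int *p, int *q, int n) {
        for (int i = 0; i < n; i++) {
            int v = *q;     /* may read the value stored below       */
            *p = v + 1;     /* may feed the next iteration's *q read */
            p++; q++;
        }
    }

    int main(void) {
        int a[8] = {0};
        scale(&a[1], &a[0], 7);   /* here p and q DO alias: a[i+1] = a[i] + 1 */
        printf("%d\n", a[7]);     /* prints 7 */
        return 0;
    }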

Page 4: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Using Hardware to Exploit Parallelism

Search for independent instructions within a window.

Pros: disambiguates dependences accurately at runtime.

Cons: exploits parallelism only among a small number of instructions.

How can we exploit parallelism with both hardware and compiler?

[Figure: an instruction window holding Load 0x88, Store 0x66, Add a1, a2]

Page 5: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Thread-Level Speculation (TLS)

Compiler: creates the parallel threads.
Hardware: disambiguates dependences.

My goal: speed up programs with the help of TLS.

[Figure: sequential execution (Store *p followed by Load *q) vs. speculatively parallel execution over time, where the dependence from Store 0x86 to Load 0x86 is detected at runtime]

Page 6: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Single Chip Multiprocessor (CMP)

[Figure: processor/cache pairs on an interconnect within the chip boundary, backed by off-chip memory]

Our support and optimization for TLS is built upon a CMP but can be applied to other architectures that support multiple threads.

- Replicated processor cores reduce design cost.
- Scalable and decentralized design: wires stay local and centralized structures are eliminated.
- Infrastructure transparency: legacy codes are handled.

Page 7: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Thread-Level Speculation (TLS)

[Figure: speculatively parallel execution over time, with a dependence from Store *p to Load *q]

Supporting thread-level speculation requires the ability to:
- buffer speculative writes from memory,
- track data dependences,
- detect data dependence violations, and
- recover from failed speculation.

Page 8: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Buffering Speculative Writes from the Memory

The cache coherence protocol is extended to support TLS: each cache holds the data contents plus speculative state, and a directory maintains cache coherence.

[Figure: four processor/cache pairs on an interconnect within the chip boundary, backed by memory]

Page 9: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Detecting Dependence Violations

- The producer forwards all store addresses.
- The consumer detects data dependence violations.

This extends the cache coherence protocol [ISCA'00]. A toy model of the consumer-side check follows.

[Figure: producer and consumer processors on an interconnect; a forwarded store address triggers a detected violation at the consumer]
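A toy model of the detection rule (our code; the real mechanism lives in the extended coherence protocol, and these function names are hypothetical):

    #include <stdbool.h>

    /* The consumer records the addresses it has speculatively loaded;
     * a store address forwarded by the logically earlier producer that
     * hits this set means the consumer read stale data: a violation. */
    #define MAX_LOADS 64
    static const void *loaded[MAX_LOADS];
    static int nloaded;

    void record_speculative_load(const void *addr) {
        if (nloaded < MAX_LOADS)
            loaded[nloaded++] = addr;
    }

    bool producer_store_violates(const void *addr) {
        for (int i = 0; i < nloaded; i++)
            if (loaded[i] == addr)
                return true;   /* data dependence violation detected */
        return false;
    }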

Page 10: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Synchronizing Scalars

Identifying the scalars that cause dependences is straightforward.

[Figure: producer and consumer threads over time; the producer's definitions a = ... feed the consumer's uses ... = a]

Page 11: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Synchronizing Scalars

Dependent scalars should be synchronized: the producer executes signal(a) after its last definition of a, and the consumer executes wait(a) before its first use, so that the consumer uses the forwarded value.

[Figure: producer and consumer threads over time, with signal(a) in the producer feeding wait(a) in the consumer]
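As a software analogy (a minimal sketch assuming one shared forwarding slot per scalar; the paper uses dedicated wait/signal instructions and hardware forwarding, not this code):

    #include <stdatomic.h>

    /* One forwarding slot per scalar: signal publishes the value,
     * wait spins until it arrives. */
    typedef struct {
        _Atomic int ready;
        int value;
    } fwd_slot;

    void signal_scalar(fwd_slot *s, int v) {
        s->value = v;                                   /* publish value */
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }

    int wait_scalar(fwd_slot *s) {
        while (!atomic_load_explicit(&s->ready, memory_order_acquire))
            ;                                           /* spin          */
        return s->value;
    }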

Page 12: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Reducing the Critical Forwarding Path

Instruction scheduling can reduce the critical forwarding path: the path from wait(a), through ... = a and a = ..., to signal(a).

[Figure: a long critical path, where signal(a) comes late in the thread, vs. a short critical path after scheduling]
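A before/after sketch of the same loop body (our code; wait_a/signal_a are hypothetical stand-ins for the TLS wait/signal primitives):

    extern int  wait_a(void);
    extern void signal_a(int v);
    extern void unrelated_work(void);  /* neither reads nor writes a */

    void body_long_critical_path(void) {
        int a = wait_a();
        unrelated_work();              /* next thread stalls through this */
        a = a * 11;                    /* last definition of a            */
        signal_a(a);
    }

    void body_short_critical_path(void) {
        int a = wait_a();
        a = a * 11;
        signal_a(a);                   /* forwarded as early as possible  */
        unrelated_work();              /* now off the critical path       */
    }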

Page 13: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Potential

Page 14: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Compiler Infrastructure

• Loops were targeted; selected loops were discarded if they had:
  - low coverage (<0.1% of execution time),
  - fewer than 30 instructions per iteration, or
  - more than 16384 instructions per iteration.

• The remaining loops were profiled and simulated to measure an optimistic upper bound, and good loops were chosen based on the results.

• The compiler inserts instructions to create and manage epochs.

• The compiler allocates forwarded variables on the stack (the forwarding frame); one plausible layout is sketched below.

• The compiler inserts wait and signal instructions.
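A possible shape for the forwarding frame (our illustration; the slides do not specify the actual layout):

    /* Hypothetical forwarding frame: each forwarded scalar gets a
     * fixed stack slot, so wait/signal can name a scalar by address. */
    struct forwarding_frame {
        void *p;        /* e.g. the forwarded pointer in the GCC example */
        long  counter;  /* any other forwarded scalar                    */
    };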

Page 15: Compiler Optimization of  scalar  and  memory resident  values between speculative threads
Page 16: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Synchronization

• Constraints:
  - wait before the first use;
  - signal after the last definition;
  - signal on every possible path;
  - the wait should be as late as possible;
  - the signal should be as early as possible.

Page 17: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Dataflow Analysis for Synchronization

• The set of forwarded scalars is the intersection of the scalars with a downward-exposed definition and those with an upward-exposed use; scalars that are live outside the loop are also included (see the formula below).

• The CFG is modeled as a graph with basic blocks as nodes.
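In set notation (ours, not the slides'):

    Forwarded = (DownwardExposedDefs ∩ UpwardExposedUses) ∪ LiveOutsideLoop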

Page 18: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

• Conservative scheduling

• Aggressive scheduling, speculating past:
  - control dependences
  - data dependences

Page 19: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

[Figure: three stages. Initial synchronization places wait(a) before the first use (... = a) and signal(a) after the last definition (a = ...); scheduling instructions then moves them closer together; speculatively scheduling instructions goes further still.]

Page 20: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

[Figure: starting from the initial synchronization (wait(a) ... = a ... a = ... signal(a)), scheduling hoists the computation of a together with signal(a) upward through complex control flow containing many definitions of a, each paired with its own signal(a).]

Page 21: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

[Figure: speculative scheduling additionally moves a = ...; signal(a) above a potentially conflicting store *q = ..., so the signal executes before the store is resolved.]

Page 22: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

Dataflow analysis handles complex control flow. We define two dataflow analyses:

- "Stack" analysis: finds the instructions needed to compute the forwarded value.
- "Earliest" analysis: finds the earliest node at which the forwarded value can be computed.

Page 23: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Computation Stack

A computation stack stores the instructions needed to compute a forwarded value; one stack is associated with every node for every forwarded scalar. A node's stack is in one of three states: not yet evaluated; we do not know how to compute the forwarded value; or we know how to compute it (for example, the stack holding a = a*11 followed by signal a).

Page 24: Compiler Optimization of  scalar  and  memory resident  values between speculative threads
Page 25: Compiler Optimization of  scalar  and  memory resident  values between speculative threads
Page 26: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

A Simplified Example from GCC

do {
    . . .
} while (p);

[CFG: start -> {counter=0; wait(p); p->jmp?} -> {p=p->jmp} on the taken path -> {q=p; p->real?} -> either {p=p->next} directly, or the inner loop {counter++; q=q->next; q?} followed by {p=p->next} -> end. A reconstruction follows.]
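A hedged reconstruction of the loop body implied by this CFG (field and variable names are from the slide; the actual GCC source may differ):

    struct node { struct node *jmp, *next; int real; };

    void walk(struct node *p) {
        int counter;
        do {
            counter = 0;              /* loop entry; p arrives via wait(p) */
            if (p->jmp)
                p = p->jmp;
            if (p->real) {
                struct node *q = p;
                do {                  /* inner loop never redefines p */
                    counter++;
                    q = q->next;
                } while (q);
            }
            p = p->next;              /* last definition: signal(p) goes here */
        } while (p);
    }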

Pages 27-40: Stack Analysis (step-by-step animation over the CFG above)

[Figure sequence: the analysis starts at the end node, whose stack is {signal p}. Propagating backward, each node that defines p pushes that definition, producing stacks such as {p=p->next; signal p} and {p=p->jmp; p=p->next; signal p}. Nodes whose successors disagree are marked "node to revisit", and the analysis iterates until the solution is consistent.]

Page 41: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Scheduling Instructions

Dataflow analysis handles complex control flow. Recall the two dataflow analyses:

- "Stack" analysis: finds the instructions needed to compute the forwarded value.
- "Earliest" analysis: finds the earliest node at which the forwarded value can be computed.

Pages 42-46: The Earliest Analysis (step-by-step animation over the same CFG)

[Figure sequence: each node is marked Earliest, Not Earliest, or Not Evaluated; the analysis converges on the earliest nodes at which the forwarded value of p can be computed and signaled.]

Page 47: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Code Transformation

[Figure: the transformed CFG. At the earliest nodes, the forwarded value is computed into fresh temporaries and signaled immediately: on the p->jmp path, p2 = p->jmp; p1 = p2->next; signal(p1); on the other path, p1 = p->next; signal(p1).]

Page 48: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

[Figure (recap): initial synchronization, scheduled synchronization, and speculative scheduling that moves a = ...; signal(a) above the potentially conflicting store *q = ...]

Page 49: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Aggressive scheduling

• Optimize the common case by moving the signal up.

• On the wrong path, send violate_epoch to the next thread and then forward the correct value (see the sketch under Page 51).

• If instructions are scheduled past branches, then when an exception occurs a violation must be sent and the non-speculative code executed.

• Changes to the analysis:
  - Modify the meet operator to compute stacks for frequently executed paths.
  - Add a new node for infrequent paths and insert violate_epoch signals there; Earliest is set to true for these nodes.

Page 50: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Aggressive scheduling

• Two new instructions:
  - mark_load tells the hardware to remember the address of a location; it is placed where a load moves past a store.
  - unmark_load clears the mark; it is placed at the original position of the load.

• In the meet operation, conflicting load marks are merged with a logical OR.

Page 51: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Speculating Beyond a Control Dependence

[Figure: on the frequently executed path, p1 = p->next; signal(p1) runs early; on the infrequent p->jmp path, the thread issues violate(p), recomputes p = p->jmp, and signals the correct value.]
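A minimal sketch of the transformation (our code; signal_p/violate_p are hypothetical stand-ins for the signal and violate_epoch primitives):

    struct node { struct node *jmp, *next; };

    extern void signal_p(struct node *v);  /* forward p to the next thread   */
    extern void violate_p(void);           /* squash the next thread's epoch */

    void forward_early(struct node *p) {
        signal_p(p->next);       /* common case: forward as early as possible */
        if (p->jmp) {            /* infrequent path actually taken            */
            violate_p();         /* consumer discards the speculative value   */
            p = p->jmp;
            signal_p(p->next);   /* re-forward the correct value              */
        }
    }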

Page 52: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Speculating Beyond a Potential Data Dependence

When profiling information indicates that a store (here *q = NULL) rarely conflicts, the load feeding p = p->next can be hoisted above it as a speculative load. The hardware support is similar to the memory conflict buffer [Gallagher et al., ASPLOS'94]. A sketch follows.

[Figure: p = p->next; signal(p) moved above *q = NULL, with the load marked speculative]
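Tying this to the mark_load/unmark_load instructions from Page 50 (our code, with the two instructions modeled as hypothetical intrinsics):

    struct node { struct node *next; };

    extern void mark_load(void *addr);    /* HW watches addr for later stores */
    extern void unmark_load(void *addr);  /* stop watching                    */
    extern void signal_p(struct node *v);

    void hoist_past_store(struct node *p, struct node **q) {
        mark_load(&p->next);         /* the load is moving above the store */
        struct node *p1 = p->next;   /* speculative load                   */
        signal_p(p1);                /* forward early                      */
        *q = 0;                      /* potentially conflicting store: a   */
                                     /* hit on &p->next is a violation     */
        unmark_load(&p->next);       /* original position of the load     */
    }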

Page 53: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Experimental Framework

Benchmarks: SPECint95 and SPECint2000, compiled with -O3 optimization.

Underlying architecture: a 4-processor single-chip multiprocessor, with speculation supported by the coherence protocol.

Simulator: superscalar, similar to the MIPS R14K; simulates communication latency and models all bandwidth and contention in detailed simulation.

[Figure: four processor/cache pairs on an interconnect]

Page 54: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Impact of Synchronization Stalls for Scalars

Performance bottleneck: synchronization (40% of execution time).

Detailed simulation: TLS support; 4-processor CMP; 4-way issue, out-of-order superscalar; 10-cycle communication latency.

[Bar chart: normalized region execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88ksim, and vortex, split into synchronization stall and other time]

Page 55: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Instruction Scheduling

U = no instruction scheduling; A = instruction scheduling.

Instruction scheduling improves performance by 18%, but there is still room for improvement.

[Bar chart: normalized region execution time (U vs. A) for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88ksim, and vortex, split into busy, failed speculation, synchronization stall, and other time; selected bars are annotated 5%, 22%, and 40%]

Page 56: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Aggressively Scheduling Instructions

A = instruction scheduling; S = speculating across control and data dependences.

Aggressive scheduling improves performance significantly for some benchmarks.

[Bar chart: normalized region execution time (A vs. S) for gcc, mcf, parser, perlbmk, and twolf, split into busy, failed speculation, synchronization stall, and other time; selected bars improve by 17%, 19%, and 15%]

Page 57: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Conclusions

• For 6 of 14 applications, performance improves by 6.2%-28.5%.

• Synchronization and parallelization expose some new data dependences.

• Speculatively scheduling past control and data dependences gives better performance.

Page 58: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Memory Variables

• Difficulties:
  - Aliasing: traditional dataflow analysis doesn't help.
  - There is no clear way to define the location of the last definition or the first use.

• The potential performance gain from reducing failed-speculation cycles is shown below.

Page 59: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Synchronizing Hardware

• Signal both the address and the value to the next thread.

• The producer requires a signal address buffer to ensure correct execution: if a store address is found in the signal address buffer, it is a misspeculation.

• The consumer keeps a local flag (the use-forward flag) to decide whether to load the value from the speculative cache or from local memory:
  - the flag is reset if the same thread writes to the memory location before reading it;
  - a NULL address is handled like any other address; when it raises an exception, non-speculative code is used.

Page 60: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Compiler support

• Each load and store is assigned an ID based on the call stack and is profiled for dependences.

• A dependence graph is constructed, and all IDs accessing the same location are grouped.

• All loads and stores in the same group are synchronized by the compiler.

• The compiler then clones all procedures on the call stacks containing frequent data dependences, so that synchronization executes only in the context of those call stacks; the original code is modified to call the cloned procedures (see the sketch below).

• Dataflow analysis analogous to the scalar case inserts a signal for the last store at the end of every path.
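An illustrative sketch of procedure cloning (our code; wait_mem/signal_mem are hypothetical stand-ins for the memory wait/signal operations):

    struct list { int head; };

    extern void wait_mem(void *addr);
    extern void signal_mem(void *addr);

    /* Original procedure, left unsynchronized for call stacks where
     * the dependence is rare. */
    void update(struct list *l) { l->head++; }

    /* Clone called only on the call stack that exhibits the frequent
     * dependence, so the synchronization cost is paid only there. */
    void update_sync(struct list *l) {
        wait_mem(&l->head);
        l->head++;
        signal_mem(&l->head);
    }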

Page 61: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Analysis of Data Dependence Patterns

• A potential study for deciding when to speculate and when to synchronize:
  - Significant performance improvement requires correctly predicting the inter-epoch dependences that occur in more than 5% of all epochs.
  - Dependence distances of one epoch have a significant effect on speedup.

Page 62: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Impact of Failed Speculation on Memory-Resident Values

Next performance bottleneck: failed speculation (30% of execution time).

Detailed simulation: TLS support; 4-processor CMP; 4-way issue, out-of-order superscalar; 10-cycle communication latency.

[Bar chart: normalized region execution time for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, and twolf, split into failed speculation and other time]

Page 63: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Compiler-Inserted Synchronization

U = no synchronization inserted; C = compiler-inserted synchronization.

Seven benchmarks speed up by 5% to 46%.

[Bar chart: normalized region execution time (U vs. C) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, and bzip2_comp, split into busy, failed speculation, synchronization stall, and other time; the improving benchmarks are annotated 10%, 46%, 13%, 5%, 8%, 5%, and 21%]

Page 64: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Compiler- vs. Hardware-Inserted Synchronization

C = compiler-inserted synchronization; H = hardware-inserted synchronization.

Compiler and hardware each benefit different benchmarks: the hardware does better on some, the compiler on others.

[Bar chart: normalized region execution time (C vs. H) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, and bzip2_comp, split into busy, failed speculation, synchronization stall, and other time]

Page 65: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Combining Hardware and Compiler Synchronization

C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = both combined.

The combination is more robust.

[Bar chart: normalized region execution time (C vs. H vs. B) for go, m88ksim, gzip_comp, gzip_decomp, perlbmk, and gap, split into busy, failed speculation, synchronization stall, and other time]

Page 66: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Conclusion

• Compiler-inserted memory synchronization improves performance.

• Hardware-inserted and compiler-inserted synchronization target different memory instructions and together do a better job.

Page 67: Compiler Optimization of  scalar  and  memory resident  values between speculative threads

Some other Issues

• Register pressure caused by scheduling scalar and memory instructions.

• Profiling time needed to obtain the frequency of data dependences.

• Heuristics: the choice of thresholds.