Compiler Optimization of Scalar and Memory-Resident Values Between Speculative Threads

DESCRIPTION
Transcript of a presentation on compiler optimization of scalar and memory-resident value communication between speculative threads, by Antonia Zhai et al. The slides cover thread-level speculation (TLS) on a chip multiprocessor, compiler-inserted synchronization, and instruction scheduling.

TRANSCRIPT
Compiler Optimization of Scalar and Memory-Resident Values Between Speculative Threads
Antonia Zhai et al.
Improving Performance of a Single Application
[Figure: a single-chip multiprocessor with multiple processor (P) / cache (C) pairs]
Finding parallel threads is difficult
[Figure: threads T1–T4 running across the chip boundary, with a dependence from "Store 0x86" in one thread to "Load 0x86" in a later thread]
Compiler Makes Conservative Decisions
Pros:
• Examines the entire program
Cons:
• Makes conservative decisions for ambiguous dependences:
– Complex control flow
– Pointer and indirect references
– Runtime input

for (i = 0; i < N; i++) {
  … = *q;
  *p = …;
}

[Figure: iterations #1–#4 accessing addresses 0x80/0x86, 0x60/0x66, 0x90/0x96, 0x50/0x56 through *q and *p]
Using Hardware to Exploit Parallelism
Search for independent instructions within a window
Pros:
• Disambiguates dependences accurately at runtime
Cons:
• Exploits parallelism only among a small number of instructions
[Figure: an instruction window containing "Load 0x88", "Store 0x66", "Add a1, a2"]
How to exploit parallelism with both hardware and compiler?
Thread-Level Speculation (TLS)
Compiler: creates parallel threads
Hardware: disambiguates dependences
My goal: speed up programs with the help of TLS
[Figure: sequential execution (Store *p, then Load *q) vs. speculatively parallel execution over time, where the dependence from Store 0x86 to Load 0x86 may be violated]
Single-Chip Multiprocessor (CMP)
[Figure: processor/cache pairs connected by an interconnect inside the chip boundary, with memory outside]
Our support and optimization for TLS is built upon a CMP, but applies to other architectures that support multiple threads.
• Replicated processor cores – reduce design cost
• Scalable and decentralized design – localize wires, eliminate centralized structures
• Infrastructure transparency – handle legacy codes
Thread-Level Speculation (TLS)
[Figure: speculatively parallel execution over time, with a dependence from Store *p to Load *q]
Supporting thread-level speculation requires the ability to:
• Recover from failed speculation
• Buffer speculative writes from memory
• Track data dependences
• Detect data dependence violations
Buffering Speculative Writes from Memory
Extending the cache coherence protocol to support TLS: each cache holds data contents plus speculative state, and a directory maintains cache coherence.
[Figure: CMP with processor/cache pairs on an interconnect, and memory beyond the chip boundary]
Detecting Dependence Violations
• The producer forwards all store addresses
• The consumer detects data dependence violations
[Figure: producer and consumer caches on the interconnect; a forwarded store address arriving at the consumer triggers "violation detected"]
Extending the cache coherence protocol [ISCA’00]
Synchronizing Scalars
Identifying the scalars that cause dependences is straightforward.
[Figure: producer and consumer threads over time; the producer’s "a = …" must reach the consumer’s "… = a"]
Synchronizing Scalars
Dependent scalars should be synchronized: the producer executes signal(a) after its last definition of a, the consumer executes wait(a) before its first use and then uses the forwarded value.
[Figure: producer and consumer threads over time, with signal(a) and wait(a) marking the forwarding of a]
Reducing the Critical Forwarding Path
The critical forwarding path runs from wait(a), through the instructions that compute a, to signal(a). Instruction scheduling can shorten this path by moving the computation of a and the signal earlier in the thread.
[Figure: long vs. short critical path between wait(a), "a = …", and signal(a)]
Potential
Compiler Infrastructure
• Loops were targeted; some loops were discarded:
– Coverage below 0.1% of execution time
– Fewer than 30 instructions per iteration
– More than 16,384 instructions per iteration
• The remaining loops were profiled and simulated to measure an optimistic upper bound, and good loops were chosen based on the results.
• The compiler inserts instructions to create and manage epochs.
• The compiler allocates forwarded variables on the stack (the forwarding frame).
• The compiler inserts wait and signal instructions.
Synchronization
Constraints:
• Wait before the first use
• Signal after the last definition
• Signal on every possible path
• The wait should be as late as possible
• The signal should be as early as possible
Dataflow Analysis for Synchronization
• The set of forwarded scalars is the intersection of the scalars with a downward-exposed definition and those with an upward-exposed use; scalars that are live outside the loop are also included.
• The CFG is modeled as a graph with basic blocks as nodes.
Instruction Scheduling
• Conservative scheduling
• Aggressive scheduling, speculating past:
– Control dependences
– Data dependences
Instruction Scheduling
Starting from the initial synchronization — wait(a) before the first use "… = a" and signal(a) after the last definition "a = …" — the compiler first schedules instructions to move "a = …; signal(a)" earlier, and can then speculatively schedule them even above an ambiguous store "*q = …", sending a corrected signal if the speculation turns out wrong.
[Figure: three stages — initial synchronization; scheduling instructions; speculatively scheduling instructions past "*q = …"]
Instruction Scheduling
Dataflow analysis handles complex control flow. We define two dataflow analyses:
• "Stack" analysis: finds the instructions needed to compute the forwarded value.
• "Earliest" analysis: finds the earliest node at which the forwarded value can be computed.

Computation Stack
A computation stack stores the instructions needed to compute a forwarded value; a stack is associated with every node for every forwarded scalar. A node's stack is in one of three states: not yet evaluated; "we don't know how to compute the forwarded value"; or "we know how to compute the forwarded value" (e.g. a stack holding "a = a*11; signal a").
A Simplified Example from GCC
do {
  . . .
} while (p);
[Figure: the loop’s CFG — start; "counter=0; wait(p); p->jmp?"; "p=p->jmp"; "q=p; p->real?"; "p=p->next"; "counter++; q=q->next; q?"; "p=p->next"; end with "signal p"]
Stack Analysis
The analysis walks the CFG backwards from "signal p" at the end node. At each node it builds up the stack of instructions needed to compute the forwarded value of p: starting from "signal p", it prepends "p=p->next" through the loop-closing nodes, and "p=p->jmp" along the path through "p->jmp?". Nodes whose stacks change are marked for revisiting, and the analysis iterates until no node needs revisiting — at that point the solution is consistent. The resulting stacks hold "p=p->next; signal p" on the fall-through path and "p=p->jmp; p=p->next; signal p" on the jump path.
[Figure: successive snapshots of the CFG with per-node computation stacks filling in and nodes marked for revisiting, ending when the solution is consistent]
The Earliest Analysis
Using the stacks from the stack analysis, the earliest analysis classifies each node as "earliest" or "not earliest": a node is earliest if the forwarded value of p can first be fully computed there. Nodes start out not evaluated, and the analysis iterates until every node is classified. In the example, the earliest points are the node where "p=p->next" can be computed on the fall-through path and the node where "p=p->jmp" followed by "p=p->next" can be computed on the jump path.
[Figure: successive snapshots of the CFG with nodes progressively marked Earliest / Not Earliest / Not Evaluated]
Code Transformation
At the earliest points, the compiler materializes the stack contents into code: on the jump path it inserts "p2=p->jmp; p1=p2->next; signal(p1)", and on the fall-through path "p1=p->next; signal(p1)".
[Figure: the transformed CFG with the two signal sequences inserted at the earliest nodes]
Aggressive Scheduling: Control Dependences
• Optimize the common case by moving the signal up.
• On the wrong path, send violate_epoch to the next thread and then forward the correct value.
• If instructions are scheduled past branches, then when an exception occurs, a violation must be sent and the non-speculative code executed.
• Changes to the analysis:
– Modify the meet operator to compute stacks only for frequently executed paths
– Add a new node for infrequent paths and insert violate_epoch signals; Earliest is set to true for these nodes.
Aggressive Scheduling: Data Dependences
• Two new instructions:
– mark_load – tells the hardware to remember the address of a location; placed where a load is moved above a store.
– unmark_load – clears the mark; placed at the original position of the load instruction.
• In the meet operation, conflicting load marks are merged with a logical OR.
Speculating Beyond a Control Dependence
[Figure: the frequently executed path signals p early with "p1=p->next; signal(p1)"; the infrequent path instead executes "violate(p); p=p->jmp; signal(p)" to squash the consumer and forward the correct value]
Speculating Beyond a Potential Data Dependence
Guided by profiling information, the load "p = p->next" is scheduled speculatively above the ambiguous store "*q = NULL", and the signal moves up with it. The hardware support is similar to the memory conflict buffer [Gallagher et al., ASPLOS’94].
[Figure: the CFG before and after, with "p = p->next; signal(p)" hoisted above "*q = NULL" as a speculative load]
Experimental Framework
Benchmarks
– SPECint95 and SPECint2000, compiled with -O3
Underlying architecture
– 4-processor single-chip multiprocessor
– speculation supported by the coherence protocol
Simulator
– superscalar cores, similar to the MIPS R14K
– simulates communication latency
– models all bandwidth and contention in detailed simulation
[Figure: the simulated CMP — four processor/cache pairs on an interconnect]
Impact of Synchronization Stalls for Scalars
Performance bottleneck: synchronization (about 40% of execution time).
Detailed simulation: TLS support; 4-processor CMP; 4-way issue, out-of-order superscalar; 10-cycle communication latency.
[Figure: normalized region execution time (0–100) for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88ksim, and vortex, split into synchronization stall and other time]
Instruction Scheduling
U = no instruction scheduling; A = instruction scheduling.
Instruction scheduling improves performance by 18%, but there is still room for improvement: synchronization stalls of 5%, 22%, and 40% remain on some benchmarks.
[Figure: normalized region execution time (0–100) for each benchmark under U and A, split into busy, failed speculation, synchronization stall, and other time]
Aggressively Scheduling Instructions
A = instruction scheduling; S = speculating across control and data dependences.
Aggressive scheduling helps significantly for some benchmarks, with improvements of 17%, 19%, and 15% annotated.
[Figure: normalized region execution time (0–100) for gcc, mcf, parser, perlbmk, and twolf under A and S, split into busy, failed speculation, synchronization stall, and other time]
Conclusions
• For 6 of 14 applications, performance improved by 6.2–28.5%.
• Synchronization and parallelization expose some new data dependences.
• Speculative scheduling past control and data dependences gives better performance.
Memory-Resident Values
• Memory-resident values are difficult because of:
– Aliasing: traditional dataflow analysis does not help
– No clear way to define the location of the last definition or the first use
• The potential performance gain from reducing failed-speculation cycles can be seen below.
Synchronizing Hardware
• Signal both the address and the value to the next thread.
• The producer requires a signal address buffer to ensure correct execution: if a store address is found in the signal address buffer, it is a misspeculation.
• The consumer has a local flag (the use-forwarded flag) to decide whether to load the value from the speculative cache or from local memory:
– The flag is reset if the same thread writes to the memory location before reading it.
– A NULL address is handled like any other address; when it causes an exception, the non-speculative code is used.
Compiler Support
• Each load and store is assigned an ID based on the call stack and is profiled for dependences.
• A dependence graph is constructed, and all IDs accessing the same location are grouped.
• All loads and stores belonging to the same group are synchronized by the compiler.
• The compiler then clones all the procedures on the call stacks containing frequent data dependences, so that synchronization executes only in the context of those call stacks; the original code is modified to call the cloned procedures.
• Dataflow analysis similar to the scalar case is performed to insert a signal for the last store at the end of every path.
Analysis of Data Dependence Patterns
• A potential study: deciding when to speculate and when to synchronize.
– Synchronization yields a significant performance improvement when inter-epoch dependences occurring in more than 5% of all epochs are predicted correctly.
– Dependence distances of 1 epoch have a significant effect on speedup.
Impact of Failed Speculation on Memory-Resident Values
Next performance bottleneck: failed speculation (about 30% of execution time).
Detailed simulation: TLS support; 4-processor CMP; 4-way issue, out-of-order superscalar; 10-cycle communication latency.
[Figure: normalized region execution time (0–100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, and twolf, split into failed speculation and other time]
Compiler-Inserted Synchronization
U = no synchronization inserted; C = compiler-inserted synchronization.
Seven benchmarks speed up by 5% to 46% (annotated: 10%, 46%, 13%, 5%, 8%, 5%, 21%).
[Figure: normalized region execution time (0–100) for each benchmark under U and C, split into busy, failed speculation, synchronization stall, and other time]
Compiler- vs. Hardware-Inserted Synchronization
C = compiler-inserted synchronization; H = hardware-inserted synchronization.
Compiler and hardware each benefit different benchmarks: hardware does better on some, the compiler on others.
[Figure: normalized region execution time (0–100) for each benchmark under C and H, split into busy, failed speculation, synchronization stall, and other time]
Combining Hardware and Compiler Synchronization
C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = combining both.
The combination is more robust.
[Figure: normalized region execution time (0–100) for go, m88ksim, gzip_comp, gzip_decomp, perlbmk, and gap under C, H, and B, split into busy, failed speculation, synchronization stall, and other time]
Conclusion
• Compiler-inserted memory synchronization improves performance.
• Hardware synchronization and compiler synchronization target different memory instructions and together do a better job.
Some Other Issues
• Register pressure due to scheduling scalar and memory instructions.
• Profiling time needed to obtain the frequency of data dependences.
• Heuristics: the choice of thresholds.