HLL VM Implementation
Chapters 6.5–6.7
Kim, Jung ki (김정기)
October 11th, 2006
HLL VM Implementation
System Programming Special Lecture (특강), 2006
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
Case study: The Jikes Research Virtual Machine
Basic Emulation
The emulation engine in a JVM can be implemented in a number of ways
–interpretation
–just-in-time (JIT) compilation
JIT
–methods are compiled at the time they are first invoked
–JIT compilation is practical because the Java ISA instructions belonging to a method can easily be discovered
JIT vs. conventional compiler
–a JIT has no frontend for parsing and syntax checking before the intermediate form: the bytecode arrives already validated
–it starts optimization from a different intermediate form
Optimization strategy
–multiple optimization levels, chosen through profiling
–optimizations applied selectively to hot spots, not entire methods
Examples
–interpretation (plus JIT): Sun HotSpot, IBM DK
–compilation only: Jikes RVM
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
Case study: The Jikes Research Virtual Machine
High-Performance Emulation
Two challenges for HLL VMs
–offsetting run-time optimization overhead with execution-time improvements
–making object-oriented programs go fast despite their frequent use of addressing indirection and small methods
Optimization Framework
[Figure: the optimization framework. Bytecodes are first executed by an interpreter (or translated by a simple compiler), which gathers profile data; guided by that profile data, the optimizing compiler turns compiled code into optimized code, all running on the host platform.]
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
•Code Relayout
•Method Inlining
•Optimizing Virtual Method Calls
•Multiversioning and Specialization
•On-Stack Replacement
•Optimization of Heap-Allocated Objects
•Low-Level Optimizations
•Optimizing Garbage Collection
Case study: The Jikes Research Virtual Machine
Code Relayout
–the most commonly followed control-flow paths are placed in contiguous locations in memory
–improves instruction locality and conditional branch predictability
Code Relayout
[Figure: code relayout. A control-flow graph of basic blocks A–G, with profiled edge counts (e.g., 97 vs. 3 leaving A), and the relaid-out code: blocks are emitted in the order A, D, F, G, E, B, C, with branch conditions inverted (Br cond1 == false, Br cond3 == true, Br cond2 == false, Br cond4 == true) so that the frequently taken paths fall through into contiguous code.]
Method Inlining
Benefits
–calling overhead decreases, which matters especially in object-oriented code
•passing parameters
•managing the stack frame
•control transfer
–the scope of code analysis expands
•more optimizations become applicable
Effects differ with method size
–small methods: beneficial in most cases
–large methods: the calling sequence is a small fraction of the cost, so a sophisticated cost-benefit analysis is needed; otherwise code explosion may occur, with poor cache behavior and performance losses
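The effect of inlining a small method can be sketched in Java (the class and method names here are invented for illustration): the hand-inlined version exposes the field access that the call was hiding, removing call overhead and widening the analysis scope.

```java
// Illustration of what the inliner effectively does to a small accessor.
class Point {
    private int x;
    Point(int x) { this.x = x; }

    int getX() { return x; }   // small method: call overhead dominates its cost

    // Before inlining: one virtual call per element.
    static int sumBefore(Point[] pts) {
        int sum = 0;
        for (Point p : pts) sum += p.getX();
        return sum;
    }

    // After inlining (written by hand here): the body of getX() is
    // substituted at the call site, so the field access is exposed to
    // further optimization and the call overhead disappears.
    static int sumAfter(Point[] pts) {
        int sum = 0;
        for (Point p : pts) sum += p.x;
        return sum;
    }
}
```

Both versions compute the same result; the difference is only in what the compiler can subsequently see and optimize.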
Method Inlining
Processing sequence
1. profile by instrumentation
2. construct a call graph at certain intervals
3. invoke the dynamic optimization system when a call count exceeds a threshold
Reducing analysis overhead
–a profile counter is kept in the stack frame
–when the threshold is met, "walk" backward through the stack to recover the hot call chain
Method Inlining
[Figure: two ways to find inlining candidates. A call graph (MAIN calls A and X; A calls B and C) annotated with call counts (e.g., MAIN→A 900, A→C 1500). With a full call graph, hot call edges are those whose counts exceed a threshold; with the cheaper stack-frame scheme, the same hot chain MAIN→A→C is recovered by walking backward through the stack when a frame's profile counter crosses its threshold.]
Optimizing Virtual Method Calls
–which code to execute is determined at run time via a dynamic method table lookup
–if profiling shows one receiver type is by far the most common case, the compiler can inline that method behind a type guard:

    invokevirtual <perimeter>

becomes

    if (a instanceof Square) {
        ... inlined code for Square's perimeter ...
    } else {
        invokevirtual <perimeter>
    }
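A runnable sketch of this guard-based devirtualization, using the slide's Square/perimeter example (the Circle class and the exact method bodies are assumptions added for illustration):

```java
// The compiler predicts the common receiver type (Square) and inlines
// its perimeter() body behind an instanceof guard, keeping the virtual
// call as a fallback for all other types.
abstract class Shape { abstract int perimeter(); }

class Square extends Shape {
    int side;
    Square(int side) { this.side = side; }
    int perimeter() { return 4 * side; }
}

class Circle extends Shape {
    int radius;
    Circle(int r) { this.radius = r; }
    int perimeter() { return (int) Math.round(2 * Math.PI * radius); }
}

class Devirt {
    // What invokevirtual <perimeter> becomes after the optimization:
    static int perimeterGuarded(Shape a) {
        if (a instanceof Square) {
            return 4 * ((Square) a).side;  // inlined Square.perimeter()
        } else {
            return a.perimeter();          // fallback: dynamic dispatch
        }
    }
}
```

When the guard almost always succeeds, the dynamic method table lookup is avoided on the hot path.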
Optimizing Virtual Method Calls
Even when inlining itself is not useful, simply removing the method table lookup (calling the predicted target directly behind the guard) is helpful
Polymorphic Inline Caching
The call site

    ... invokevirtual <perimeter> ...

is rewritten to call a stub:

    ... call PIC stub ...

polymorphic inline cache (PIC) stub:

    if type == circle, jump to circle perimeter code
    else if type == square, jump to square perimeter code
    else call lookup    (method table lookup; updates the PIC stub with the new type)
Multiversioning and Specialization
Multiversioning by specialization
–if some variables or references are always assigned data values or types known to be constant (or drawn from a limited range), simplified, specialized code can be used

General code:

    for (int i=0; i<1000; i++) {
        if (A[i] < 0) B[i] = -A[i]*C[i];
        else B[i] = A[i]*C[i];
    }

Specialized code (fast path when A[i] == 0), with the general code kept as the other version:

    for (int i=0; i<1000; i++) {
        if (A[i] == 0)
            B[i] = 0;
        else {
            if (A[i] < 0) B[i] = -A[i]*C[i];
            else B[i] = A[i]*C[i];
        }
    }
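A runnable Java version of this example, with the zero fast path guarded and the original general code kept as the fallback version (array lengths are parameterized here rather than fixed at 1000):

```java
// Multiversioning: the specialized version handles the common case
// A[i] == 0 without any multiply, falling back to the general code
// for every other value. Both versions must compute the same result.
class Multiversion {
    static void general(int[] A, int[] B, int[] C) {
        for (int i = 0; i < A.length; i++) {
            if (A[i] < 0) B[i] = -A[i] * C[i];
            else          B[i] =  A[i] * C[i];
        }
    }

    static void specialized(int[] A, int[] B, int[] C) {
        for (int i = 0; i < A.length; i++) {
            if (A[i] == 0) {
                B[i] = 0;                          // fast path: no multiply
            } else {                               // general version as fallback
                if (A[i] < 0) B[i] = -A[i] * C[i];
                else          B[i] =  A[i] * C[i];
            }
        }
    }
}
```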
Multiversioning and Specialization
Deferred compilation of the general case
–only the specialized version is compiled; if the general case is ever reached, control jumps into the dynamic compiler, which compiles it on demand

    for (int i=0; i<1000; i++) {
        if (A[i] == 0)
            B[i] = 0;
        else
            jump to dynamic compiler for deferred compilation
    }
On-Stack Replacement
When a better version of a running method becomes available, there is no benefit until the next call unless the running activation is switched over; this switch is on-stack replacement (OSR)
The implementation stack frame must be modified on the fly
OSR is needed for
–inlining within a long-running method
–deferred compilation
–debugging (the user expects to observe the architected instruction sequence)
On-Stack Replacement
[Figure: OSR. The implementation frame for a method compiled at optimization level x is converted, by way of the architected frame, into an implementation frame for the same method compiled at optimization level y (optimizing or de-optimizing the method code).]

1. extract the architected state from the current implementation frame
2. generate a new implementation frame for the new version of the method code
3. replace the current implementation stack frame with the new one
On-Stack Replacement
OSR is a complex operation
If the initial stack frame is maintained by an interpreter or a nonoptimizing compiler, extracting the architected stack state is straightforward
Otherwise, the compiler may define a set of program points where OSR can potentially occur and ensure that the architected values are live at those points in the execution
On-Stack Replacement
Meaning of OSR
–its steady-state performance benefits are small
–but it allows the implementation of debuggers for optimized code
–reduces start-up time via deferred compilation
–improves cache performance
Optimization of Heap-Allocated Objects
Creating objects and garbage-collecting them carry a high cost
–the code for heap allocation and object initialization can be inlined for frequently allocated objects
Scalar replacement
–enabled by escape analysis
–effective for reducing object access delays
Optimization of Heap-Allocated Objects
Scalar replacement — access delays are reduced

    class square {
        int side;
        int area;
    }
    void calculate() {
        square a = new square();
        a.side = 3;
        a.area = a.side * a.side;
        System.out.println(a.area);
    }

after scalar replacement:

    void calculate() {
        int t1 = 3;
        int t2 = t1 * t1;
        System.out.println(t2);
    }
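The slide's example can be made runnable (returning the result instead of printing it, so the two versions can be compared; the class name `SquareObj` is invented to avoid clashing with the earlier Square example):

```java
// Scalar replacement: escape analysis proves the object never leaves
// calculate(), so its fields become locals and the allocation vanishes.
class ScalarReplacement {
    static class SquareObj { int side; int area; }

    static int before() {
        SquareObj a = new SquareObj();   // heap allocation
        a.side = 3;                      // field store
        a.area = a.side * a.side;        // field loads + store
        return a.area;
    }

    static int after() {
        int t1 = 3;                      // a.side replaced by scalar t1
        int t2 = t1 * t1;                // a.area replaced by scalar t2
        return t2;                       // no allocation, no field accesses
    }
}
```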
Optimization of Heap-Allocated Objects
field ordering for data usage patterns — to improve data cache performance
removing redundant object accesses — redundant getfield (load) removal:

    a = new square;
    b = new square;
    c = a;
    ...
    a.side = 5;
    b.side = 10;
    z = c.side;

becomes

    a = new square;
    b = new square;
    c = a;
    ...
    t1 = 5;
    a.side = t1;
    b.side = 10;
    z = t1;        (c aliases a, so c.side is known to hold t1)
Low-Level Optimizations
The cost of array range and null reference checking is significant in object-oriented HLL VMs
These checks, required for throwing precise exceptions, cause two kinds of performance loss
–the overhead of performing the check itself
–some optimizations are inhibited by the need to maintain a precise state
Low-Level Optimizations
Removing redundant null checks — once p has been checked, later accesses through p (and through its alias r) need no check:

    p = new Z
    q = new Z
    r = p
    ...
    p.x = ...    <null check p>
    ... = p.x    <null check p>
    ...
    q.x = ...    <null check q>
    ...
    r.x = ...    <null check r (= p)>

becomes

    p = new Z
    q = new Z
    r = p
    ...
    p.x = ...    <null check p>
    ... = p.x
    r.x = ...
    q.x = ...    <null check q>
Low-Level Optimizations
Hoisting an invariant check — the range check can be hoisted outside the loop

    for (int i=0; i<j; i++) {
        sum += A[i];    <range check A>
    }

becomes

    if (j < A.length)
        for (int i=0; i<j; i++) {
            sum += A[i];
        }
    else
        for (int i=0; i<j; i++) {
            sum += A[i];    <range check A>
        }
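Source-level Java cannot actually omit a bounds check, so the following runnable sketch only mirrors the shape of the transformation; the comments mark where the compiled code would differ:

```java
// Invariant-check hoisting: one explicit test outside the loop selects
// a loop body whose per-iteration range checks are provably redundant.
class HoistCheck {
    static int sumChecked(int[] A, int j) {
        int sum = 0;
        for (int i = 0; i < j; i++) {
            sum += A[i];            // implicit range check every iteration
        }
        return sum;
    }

    static int sumHoisted(int[] A, int j) {
        int sum = 0;
        if (j < A.length) {
            // compiled code omits the per-iteration range checks here
            for (int i = 0; i < j; i++) sum += A[i];
        } else {
            // checks retained; may throw exactly as the original would
            for (int i = 0; i < j; i++) sum += A[i];
        }
        return sum;
    }
}
```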
Low-Level Optimizations
Loop peeling — after the first iteration is peeled off, the null check is not needed for the remaining iterations

    for (int i=0; i<100; i++) {
        r = A[i];
        B[i] = r*2;
        p.x += A[i];    <null check p>
    }

becomes

    r = A[0];
    B[0] = r*2;
    p.x += A[0];    <null check p>
    for (int i=1; i<100; i++) {
        r = A[i];
        B[i] = r*2;
        p.x += A[i];
    }
Optimizing Garbage Collection
Compiler support
–The compiler provides the garbage collector with "yield points" at regular intervals in the code. At these points a thread can guarantee a consistent heap state, and control can be yielded to the garbage collector.
–The compiler also provides support tailored to the specific garbage-collection algorithm.
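A hypothetical sketch of a compiler-inserted yield point: a flag set by the collector is tested at each loop back edge, a point where the thread's heap state is consistent (the flag and counter names are invented for this sketch; a real VM parks the thread rather than just counting):

```java
// Cooperative yield points: the compiler emits a cheap flag test at
// loop back edges so a running thread can hand control to the GC at a
// point where its stack maps and heap state are consistent.
class YieldPoints {
    static volatile boolean gcRequested = false;  // set by the collector
    static int safepointsTaken = 0;

    static int work(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
            // --- yield point inserted by the compiler at the back edge ---
            if (gcRequested) {
                safepointsTaken++;     // here the thread could block while
                gcRequested = false;   // the garbage collector runs
            }
        }
        return sum;
    }
}
```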
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
Case study: The Jikes Research Virtual Machine
The Jikes Research Virtual Machine
–an open-source research Java virtual machine, originally developed by IBM
–distinct from IBM's similarly named Jikes Java compiler (an open-source bytecode compiler once noted for fast compilation of small projects, no longer actively developed)
The Jikes Research Virtual Machine
–compile-only strategy: there is no interpretation step
–a baseline compiler first translates bytecodes into native code that simply emulates the Java stack; optimization is applied afterward
–the dynamic optimizing compiler applies optimizations based on a cost-benefit estimate
–multithreaded implementation
–preemptive thread scheduling via a control bit
The Jikes Research Virtual Machine
Adaptive Optimization System (AOS)
–runtime measurement subsystem: gathers raw performance data by sampling at yield points
–recompilation subsystem
–controller: coordinates the subsystems' activities and determines the optimization level used in recompilation
Cost-benefit function (for a given method j)
–Cj = cost of recompiling, Tj = estimated execution time if recompiled, Ti = estimated execution time if not recompiled
–recompile when Cj + Tj < Ti
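The controller's decision rule can be written as a tiny function (how the three quantities are estimated, and in what units, is outside this sketch):

```java
// AOS cost-benefit rule: recompile method j at a higher level only if
// the recompilation cost plus the estimated optimized running time
// beats the estimated running time of the current version.
class CostBenefit {
    // cj: cost of recompiling; tj: estimated future execution time if
    // recompiled; ti: estimated future execution time if left as-is.
    static boolean shouldRecompile(double cj, double tj, double ti) {
        return cj + tj < ti;
    }
}
```

For example, a method expected to run 20 more time units, reducible to 10 by a recompilation costing 5, is worth recompiling; if it were only expected to run 12 more units, it would not be.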
The Jikes Research Virtual Machine
[Figure: AOS architecture. The runtime measurement subsystem samples the executing code at yield points; a hot-method organizer turns the collected method samples into profile data in the AOS database and posts events to the controller's event queue. The controller issues instrumentation/compilation plans on a compilation queue; a compilation thread drives the optimizing compiler, which installs the instrumented/optimized code as new executing code.]
The Jikes Research Virtual Machine
Optimization levels
Level 0
–copy/constant propagation, branch optimization, etc.
–inlining of trivial methods
–simple code relayout; register allocation by simple linear scan
Level 1
–higher-level code restructuring: more aggressive inlining and code relayout
Level 2
–uses a static single assignment (SSA) intermediate form
–SSA allows global optimizations
–loop unrolling, elimination of loop-closing branches
The Jikes Research Virtual Machine
[Figure: start-up performance]
The Jikes Research Virtual Machine
[Figure: steady-state performance]