HLL VM Implementation
Chapters 6.5–6.7
Kim, Jung ki (김정기)
October 11th, 2006
HLL VM Implementation
System Programming Special Lecture (특강), 2006
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
Case study: The Jikes Research Virtual Machine
Basic Emulation
The emulation engine in a JVM can be implemented in a number of ways
–interpretation
–just-in-time (JIT) compilation
JIT
–methods are compiled at the time they are first invoked
–JIT compilation is practical because the Java ISA instructions belonging to a method can easily be discovered
JIT vs. conventional compiler
–a JIT has no frontend for parsing and syntax checking before the intermediate form: the bytecode arrives already validated
–it starts optimization from a different intermediate form
Optimization strategy
–multiple optimization levels, chosen through profiling
–optimizations applied selectively to hot spots, not entire methods
Examples
–interpretation (plus JIT): Sun HotSpot, IBM DK
–compilation only: Jikes RVM
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
Case study: The Jikes Research Virtual Machine
High-Performance Emulation
Two challenges for HLL VMs
–offsetting run-time optimization overhead with execution-time improvements
–making object-oriented programs go fast despite their frequent use of addressing indirection and small methods
Optimization Framework
[Figure: the optimization framework. Bytecodes are first executed by an interpreter (or translated by a simple compiler), which gathers profile data; guided by that profile data, the optimizing compiler turns compiled code into optimized code, all running on the host platform.]
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
•Code Relayout
•Method Inlining
•Optimizing Virtual Method Calls
•Multiversioning and Specialization
•On-Stack Replacement
•Optimization of Heap-Allocated Objects
•Low-Level Optimizations
•Optimizing Garbage Collection
Case study: The Jikes Research Virtual Machine
Code Relayout
–the most commonly followed control-flow paths are placed in contiguous locations in memory
–improves instruction locality and conditional branch predictability
Code Relayout
[Figure: code relayout. A control-flow graph of basic blocks A–G, with profiled edge counts (e.g., 97 vs. 3 leaving A), and the relaid-out code: blocks are emitted in the order A, D, F, G, E, B, C, with branch conditions inverted (Br cond1 == false, Br cond3 == true, Br cond2 == false, Br cond4 == true) so that the frequently taken paths fall through into contiguous code.]
Method Inlining
Benefits
–calling overhead decreases, which matters especially in object-oriented code
•passing parameters
•managing the stack frame
•control transfer
–the scope of code analysis expands
•more optimizations become applicable
Effects differ with method size
–small methods: beneficial in most cases
–large methods: the calling sequence is a small fraction of the cost, so a sophisticated cost-benefit analysis is needed; otherwise code explosion may occur, with poor cache behavior and performance losses
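The effect of inlining a small method can be sketched in Java (the class and method names here are invented for illustration): the hand-inlined version exposes the field access that the call was hiding, removing call overhead and widening the analysis scope.

```java
// Illustration of what the inliner effectively does to a small accessor.
class Point {
    private int x;
    Point(int x) { this.x = x; }

    int getX() { return x; }   // small method: call overhead dominates its cost

    // Before inlining: one virtual call per element.
    static int sumBefore(Point[] pts) {
        int sum = 0;
        for (Point p : pts) sum += p.getX();
        return sum;
    }

    // After inlining (written by hand here): the body of getX() is
    // substituted at the call site, so the field access is exposed to
    // further optimization and the call overhead disappears.
    static int sumAfter(Point[] pts) {
        int sum = 0;
        for (Point p : pts) sum += p.x;
        return sum;
    }
}
```

Both versions compute the same result; the difference is only in what the compiler can subsequently see and optimize.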
Method Inlining
Processing sequence
1. profile by instrumentation
2. construct a call graph at certain intervals
3. invoke the dynamic optimization system when a call count exceeds a threshold
Reducing analysis overhead
–a profile counter is kept in the stack frame
–when the threshold is met, "walk" backward through the stack to recover the hot call chain
Method Inlining
[Figure: two ways to find inlining candidates. A call graph (MAIN calls A and X; A calls B and C) annotated with call counts (e.g., MAIN→A 900, A→C 1500). With a full call graph, hot call edges are those whose counts exceed a threshold; with the cheaper stack-frame scheme, the same hot chain MAIN→A→C is recovered by walking backward through the stack when a frame's profile counter crosses its threshold.]
Optimizing Virtual Method Calls
–which code to execute is determined at run time via a dynamic method table lookup
–if profiling shows one receiver type is by far the most common case, the compiler can inline that method behind a type guard:

    invokevirtual <perimeter>

becomes

    if (a instanceof Square) {
        ... inlined code for Square's perimeter ...
    } else {
        invokevirtual <perimeter>
    }
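A runnable sketch of this guard-based devirtualization, using the slide's Square/perimeter example (the Circle class and the exact method bodies are assumptions added for illustration):

```java
// The compiler predicts the common receiver type (Square) and inlines
// its perimeter() body behind an instanceof guard, keeping the virtual
// call as a fallback for all other types.
abstract class Shape { abstract int perimeter(); }

class Square extends Shape {
    int side;
    Square(int side) { this.side = side; }
    int perimeter() { return 4 * side; }
}

class Circle extends Shape {
    int radius;
    Circle(int r) { this.radius = r; }
    int perimeter() { return (int) Math.round(2 * Math.PI * radius); }
}

class Devirt {
    // What invokevirtual <perimeter> becomes after the optimization:
    static int perimeterGuarded(Shape a) {
        if (a instanceof Square) {
            return 4 * ((Square) a).side;  // inlined Square.perimeter()
        } else {
            return a.perimeter();          // fallback: dynamic dispatch
        }
    }
}
```

When the guard almost always succeeds, the dynamic method table lookup is avoided on the hot path.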
Optimizing Virtual Method Calls
Even when inlining itself is not useful, simply removing the method table lookup (calling the predicted target directly behind the guard) is helpful
Polymorphic Inline Caching
The call site

    ... invokevirtual <perimeter> ...

is rewritten to call a stub:

    ... call PIC stub ...

polymorphic inline cache (PIC) stub:

    if type == circle, jump to circle perimeter code
    else if type == square, jump to square perimeter code
    else call lookup    (method table lookup; updates the PIC stub with the new type)
Multiversioning and Specialization
Multiversioning by specialization
–if some variables or references are always assigned data values or types known to be constant (or drawn from a limited range), simplified, specialized code can be used

General code:

    for (int i=0; i<1000; i++) {
        if (A[i] < 0) B[i] = -A[i]*C[i];
        else B[i] = A[i]*C[i];
    }

Specialized code (fast path when A[i] == 0), with the general code kept as the other version:

    for (int i=0; i<1000; i++) {
        if (A[i] == 0)
            B[i] = 0;
        else {
            if (A[i] < 0) B[i] = -A[i]*C[i];
            else B[i] = A[i]*C[i];
        }
    }
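A runnable Java version of this example, with the zero fast path guarded and the original general code kept as the fallback version (array lengths are parameterized here rather than fixed at 1000):

```java
// Multiversioning: the specialized version handles the common case
// A[i] == 0 without any multiply, falling back to the general code
// for every other value. Both versions must compute the same result.
class Multiversion {
    static void general(int[] A, int[] B, int[] C) {
        for (int i = 0; i < A.length; i++) {
            if (A[i] < 0) B[i] = -A[i] * C[i];
            else          B[i] =  A[i] * C[i];
        }
    }

    static void specialized(int[] A, int[] B, int[] C) {
        for (int i = 0; i < A.length; i++) {
            if (A[i] == 0) {
                B[i] = 0;                          // fast path: no multiply
            } else {                               // general version as fallback
                if (A[i] < 0) B[i] = -A[i] * C[i];
                else          B[i] =  A[i] * C[i];
            }
        }
    }
}
```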
Multiversioning and Specialization
Deferred compilation of the general case
–only the specialized version is compiled; if the general case is ever reached, control jumps into the dynamic compiler, which compiles it on demand

    for (int i=0; i<1000; i++) {
        if (A[i] == 0)
            B[i] = 0;
        else
            jump to dynamic compiler for deferred compilation
    }
On-Stack Replacement
When a better version of a running method becomes available, there is no benefit until the next call unless the running activation is switched over; this switch is on-stack replacement (OSR)
The implementation stack frame must be modified on the fly
OSR is needed for
–inlining within a long-running method
–deferred compilation
–debugging (the user expects to observe the architected instruction sequence)
On-Stack Replacement
[Figure: OSR. The implementation frame for a method compiled at optimization level x is converted, by way of the architected frame, into an implementation frame for the same method compiled at optimization level y (optimizing or de-optimizing the method code).]

1. extract the architected state from the current implementation frame
2. generate a new implementation frame for the new version of the method code
3. replace the current implementation stack frame with the new one
On-Stack Replacement
OSR is a complex operation
If the initial stack frame is maintained by an interpreter or a nonoptimizing compiler, extracting the architected stack state is straightforward
Otherwise, the compiler may define a set of program points where OSR can potentially occur and ensure that the architected values are live at those points in the execution
On-Stack Replacement
Meaning of OSR
–its steady-state performance benefits are small
–but it allows the implementation of debuggers for optimized code
–reduces start-up time via deferred compilation
–improves cache performance
Optimization of Heap-Allocated Objects
Creating objects and garbage-collecting them carry a high cost
–the code for heap allocation and object initialization can be inlined for frequently allocated objects
Scalar replacement
–enabled by escape analysis
–effective for reducing object access delays
Optimization of Heap-Allocated Objects
Scalar replacement — access delays are reduced

    class square {
        int side;
        int area;
    }
    void calculate() {
        square a = new square();
        a.side = 3;
        a.area = a.side * a.side;
        System.out.println(a.area);
    }

after scalar replacement:

    void calculate() {
        int t1 = 3;
        int t2 = t1 * t1;
        System.out.println(t2);
    }
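The slide's example can be made runnable (returning the result instead of printing it, so the two versions can be compared; the class name `SquareObj` is invented to avoid clashing with the earlier Square example):

```java
// Scalar replacement: escape analysis proves the object never leaves
// calculate(), so its fields become locals and the allocation vanishes.
class ScalarReplacement {
    static class SquareObj { int side; int area; }

    static int before() {
        SquareObj a = new SquareObj();   // heap allocation
        a.side = 3;                      // field store
        a.area = a.side * a.side;        // field loads + store
        return a.area;
    }

    static int after() {
        int t1 = 3;                      // a.side replaced by scalar t1
        int t2 = t1 * t1;                // a.area replaced by scalar t2
        return t2;                       // no allocation, no field accesses
    }
}
```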
Optimization of Heap-Allocated Objects
field ordering for data usage patterns — to improve data cache performance
removing redundant object accesses — redundant getfield (load) removal:

    a = new square;
    b = new square;
    c = a;
    ...
    a.side = 5;
    b.side = 10;
    z = c.side;

becomes

    a = new square;
    b = new square;
    c = a;
    ...
    t1 = 5;
    a.side = t1;
    b.side = 10;
    z = t1;        (c aliases a, so c.side is known to hold t1)
Low-Level Optimizations
The cost of array range and null reference checking is significant in object-oriented HLL VMs
These checks, required for throwing precise exceptions, cause two kinds of performance loss
–the overhead of performing the check itself
–some optimizations are inhibited by the need to maintain a precise state
Low-Level Optimizations
Removing redundant null checks — once p has been checked, later accesses through p (and through its alias r) need no check:

    p = new Z
    q = new Z
    r = p
    ...
    p.x = ...    <null check p>
    ... = p.x    <null check p>
    ...
    q.x = ...    <null check q>
    ...
    r.x = ...    <null check r (= p)>

becomes

    p = new Z
    q = new Z
    r = p
    ...
    p.x = ...    <null check p>
    ... = p.x
    r.x = ...
    q.x = ...    <null check q>
Low-Level Optimizations
Hoisting an invariant check — the range check can be hoisted outside the loop

    for (int i=0; i<j; i++) {
        sum += A[i];    <range check A>
    }

becomes

    if (j < A.length)
        for (int i=0; i<j; i++) {
            sum += A[i];
        }
    else
        for (int i=0; i<j; i++) {
            sum += A[i];    <range check A>
        }
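Source-level Java cannot actually omit a bounds check, so the following runnable sketch only mirrors the shape of the transformation; the comments mark where the compiled code would differ:

```java
// Invariant-check hoisting: one explicit test outside the loop selects
// a loop body whose per-iteration range checks are provably redundant.
class HoistCheck {
    static int sumChecked(int[] A, int j) {
        int sum = 0;
        for (int i = 0; i < j; i++) {
            sum += A[i];            // implicit range check every iteration
        }
        return sum;
    }

    static int sumHoisted(int[] A, int j) {
        int sum = 0;
        if (j < A.length) {
            // compiled code omits the per-iteration range checks here
            for (int i = 0; i < j; i++) sum += A[i];
        } else {
            // checks retained; may throw exactly as the original would
            for (int i = 0; i < j; i++) sum += A[i];
        }
        return sum;
    }
}
```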
Low-Level Optimizations
Loop peeling — after the first iteration is peeled off, the null check is not needed for the remaining iterations

    for (int i=0; i<100; i++) {
        r = A[i];
        B[i] = r*2;
        p.x += A[i];    <null check p>
    }

becomes

    r = A[0];
    B[0] = r*2;
    p.x += A[0];    <null check p>
    for (int i=1; i<100; i++) {
        r = A[i];
        B[i] = r*2;
        p.x += A[i];
    }
Optimizing Garbage Collection
Compiler support
–The compiler provides the garbage collector with "yield points" at regular intervals in the code. At these points a thread can guarantee a consistent heap state, and control can be yielded to the garbage collector.
–The compiler also provides support tailored to the specific garbage-collection algorithm.
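A hypothetical sketch of a compiler-inserted yield point: a flag set by the collector is tested at each loop back edge, a point where the thread's heap state is consistent (the flag and counter names are invented for this sketch; a real VM parks the thread rather than just counting):

```java
// Cooperative yield points: the compiler emits a cheap flag test at
// loop back edges so a running thread can hand control to the GC at a
// point where its stack maps and heap state are consistent.
class YieldPoints {
    static volatile boolean gcRequested = false;  // set by the collector
    static int safepointsTaken = 0;

    static int work(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
            // --- yield point inserted by the compiler at the back edge ---
            if (gcRequested) {
                safepointsTaken++;     // here the thread could block while
                gcRequested = false;   // the garbage collector runs
            }
        }
        return sum;
    }
}
```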
Contents
Basic Emulation
High-performance Emulation
–Optimization Framework
–Optimizations
Case study: The Jikes Research Virtual Machine
The Jikes Research Virtual Machine
–an open-source research Java virtual machine, originally developed by IBM
–distinct from IBM's similarly named Jikes Java compiler (an open-source bytecode compiler once noted for fast compilation of small projects, no longer actively developed)
The Jikes Research Virtual Machine
–compile-only strategy: there is no interpretation step
–a baseline compiler first translates bytecodes into native code that simply emulates the Java stack; optimization is applied afterward
–the dynamic optimizing compiler applies optimizations based on a cost-benefit estimate
–multithreaded implementation
–preemptive thread scheduling via a control bit
The Jikes Research Virtual Machine
Adaptive Optimization System (AOS)
–runtime measurement subsystem: gathers raw performance data by sampling at yield points
–recompilation subsystem
–controller: coordinates the subsystems' activities and determines the optimization level used in recompilation
Cost-benefit function (for a given method j)
–Cj = cost of recompiling, Tj = estimated execution time if recompiled, Ti = estimated execution time if not recompiled
–recompile when Cj + Tj < Ti
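The controller's decision rule can be written as a tiny function (how the three quantities are estimated, and in what units, is outside this sketch):

```java
// AOS cost-benefit rule: recompile method j at a higher level only if
// the recompilation cost plus the estimated optimized running time
// beats the estimated running time of the current version.
class CostBenefit {
    // cj: cost of recompiling; tj: estimated future execution time if
    // recompiled; ti: estimated future execution time if left as-is.
    static boolean shouldRecompile(double cj, double tj, double ti) {
        return cj + tj < ti;
    }
}
```

For example, a method expected to run 20 more time units, reducible to 10 by a recompilation costing 5, is worth recompiling; if it were only expected to run 12 more units, it would not be.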
The Jikes Research Virtual Machine
[Figure: AOS architecture. The runtime measurement subsystem samples the executing code at yield points; a hot-method organizer turns the collected method samples into profile data in the AOS database and posts events to the controller's event queue. The controller issues instrumentation/compilation plans on a compilation queue; a compilation thread drives the optimizing compiler, which installs the instrumented/optimized code as new executing code.]
The Jikes Research Virtual Machine
Optimization levels
Level 0
–copy/constant propagation, branch optimization, etc.
–inlining of trivial methods
–simple code relayout; register allocation by simple linear scan
Level 1
–higher-level code restructuring: more aggressive inlining and code relayout
Level 2
–uses a static single assignment (SSA) intermediate form
–SSA allows global optimizations
–loop unrolling, elimination of loop-closing branches
The Jikes Research Virtual Machine
[Figure: start-up performance]
The Jikes Research Virtual Machine
[Figure: steady-state performance]