sista: improving cog’s jit performance

71
Sista: Improving Cog’s JIT performance Clément Béra

Upload: esug

Post on 20-May-2015

464 views

Category:

Software


6 download

DESCRIPTION

Title: Sista: Improving Cog’s JIT performance Speaker: Clément Béra Thu, August 21, 9:45am – 10:30am Video Part1 https://www.youtube.com/watch?v=X4E_FoLysJg Video Part2 https://www.youtube.com/watch?v=gZOk3qojoVE Description Abstract: Although recent improvements of the Cog VM performance made it one of the fastest available Smalltalk virtual machine, the overhead compared to optimized C code remains important. Efficient industrial object oriented virtual machine, such as Javascript V8's engine for Google Chrome and Oracle Java Hotspot can reach on many benchs the performance of optimized C code thanks to adaptive optimizations performed their JIT compilers. The VM becomes then cleverer, and after executing numerous times the same portion of codes, it stops the code execution, looks at what it is doing and recompiles critical portion of codes in code faster to run based on the current environment and previous executions. Bio: Clément Béra and Eliot Miranda has been working together on Cog's JIT performance for the last year. Clément Béra is a young engineer and has been working in the Pharo team for the past two years. Eliot Miranda is a Smalltalk VM expert who, among others, has implemented Cog's JIT and the Spur Memory Manager for Cog.

TRANSCRIPT

Page 1: Sista: Improving Cog’s JIT performance

Sista: Improving Cog’s JIT

performanceClément Béra

Page 2: Sista: Improving Cog’s JIT performance

Main people involved in Sista• Eliot Miranda

• Over 30 years experience in Smalltalk VM

• Clément Béra

• 2 years engineer in the Pharo team

• Phd student starting from october

Page 3: Sista: Improving Cog’s JIT performance

Cog ?

• Smalltalk virtual machine

• Runs Pharo, Squeak, Newspeak, ...

Page 4: Sista: Improving Cog’s JIT performance

Plan

• Efficient VM architecture (Java, C#, ...)

• Sista goals

• Example: optimizing a method

• Our approach

• Status

Page 5: Sista: Improving Cog’s JIT performance

Virtual machineHigh level code (Java, Smalltalk ...)

CPU: intel, ARM ... - OS: Windows, Android, Linux ..

Runtime environment (JRE, JDK, Object engine...)

Page 6: Sista: Improving Cog’s JIT performance

Basic Virtual machineHigh level code (Java, Smalltalk ...)

Compiler

Bytecode Interpreter

Memory Manager

CPU RAM

BytecodeVM

Page 7: Sista: Improving Cog’s JIT performance

Fast virtual machineHigh level code (Java, Smalltalk ...)

Compiler

Bytecode Interpreter

JIT Compiler

Memory Manager

CPU RAM

BytecodeVM

Page 8: Sista: Improving Cog’s JIT performance

Efficient JIT compilerdisplay: listOfDrinks

listOfDrinks do: [ :drink |self displayOnScreen: drink ]

Execution number

time to run (ms) Comments

1 1 lookup of #displayOnScreen (cache) and byte code interpretation

2 to 6 0,5 byte code interpretation

7 2 generation of native code for displayOnScreen and native code run

•8 to 999 0,01 native code run

1000 2 adaptive recompilation based on runtime type information, generation of native code and optimized native code run

1000 + 0,002 optimized native code run

Page 9: Sista: Improving Cog’s JIT performance

In Cogdisplay: listOfDrinks

listOfDrinks do: [ :drink |self displayOnScreen: drink ]

Execution number

time to run (ms) Comments

1 1 lookup of #displayOnScreen (cache) and byte code interpretation

2 to 6 0,5 byte code interpretation

7 2 generation of native code for display and native code run

•8 to 999 0,01 native code run

1000 2 adaptive recompilation based on runtime type information, generation of native code and optimized native code run

1000 + 0,002 optimized native code run

Page 10: Sista: Improving Cog’s JIT performance

A look into webkit• Webkit Javascript engine

• LLInt = Low Level Interpreter

• DFG JIT = Data Flow Graph JIT

Page 11: Sista: Improving Cog’s JIT performance

Webkit JS benchs

Page 12: Sista: Improving Cog’s JIT performance

In Cog

Page 13: Sista: Improving Cog’s JIT performance

What is Sista ?

• Implementing the next JIT optimization level

Page 14: Sista: Improving Cog’s JIT performance

Goals

• Smalltalk performance

• Code readability

Page 15: Sista: Improving Cog’s JIT performance

Smalltalk performanceThe Computer Language Benchmarks Game

Page 16: Sista: Improving Cog’s JIT performance

Smalltalk performanceThe Computer Language Benchmarks Game

Page 17: Sista: Improving Cog’s JIT performance

Smalltalk performanceThe Computer Language Benchmarks Game

Smalltalk is 4x~25x slower than Java

Page 18: Sista: Improving Cog’s JIT performance

Smalltalk performanceThe Computer Language Benchmarks Game

Smalltalk is 4x~25x slower than Java

Our goal is 3x faster:1.6x~8x times slower than Java

+ Smalltalk features

Page 19: Sista: Improving Cog’s JIT performance

Code Readability

• Messages optimized by the bytecode compiler overused in the kernel

• #do: => #to:do:

Page 20: Sista: Improving Cog’s JIT performance

Adaptive recompilation

• Recompiles on-the-fly portion of code frequently used based on the current environment and previous executions

Page 21: Sista: Improving Cog’s JIT performance

Optimizing a method

Page 22: Sista: Improving Cog’s JIT performance

Example

Page 23: Sista: Improving Cog’s JIT performance

Example

• Executing #display: with over 30 000 different drinks ...

Page 24: Sista: Improving Cog’s JIT performance

Hot spot detector

• Detects methods frequently used

Page 25: Sista: Improving Cog’s JIT performance

Hot spot detector• JIT adds counters on machine code

• Counters are incremented when code execution reaches it

• When a counter reaches threshold, the optimizer is triggered

Page 26: Sista: Improving Cog’s JIT performance

Example

Can detect hot spot(#to:do: compiled inlined)

Cannot detect hot spot

Page 27: Sista: Improving Cog’s JIT performance

Hot spot detected

Over 30 000 executions

Page 28: Sista: Improving Cog’s JIT performance

Optimizer• What to optimize ?

Page 29: Sista: Improving Cog’s JIT performance

Optimizer• What to optimize ?

• Block evaluation is costly

Page 30: Sista: Improving Cog’s JIT performance

Optimizer• What to optimize ?

Stack frame#display:

Stack frame#do:

Stack growing downHot spot

detected here

Page 31: Sista: Improving Cog’s JIT performance

Optimizer• What to optimize ?

Stack frame#display:

Stack frame#do:

Hot spot detected

here

Page 32: Sista: Improving Cog’s JIT performance

Find the best method

Page 33: Sista: Improving Cog’s JIT performance

Find the best method

• To be able to optimize the block activation, we need to optimize both #example and #do:

Page 34: Sista: Improving Cog’s JIT performance

Inlining• Replaces a function call by its callee

Page 35: Sista: Improving Cog’s JIT performance

Inlining• Replaces a function call by its callee

• Speculative: what if listOfDrinks is not an Array anymore ?

Page 36: Sista: Improving Cog’s JIT performance

Inlining• Replaces a function call by its callee

• Speculative: what if listOfDrinks is not an Array anymore ?

• Type-feedback from inline caches

Page 37: Sista: Improving Cog’s JIT performance

Guard

• Guard checks: listOfDrinks hasClass: Array

• If a guard fails, the execution stack is dynamically deoptimized

• If a guard fails more than a threshold, the optimized method is uninstalled

Page 38: Sista: Improving Cog’s JIT performance

Dynamic deoptimization• On Stack replacement

• PCs, variables and methods are mapped.

Stack frame#display:

Stack frame#do:

Stack frame#optimizedDisplay:

Page 39: Sista: Improving Cog’s JIT performance

Optimizations• 1) #do: is inlined into #display:

Page 40: Sista: Improving Cog’s JIT performance

Optimizations• 2) Block evaluation is inlined

Page 41: Sista: Improving Cog’s JIT performance

Optimizations• 3) at: is optimized

• at: optimized ???

Page 42: Sista: Improving Cog’s JIT performance

At: implementation

• VM implementation

• Is index an integer ?

• Is object a variable-sized object, a byte object, ... ? (Array, ByteArray, ... ?)

• Is index within the bound of the object ?

• Answers the value

Page 43: Sista: Improving Cog’s JIT performance

Optimizations• 3) at: is optimized

• index isSmallInteger

• 1 <= index <= listOfDrinks size

• listOfDrinks class == Array

• uncheckedAt1: (at: for variable size objects with in-bounds integer argument)

Page 44: Sista: Improving Cog’s JIT performance

Restarting• On Stack replacement

Stack frame#display:

Stack frame#do:

Stack frame#optimizedDisplay:

Page 45: Sista: Improving Cog’s JIT performance

Our approach

Page 46: Sista: Improving Cog’s JIT performance

In-image Optimizer• Inputs

• Stack with hot compiled method

• Branch and type information

• Actions

• Generates and installs an optimized compiled method

• Generates deoptimization metadata

• Edit the stack to use the optimized compiled method

Page 47: Sista: Improving Cog’s JIT performance

In-image Deoptimizer• Inputs

• Stack to deoptimize (failing guard, debugger, ...)

• Deoptimization metadata

• Actions

• Deoptimize the stack

• May discard the optimized method

Page 48: Sista: Improving Cog’s JIT performance

Advantages

• Optimizer / Deoptimizer in smalltalk

• Easier to debug and program in Smalltalk than in Slang / C

• Editable at runtime

• Lower engineering cost

Page 49: Sista: Improving Cog’s JIT performance

Advantages

Stack frame#display:

Stack frame#do:

Stack frame#optimizedDisplay:

On Stack replacement done with Context manipulation

Page 50: Sista: Improving Cog’s JIT performance

Cons

• What if the optimizer is triggered while optimizing ?

Page 51: Sista: Improving Cog’s JIT performance

Advantages

• CompiledMethod to optimized CompiledMethod

• Snapshot saves compiled methods

Page 52: Sista: Improving Cog’s JIT performance

Cons

• Some optimizations are difficult

• Instruction selection

• Instruction scheduling

• Register allocation

Page 53: Sista: Improving Cog’s JIT performance

Critical optimizations

• Inlining (Methods and Closures)

• Specialized at: and at:put:

• Specialized arithmetic operations

Page 54: Sista: Improving Cog’s JIT performance

Interface VM - Image

• Extended bytecode set

• CompiledMethod introspection primitive

• VM callback to trigger optimizer /deoptimizer

Page 55: Sista: Improving Cog’s JIT performance

Extended bytecode set

• Guards

• Specialized inlined primitives

• at:, at:put:

• +, - , <=, =, ...

Page 56: Sista: Improving Cog’s JIT performance

Introspection primitive• Answers

• Type info for each message send

• Branch info for each conditional jump

• Answers nothing if method not in machine code zone

Page 57: Sista: Improving Cog’s JIT performance

VM callback

• Selector in specialObjectsArray

• Sent to the active context when hot spot / guard failure detected

Page 58: Sista: Improving Cog’s JIT performance

Stability

• Low engineering resources

• No thousands of human testers

Page 59: Sista: Improving Cog’s JIT performance

Stability

• Gemstone style ?

• Large test suites

Page 60: Sista: Improving Cog’s JIT performance

Stability

• RMOD research team

• Inventing new validation techniques

• Implementing state-of-the-art validators

Page 61: Sista: Improving Cog’s JIT performance

Stability goal

• Both bench and large test suite run on CI

• Anyone could contribute

Page 62: Sista: Improving Cog’s JIT performance

Anyone can contribute

Page 63: Sista: Improving Cog’s JIT performance

Anyone can contribute

Page 64: Sista: Improving Cog’s JIT performance

Stability goal

• Both bench and large test suite run on CI

• Anyone could contribute

Page 65: Sista: Improving Cog’s JIT performance

Status• A full round trip works

• Hot spot detected in the VM

• In-image optimizer finds a method to optimize, optimizes it an install it

• method and closure inlining

• Stack dynamic deoptimization

Page 66: Sista: Improving Cog’s JIT performance

Status: hot spot detector• Richards benchmark

• Cog: 341ms

• Cog + hot spot detector: 351ms

• 3% overhead on Richards

• Detects hot spots

• Branch counts ⇒ basic block counts

Page 67: Sista: Improving Cog’s JIT performance

Next steps• Dynamic deoptimization of closures is not

stable enough

• Efficient new bytecode instructions

• Stabilizing bounds check elimination

• lowering memory footprint

• Invalidation of optimized methods

• On stack replacement

Page 68: Sista: Improving Cog’s JIT performance

Thanks

• Stéphane Ducasse

• Marcus Denker

• Yaron Kashai

Page 69: Sista: Improving Cog’s JIT performance

Conclusion

• We hope to have a prototype running for Christmas

• We hope to push it to production in around a year

Page 70: Sista: Improving Cog’s JIT performance

Demo

...

Page 71: Sista: Improving Cog’s JIT performance

Questions