Alternative Languages on a JVM™
Dr. Cliff Click, Azul Systems
http://blogs.azulsystems.com/cliff
2
JVMs Are Used for More Than Just Java
> JVMs are used for lots of non-javac bytecodes
● Scala, Clojure, JRuby, Jython, JS/Rhino, etc.
> Bytecode patterns are different from javac's
> Bytecode producers are curious:
● Are these bytecodes “fast”? “Fast enough”?
● Inlining? JIT'ing to good code? (JIT'ing at all?)
> Goal: help the world escape the Java box
> A new language must have a reputation for being fast
● Or a large programmer population won't look
3
Can Your Language Go “To The Metal”?
> Can your bytecodes go “to the metal”?
● Is simple code mapped to simple machine ops?
> Methodology:
● Write a dumb, hot, SIMPLE loop in lots of languages
● Look at performance
● Look at the JIT'd code: spot the 'xor' and the 'divide'
● Look for mismatches between language, JVM, and JIT'd code
for( int i=1; i<100000000; i++ ) sum += (sum^i)/i;
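The slide's hot loop can be wrapped in a minimal Java harness like the following (my own sketch, not the talk's harness; the warm-up pass, the `long` accumulator, and the parameterized bound are my choices):

```java
public class HotLoop {
    // The benchmark kernel from the slide, parameterized by the loop bound.
    static long run(int n) {
        long sum = 0;
        for (int i = 1; i < n; i++) sum += (sum ^ i) / i;
        return sum;
    }

    public static void main(String[] args) {
        run(100_000_000);                 // warm-up pass so the JIT compiles the loop
        long t0 = System.nanoTime();
        long sum = run(100_000_000);
        long t1 = System.nanoTime();
        System.out.println("sum=" + sum + "  time=" + (t1 - t0) / 1_000_000 + " ms");
    }
}
```

The warm-up call matters: without it the timed pass measures interpreter plus compile time, not the JIT'd loop the talk is inspecting.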
4
Can Your Language Go “To The Metal”?
> NOT-GOAL: Fastest language
> NOT tested:
● Ease of programming, time-to-market, maintenance
● Domain fit (e.g. Ruby-on-Rails)
> IGNORING:
● Blatant language/microbenchmark mismatches
  (e.g. FixNum/BigNum issues)
for( int i=1; i<100000000; i++ ) sum += (sum^i)/i;
5
Can Your Language Go “To The Metal”?
> Tested: Java, Scala, Clojure, JRuby, JPC, JS/Rhino
> NOT-GOAL: Fastest language
> NOT fair:
● I changed the benchmark to be more “fair” to various languages
● e.g. using 'double' instead of 'int' for JavaScript
> Tested on the Azul JVM
● Because it has the best low-level JVM performance-analysis tools I've seen
6
Scorecard
> Java (unrolled, showing 1 iteration):

  add rTmp0, rI, 5        // for unrolled iter #5
  xor rTmp1, rSum, rTmp0
  div rTmp2, rTmp1, rTmp0
  add rSum, rTmp2, rSum
  ... repeat for unrolled iterations

> Scala – Same as Java! (*but see next slide)
> Clojure – Almost close
> JPC – Fascinating; pure-Java X86 emulation
> JRuby – Major inlining not happening
● Misled by a debugging flag?
> JS/Rhino – Death by 'Double' (not 'double')
7
Deeper Dive
> Java – Unfair advantage!
  Semantics match JIT expectations.
> Scala – Borrowed heavily from Java semantics
● Close fit allows close translation
● And thus nearly direct translation to machine code
● Penalty: only have Java semantics
  e.g. no 'unsigned' type, no auto-BigNum inflation
● “Elegant” Scala version is as bad as everybody else's
8
Deeper Dive
> JPC – Java “PC”; a pure-Java DOS 3.1 emulator
● Massive bytecode generation; LOTS of JIT'ing
● For this example: 16,000 classes, 7,800 compiles
● The JIT falls down removing redundant X86 flag sets
● Maybe fix by never storing 'flags' into memory?
● Also no fix-num issue
> Yes, I ran “Prince of Persia” and “Mario Brothers” on an Azul box. :-)
9
Deeper Dive
> Clojure – “almost close”
● Good: no obvious subroutine calls in the inner loop
● Bad: massive “ephemeral” object allocation –
  requires good GC
● But needs Escape Analysis to go really fast
● Ugly: fix-num overflow checks everywhere
● Turn off fix-nums: same speed as Java
● Weird “holes” –
● Not-optimized reflection calls here & there
● Can get reports on generated reflection calls
10
Deeper Dive
> Jython – “almost close”
● Also has the fix-num issue; massive allocation
● Some extra locks thrown in
> JavaScript/Rhino – “almost close”
● All things are Doubles – not 'double'
● Same fix-num allocation issues as Clojure, Jython
● Otherwise looks pretty good (no fix-num checks)
> JRuby – Missed The Metal
● Assuming CachingCallSite::call inlines
● Using -XX:+PrintInlining to determine
● But the flag is lying: it claims the method inlined, but it's not
11
Common Issue – Fix-Nums
> Huge allocation rate of “fix-nums”
● Allocation is not free
● Setting final fields costs a memory fence
● GC does NOT cost here
● i.e., object pooling would be a big loser
> Could do much better with the JIT
● Need “ultra-stupid” Escape Analysis
● Need some (easy) JIT tweaking
● e.g. getting around Integer.valueOf(i) caching
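The `Integer.valueOf(i)` caching in question is easy to observe from plain Java: the JLS guarantees a shared cache for boxed values in -128..127, while larger values allocate a fresh object on each call (with default JVM settings; the upper bound is tunable via `-XX:AutoBoxCacheMax`). The cache lookup is exactly the branch a JIT has to see through before it can treat boxed arithmetic as plain arithmetic:

```java
public class BoxingDemo {
    public static void main(String[] args) {
        // In the cached range: both calls return the same shared object.
        Integer a = Integer.valueOf(100), b = Integer.valueOf(100);
        // Outside the default cached range: each call allocates a new object.
        Integer c = Integer.valueOf(1000), d = Integer.valueOf(1000);
        System.out.println((a == b) + " " + (c == d)); // reference comparison
    }
}
```

With default settings this prints `true false`: identical small values share one object, identical large values do not.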
12
JRuby – Missed The Metal
> Each minor action dispatches via “call”
> Assuming CachingCallSite::call inlines
● (and allows further inlining)
> Using -XX:+PrintInlining to determine
> But the flag is lying: it claims inlined, but it's not
● BimorphicInlining is 'guessing' the wrong target
> Confirmed w/GDB on X86 Java 6
> Confirmed w/the Azul profiler
● My 1st impression: can't be inlined ever
13
What Happened to JRuby?
> Calls stitched together with a trampoline

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
14
What Happened to JRuby?
> Goal: inline 'targ' into the caller
> The JIT inlines the trampoline
● Folds up the complex lookup logic
● Further inlines “targ”

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
15
What Happened to JRuby?
> Code not analyzable
● Needs a profile
> No inlining during profiling
> Profiling confuses callers per call site

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
16
What Happened to JRuby?
> Code not analyzable
● Needs a profile
> No inlining during profiling
> Profiling confuses callers per call site
> C2 inlines 'call'
● No profile data
● No dominant target
● No attempt at guarded inlining

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
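The trampoline pattern on these slides can be reduced to a few lines of plain Java (my own illustration, not JRuby's actual code): every caller funnels through one shared dispatch site, so that single call site's type profile sees many different targets and the JIT has no dominant target to inline:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

public class Trampoline {
    // One shared dispatch table standing in for Ruby's dynamic method lookup.
    static final Map<String, IntUnaryOperator> METHODS = new HashMap<>();
    static {
        METHODS.put("add", x -> x + 1);
        METHODS.put("div", x -> x / 2);
        METHODS.put("neg", x -> -x);
    }

    // The "trampoline": from the JIT's point of view this body contains one
    // call site (applyAsInt) whose target varies per caller -- so the
    // per-call-site profile is megamorphic and guarded inlining gives up.
    static int call(String name, int arg) {
        return METHODS.get(name).applyAsInt(arg);
    }

    public static void main(String[] args) {
        // Three distinct callers, all funneled through the same dispatch site.
        int r = call("add", 41) + call("div", 84) + call("neg", -42);
        System.out.println(r);
    }
}
```

Inlining `call` itself does not help: the hot work is behind the inner `applyAsInt` site, which is exactly where the profile has been smeared across all callers.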
17
Lessons
> CAN take new languages to the metal (e.g. Scala)
> Easy to be misled (JRuby)
● Reliance on the non-QA'd debugging flag +PrintInlining
● The rules on inlining are complex and subtle
> But so much performance depends on it!
● And the exact details matter: class hierarchy, final,
  interface, single-target
> Language/bytecode mismatch – can't assume
● e.g. Uber GC or Uber Escape Analysis, or subtle final-field semantics
18
Lessons for Implementors – Inlining Matters
> So much performance depends on inlining!
● Details matter: class hierarchy, final, interface
> What do you know about JRockit, IBM, etc.?
> Heuristics change anyway, so
● what works today is dog-slow tomorrow
● unless your language hits the standard benchmark list
> Must manually confirm inlining
19
Suggestions for Language Implementors
> How do you know when your bytecodes are working well?
● No good tools for telling why the optimizer bailed out
● The same lack has plagued C++ and Fortran for the past 50 years
> 1 – Run a microbenchmark w/hot simple code & GDB it
● Why? So you can 'read' the assembly
● Then break in w/GDB and – read the assembly
● Correct problems
● Lather, rinse, repeat...
> 2 – Run your favorite JVM w/debugging flags
● But debugging flags can lie (!!!)
● Must cross-correlate with (1)
20
Round 2: Alternative Concurrency
> Clojure uses Software Transactional Memory
● Nothing shared by default
● All shared variables guarded by the STM
> Traveling Salesman Problem w/worker 'ants'
> Tried up to 600 'ants' on a 768-way Azul
● Great scaling (but 20 Gig/sec allocation)
21
Round 2: Alternative Concurrency
> Tried a contention microbenchmark
● Many CPUs bumping a shared counter
> Performance died – choked in the STM
> Less TOTAL throughput as more CPUs were added
● i.e., live-lock – adding CPUs made things worse
● No graceful fall-off under pressure
> JDK 6 library failure? Clojure usage-model failure?
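A plain-Java analogue of that contention microbenchmark (my own sketch using `AtomicLong`, not the talk's Clojure/STM code) makes the shape of the experiment concrete: N threads hammering one shared counter, so you can watch per-thread throughput fall as threads are added:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContendedCounter {
    static final AtomicLong COUNTER = new AtomicLong();

    // Run `nThreads` threads, each bumping the shared counter `perThread` times.
    static long run(int nThreads, long perThread) throws InterruptedException {
        COUNTER.set(0);
        Thread[] ts = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            ts[t] = new Thread(() -> {
                for (long i = 0; i < perThread; i++) COUNTER.incrementAndGet();
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return COUNTER.get();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[]{1, 2, 4, 8}) {
            long t0 = System.nanoTime();
            long total = run(n, 1_000_000);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println(n + " threads: " + total + " increments in " + ms + " ms");
        }
    }
}
```

The atomic version at least guarantees forward progress under contention; the talk's point is that the STM version retried transactions so aggressively that total throughput went *down* as CPUs were added.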
22
Lessons?
> Too little data (e.g. no Scala); just my speculation
> Clojure-style MIGHT work, might not
● A 600-way thread-pool works well on Java also
> NOT graceful under pressure
● Adding more CPUs should not be bad
● A throughput cap was expected, or maybe slow degradation –
  but rapid falloff!?!?
● Maybe STM becomes “graceful under pressure” later
● (but it's been 15 years and it's still not good)
23
Lessons?
> Note: complex locking schemes suffer the same way
● Example: a typical DB
● Add more concurrent DB requests, and DB throughput goes up
● ...then it peaks, then goes down,
● ...then craters
> Why doesn't the DB queue requests & maintain max throughput?
● (Maybe it does, but it's a hard problem to get right?)
24
Future Big Problem
> Fall-off Under Load is Bad (FULB™)
> Reliable performance “under pressure”
● Adding threads to a hot lock drops throughput some
● ...then throughput stabilizes as more threads are added
● But adding CPUs to a wait/notify loop peaks throughput
● ...then throughput falls constantly as most threads
  uselessly wake up & sleep
● All time is spent in context switching
> Everybody has a max-throughput saturation point
> Naively: just queue requests beyond the saturation point
25
Future Big Problem
> Fall-off Under Load is Bad (FULB™)
> Everybody has a max throughput
● Queue requests beyond the saturation point
> And maintain max throughput
> Publish that (now reliable) max throughput
> Predict performance as the min of the max-throughputs...
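The “queue requests beyond the saturation point” idea maps directly onto a fixed-size pool fed by a bounded queue. This is a minimal sketch under my own assumptions (pool size, queue depth, and the caller-runs back-pressure policy are illustrative choices, not from the talk):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SaturationGate {
    // Run `nTasks` requests through a pool capped at `workers` threads.
    // Requests past the saturation point wait in a bounded queue instead of
    // piling more contention onto the workers.
    static int processAll(int nTasks, int workers) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = new ThreadPoolExecutor(
            workers, workers, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(64),             // queue beyond saturation
            new ThreadPoolExecutor.CallerRunsPolicy() // back-pressure when full
        );
        for (int i = 0; i < nTasks; i++) {
            pool.submit(() -> { done.incrementAndGet(); });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(1000, 4));
    }
}
```

The point of the bounded queue plus back-pressure is exactly the slide's: concurrency never exceeds the saturation point, so measured max throughput stays flat instead of cratering as load grows.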
26
Summary
> Non-Java languages CAN run fast on a JVM
> But it takes care
● Especially around inlining
> Concurrent programming is NOT a solved problem
> But some promising alternatives are being explored
● Clojure's STM approach is unique
● Scala is taking another tack
> Reliable performance under load remains an issue

Dr. Cliff Click
cliffc@acm.org
http://blogs.azulsystems.com/cliff