Alternative Languages on a JVM™
Dr. Cliff Click, Azul Systems
http://blogs.azulsystems.com/cliff
2
JVMs Are Used for More Than Just Java
> JVMs are used for lots of non-javac bytecodes
● Scala, Clojure, JRuby, Jython, JS/Rhino, etc.
> Bytecode patterns are different from javac's
> Bytecode producers are curious:
● Are these bytecodes “fast”? “Fast enough”?
● Inlining? JIT'ing to good code? (JIT'ing at all?)
> Goal: help the world escape the Java box
> A new language must have a reputation for being fast
● Or a large programmer population won't look
3
Can Your Language Go “To The Metal”?
> Can your bytecodes go “to the metal”?
● Is simple code mapped to simple machine ops?
> Methodology:
● Write a dumb, hot, SIMPLE loop in lots of languages
● Look at performance
● Look at the JIT'd code: spot the 'xor' and the 'divide'
● Look for mismatches between language, JVM, and JIT'd code
for( int i=1; i<100000000; i++ ) sum += (sum^i)/i;
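The slide's hot loop can be wrapped in a minimal Java harness like the following (my own sketch, not the talk's harness; the warm-up pass, the `long` accumulator, and the parameterized bound are my choices):

```java
public class HotLoop {
    // The benchmark kernel from the slide, parameterized by the loop bound.
    static long run(int n) {
        long sum = 0;
        for (int i = 1; i < n; i++) sum += (sum ^ i) / i;
        return sum;
    }

    public static void main(String[] args) {
        run(100_000_000);                 // warm-up pass so the JIT compiles the loop
        long t0 = System.nanoTime();
        long sum = run(100_000_000);
        long t1 = System.nanoTime();
        System.out.println("sum=" + sum + "  time=" + (t1 - t0) / 1_000_000 + " ms");
    }
}
```

The warm-up call matters: without it the timed pass measures interpreter plus compile time, not the JIT'd loop the talk is inspecting.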
4
Can Your Language Go “To The Metal”?
> NOT-GOAL: Fastest language
> NOT tested:
● Ease of programming, time-to-market, maintenance
● Domain fit (e.g. Ruby-on-Rails)
> IGNORING:
● Blatant language/microbenchmark mismatches
  (e.g. FixNum/BigNum issues)
for( int i=1; i<100000000; i++ ) sum += (sum^i)/i;
5
Can Your Language Go “To The Metal”?
> Tested: Java, Scala, Clojure, JRuby, JPC, JS/Rhino
> NOT-GOAL: Fastest language
> NOT fair:
● I changed the benchmark to be more “fair” to various languages
● e.g. using 'double' instead of 'int' for JavaScript
> Tested on the Azul JVM
● Because it has the best low-level JVM performance-analysis tools I've seen
6
Scorecard
> Java (unrolled, showing 1 iteration):

  add rTmp0, rI, 5        // for unrolled iter #5
  xor rTmp1, rSum, rTmp0
  div rTmp2, rTmp1, rTmp0
  add rSum, rTmp2, rSum
  ... repeat for unrolled iterations

> Scala – Same as Java! (*but see next slide)
> Clojure – Almost close
> JPC – Fascinating; pure-Java X86 emulation
> JRuby – Major inlining not happening
● Misled by a debugging flag?
> JS/Rhino – Death by 'Double' (not 'double')
7
Deeper Dive
> Java – Unfair advantage!
  Semantics match JIT expectations.
> Scala – Borrowed heavily from Java semantics
● Close fit allows close translation
● And thus nearly direct translation to machine code
● Penalty: only have Java semantics
  e.g. no 'unsigned' type, no auto-BigNum inflation
● “Elegant” Scala version is as bad as everybody else's
8
Deeper Dive
> JPC – Java “PC”; a pure-Java DOS 3.1 emulator
● Massive bytecode generation; LOTS of JIT'ing
● For this example: 16,000 classes, 7,800 compiles
● The JIT falls down removing redundant X86 flag sets
● Maybe fix by never storing 'flags' into memory?
● Also no fix-num issue
> Yes, I ran “Prince of Persia” and “Mario Brothers” on an Azul box. :-)
9
Deeper Dive
> Clojure – “almost close”
● Good: no obvious subroutine calls in the inner loop
● Bad: massive “ephemeral” object allocation –
  requires good GC
● But needs Escape Analysis to go really fast
● Ugly: fix-num overflow checks everywhere
● Turn off fix-nums: same speed as Java
● Weird “holes” –
● Not-optimized reflection calls here & there
● Can get reports on generated reflection calls
10
Deeper Dive
> Jython – “almost close”
● Also has the fix-num issue; massive allocation
● Some extra locks thrown in
> JavaScript/Rhino – “almost close”
● All things are Doubles – not 'double'
● Same fix-num allocation issues as Clojure, Jython
● Otherwise looks pretty good (no fix-num checks)
> JRuby – Missed The Metal
● Assuming CachingCallSite::call inlines
● Using -XX:+PrintInlining to determine
● But the flag is lying: it claims the method inlined, but it's not
11
Common Issue – Fix-Nums
> Huge allocation rate of “fix-nums”
● Allocation is not free
● Setting final fields costs a memory fence
● GC does NOT cost here
● i.e., object pooling would be a big loser
> Could do much better with the JIT
● Need “ultra-stupid” Escape Analysis
● Need some (easy) JIT tweaking
● e.g. getting around Integer.valueOf(i) caching
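The `Integer.valueOf(i)` caching in question is easy to observe from plain Java: the JLS guarantees a shared cache for boxed values in -128..127, while larger values allocate a fresh object on each call (with default JVM settings; the upper bound is tunable via `-XX:AutoBoxCacheMax`). The cache lookup is exactly the branch a JIT has to see through before it can treat boxed arithmetic as plain arithmetic:

```java
public class BoxingDemo {
    public static void main(String[] args) {
        // In the cached range: both calls return the same shared object.
        Integer a = Integer.valueOf(100), b = Integer.valueOf(100);
        // Outside the default cached range: each call allocates a new object.
        Integer c = Integer.valueOf(1000), d = Integer.valueOf(1000);
        System.out.println((a == b) + " " + (c == d)); // reference comparison
    }
}
```

With default settings this prints `true false`: identical small values share one object, identical large values do not.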
12
JRuby – Missed The Metal
> Each minor action dispatches via “call”
> Assuming CachingCallSite::call inlines
● (and allows further inlining)
> Using -XX:+PrintInlining to determine
> But the flag is lying: it claims inlined, but it's not
● BimorphicInlining is 'guessing' the wrong target
> Confirmed w/GDB on X86 Java 6
> Confirmed w/the Azul profiler
● My 1st impression: can't be inlined ever
13
What Happened to JRuby?
> Calls stitched together with a trampoline

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
14
What Happened to JRuby?
> Goal: inline 'targ' into the caller
> The JIT inlines the trampoline
● Folds up the complex lookup logic
● Further inlines “targ”

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
15
What Happened to JRuby?
> Code not analyzable
● Needs a profile
> No inlining during profiling
> Profiling confuses callers per call site

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
16
What Happened to JRuby?
> Code not analyzable
● Needs a profile
> No inlining during profiling
> Profiling confuses callers per call site
> C2 inlines 'call'
● No profile data
● No dominant target
● No attempt at guarded inlining

  A() { x.call(“add”);    }  \
  B() { y.call(“div”);    }   >--> CachingCallSite::call --> class X { add()    {...
  C() { z.call(“search”); }  /                               class Y { div()    {...
                                                             class Z { search() {...
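The trampoline pattern on these slides can be reduced to a few lines of plain Java (my own illustration, not JRuby's actual code): every caller funnels through one shared dispatch site, so that single call site's type profile sees many different targets and the JIT has no dominant target to inline:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

public class Trampoline {
    // One shared dispatch table standing in for Ruby's dynamic method lookup.
    static final Map<String, IntUnaryOperator> METHODS = new HashMap<>();
    static {
        METHODS.put("add", x -> x + 1);
        METHODS.put("div", x -> x / 2);
        METHODS.put("neg", x -> -x);
    }

    // The "trampoline": from the JIT's point of view this body contains one
    // call site (applyAsInt) whose target varies per caller -- so the
    // per-call-site profile is megamorphic and guarded inlining gives up.
    static int call(String name, int arg) {
        return METHODS.get(name).applyAsInt(arg);
    }

    public static void main(String[] args) {
        // Three distinct callers, all funneled through the same dispatch site.
        int r = call("add", 41) + call("div", 84) + call("neg", -42);
        System.out.println(r);
    }
}
```

Inlining `call` itself does not help: the hot work is behind the inner `applyAsInt` site, which is exactly where the profile has been smeared across all callers.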
17
Lessons
> CAN take new languages to the metal (e.g. Scala)
> Easy to be misled (JRuby)
● Reliance on the non-QA'd debugging flag +PrintInlining
● The rules on inlining are complex and subtle
> But so much performance depends on it!
● And the exact details matter: class hierarchy, final,
  interface, single-target
> Language/bytecode mismatch – can't assume
● e.g. Uber GC or Uber Escape Analysis, or subtle final-field semantics
18
Lessons for Implementors – Inlining Matters
> So much performance depends on inlining!
● Details matter: class hierarchy, final, interface
> What do you know about JRockit, IBM, etc.?
> Heuristics change anyway, so
● what works today is dog-slow tomorrow
● unless your language hits the standard benchmark list
> Must manually confirm inlining
19
Suggestions for Language Implementors
> How do you know when your bytecodes are working well?
● No good tools for telling why the optimizer bailed out
● The same lack has plagued C++ and Fortran for the past 50 years
> 1 – Run a microbenchmark w/hot simple code & GDB it
● Why? So you can 'read' the assembly
● Then break in w/GDB and – read the assembly
● Correct problems
● Lather, rinse, repeat...
> 2 – Run your favorite JVM w/debugging flags
● But debugging flags can lie (!!!)
● Must cross-correlate with (1)
20
Round 2: Alternative Concurrency
> Clojure uses Software Transactional Memory
● Nothing shared by default
● All shared variables guarded by the STM
> Traveling Salesman Problem w/worker 'ants'
> Tried up to 600 'ants' on a 768-way Azul
● Great scaling (but 20 Gig/sec allocation)
21
Round 2: Alternative Concurrency
> Tried a contention microbenchmark
● Many CPUs bumping a shared counter
> Performance died – choked in the STM
> Less TOTAL throughput as more CPUs were added
● i.e., live-lock – adding CPUs made things worse
● No graceful fall-off under pressure
> JDK 6 library failure? Clojure usage-model failure?
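A plain-Java analogue of that contention microbenchmark (my own sketch using `AtomicLong`, not the talk's Clojure/STM code) makes the shape of the experiment concrete: N threads hammering one shared counter, so you can watch per-thread throughput fall as threads are added:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContendedCounter {
    static final AtomicLong COUNTER = new AtomicLong();

    // Run `nThreads` threads, each bumping the shared counter `perThread` times.
    static long run(int nThreads, long perThread) throws InterruptedException {
        COUNTER.set(0);
        Thread[] ts = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            ts[t] = new Thread(() -> {
                for (long i = 0; i < perThread; i++) COUNTER.incrementAndGet();
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return COUNTER.get();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[]{1, 2, 4, 8}) {
            long t0 = System.nanoTime();
            long total = run(n, 1_000_000);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println(n + " threads: " + total + " increments in " + ms + " ms");
        }
    }
}
```

The atomic version at least guarantees forward progress under contention; the talk's point is that the STM version retried transactions so aggressively that total throughput went *down* as CPUs were added.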
22
Lessons?
> Too little data (e.g. no Scala); just my speculation
> Clojure-style MIGHT work, might not
● A 600-way thread-pool works well on Java also
> NOT graceful under pressure
● Adding more CPUs should not be bad
● A throughput cap was expected, or maybe slow degradation –
  but rapid falloff!?!?
● Maybe STM becomes “graceful under pressure” later
● (but it's been 15 years and it's still not good)
23
Lessons?
> Note: complex locking schemes suffer the same way
● Example: a typical DB
● Add more concurrent DB requests, and DB throughput goes up
● ...then it peaks, then goes down,
● ...then craters
> Why doesn't the DB queue requests & maintain max throughput?
● (Maybe it does, but it's a hard problem to get right?)
24
Future Big Problem
> Fall-off Under Load is Bad (FULB™)
> Reliable performance “under pressure”
● Adding threads to a hot lock drops throughput some
● ...then throughput stabilizes as more threads are added
● But adding CPUs to a wait/notify loop peaks throughput
● ...then throughput falls constantly as most threads
  uselessly wake up & sleep
● All time is spent in context switching
> Everybody has a max-throughput saturation point
> Naively: just queue requests beyond the saturation point
25
Future Big Problem
> Fall-off Under Load is Bad (FULB™)
> Everybody has a max throughput
● Queue requests beyond the saturation point
> And maintain max throughput
> Publish that (now reliable) max throughput
> Predict performance as the min of the max-throughputs...
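The “queue requests beyond the saturation point” idea maps directly onto a fixed-size pool fed by a bounded queue. This is a minimal sketch under my own assumptions (pool size, queue depth, and the caller-runs back-pressure policy are illustrative choices, not from the talk):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SaturationGate {
    // Run `nTasks` requests through a pool capped at `workers` threads.
    // Requests past the saturation point wait in a bounded queue instead of
    // piling more contention onto the workers.
    static int processAll(int nTasks, int workers) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = new ThreadPoolExecutor(
            workers, workers, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(64),             // queue beyond saturation
            new ThreadPoolExecutor.CallerRunsPolicy() // back-pressure when full
        );
        for (int i = 0; i < nTasks; i++) {
            pool.submit(() -> { done.incrementAndGet(); });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(1000, 4));
    }
}
```

The point of the bounded queue plus back-pressure is exactly the slide's: concurrency never exceeds the saturation point, so measured max throughput stays flat instead of cratering as load grows.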
26
Summary
> Non-Java languages CAN run fast on a JVM
> But it takes care
● Especially around inlining
> Concurrent programming is NOT a solved problem
> But some promising alternatives are being explored
● Clojure's STM approach is unique
● Scala is taking another tack
> Reliable performance under load remains an issue

Dr. Cliff Click
cliffc@acm.org
http://blogs.azulsystems.com/cliff