Page 1: Transactional Runtime Extensions for Dynamic …web.engr.illinois.edu/~zilles/papers/python_on_htm.epham08.pdf · Transactional Runtime Extensions for Dynamic Language Performance

Transactional Runtime Extensions for Dynamic Language Performance

Nicholas Riley
[email protected]

Craig Zilles
[email protected]

Department of Computer Science
University of Illinois at Urbana-Champaign

ABSTRACT

We propose exposing best-effort atomic execution, as provided by a simple hardware transactional memory (HTM), in a managed runtime’s bytecode interface. Dynamic language implementations built on such a runtime can generate more efficient code, using speculation to eliminate the overhead and obstructions to optimization incurred by code needed to preserve rarely used language semantics. In this initial work, we demonstrate the applicability of this framework with two optimizations in Jython, a Python implementation for the Java virtual machine, that would yield speedups between 13% and 38% in the presence of HTM. Three additional optimizations, which we applied by hand to Jython-generated code, provide an additional 60% speedup for one benchmark.

1. INTRODUCTION

Dynamic languages such as Python and Ruby are increasingly popular choices for general-purpose application development. Programmers value the languages’ unique features, flexibility, pragmatic design and continued evolution; they are eager to apply them to a widening range of tasks. However, performance can be limiting: most programs written in these languages are executed by interpreters. Attempts to build faster, more sophisticated implementations (e.g., [5, 11, 15]) have been frustrated by the need to preserve compatibility with the languages’ flexible semantics.

We evaluate the effectiveness of speculatively executing Python programs, where possible, with a common case subset of these semantics which correctly handles most Python code. It is difficult to efficiently perform this speculation in software alone, but with hardware support for transactional execution, easier-to-optimize speculative code can execute at full speed. When behavior unsupported by the common case semantics is needed, our implementation initiates hardware-assisted rollback, recovery and nontransactional reexecution with full semantic fidelity. Overall, our approach facilitates improved dynamic language performance with minimal cost in implementation complexity when compared with software-only approaches.

Our framework consists of three components. First, an efficient hardware checkpoint/rollback mechanism provides a set of transactional memory (TM) primitives to a Java virtual machine (JVM). Second, the JVM exposes the TM’s capabilities as an interface for explicit speculation. Finally, dynamic language runtimes written in Java employ these extensions at the implementation language level in order to speculatively execute code with common case semantics. With hardware and JVM support for speculation, the dynamic language implementation doesn’t need to know a priori when the common case applies: it can simply try it at runtime.

We demonstrate this approach by modifying a JVM dynamic language runtime—Jython [2], a Python implementation—to generate two copies of code (e.g., Figure 1). First, a speculatively specialized version implements a commonly used subset of Python semantics in Java, which, by simplifying or eliminating unused data structures and control flow, exposes additional optimization opportunities to the JVM. Second, a nonspeculative version consists of the original implementation, which provides correct execution in all cases. Execution begins with the faster, speculative version. If speculative code encounters an unsupported condition, an explicit transactional abort transfers control to the nonspeculative version. As a result, the JVM does not need dynamic language-specific adaptations, and high-performance dynamic language implementations can be written entirely in languages such as Java and C#, maintaining the properties of safety, security and ease of debugging which distinguish them from traditional interpreters written in C. Even without execution feedback to the dynamic language implementation, transactional speculation leads to significant performance gains in single-threaded code. In three Python benchmarks run on Jython, a few simple speculation techniques improved performance between 13% and 38%.

The remainder of this paper is organized as follows. Section 2 introduces the challenges of dynamic language runtime performance. Section 3 describes the components of our framework and the optimizations we implemented in Jython, section 4 presents our experimental method and section 5 discusses the performance results.


[Figure 1 content: the Python source (y = g(x); z = 2*y) is compiled two ways. The nonspeculative implementation resolves g and x with frame.getglobal hash lookups, allocates a PyFrame, fetches the Python thread state and dispatches through g.func_code.call before executing the function body g$1(f) and storing the result with frame.setlocal. The specialized speculative implementation executes y = g_int$1(global$x); z = 2*y inside a begin/commit transaction pair, aborting to recovery code if the function or a global resolution is modified, an exception occurs or debugging is requested.]
Figure 1: Potential transactional memory-aided speculative specialization of Python code.

2. BACKGROUND AND RELATED WORK

Python, Ruby, Perl and similar dynamic languages differ substantially from historically interpreted counterparts such as Java and Smalltalk in that large portions of their basic data structures and libraries are written in the runtime’s implementation language, typically C, rather than in the dynamic languages themselves. This choice substantially improves the performance and practicality of these languages even as they remain interpreted.

However, to get the best performance out of interpreted dynamic language implementations, programmers must ensure that as much work as possible takes place in native code. As a result, interpreter implementation details can inspire less-than-optimal programming practices motivated by performance. For example, in the CPython interpreter, function calls are relatively expensive, so performance-critical inner loops are typically written to minimize them. A more sophisticated runtime could perform inlining and specialization. Or, to sort a list by an arbitrary key, Perl and Python have adopted an idiom known as “decorate-sort-undecorate” in which each list item is transformed to and from a sublist of the form (sort key, item), so the list can be sorted entirely from compiled code instead of repeatedly invoking an interpreted comparator.
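The decorate-sort-undecorate idiom can be illustrated with a minimal sketch (the word list and length key are examples, not taken from the paper):

```python
# Decorate-sort-undecorate: sort a list of words by length without
# invoking an interpreted comparator for every pairwise comparison.
words = ["banana", "fig", "cherry"]

# Decorate: pair each item with its sort key.
decorated = [(len(w), w) for w in words]
# Sort: tuple comparison runs entirely in the compiled sort routine.
decorated.sort()
# Undecorate: strip the keys back off.
result = [w for (_, w) in decorated]
# result == ["fig", "banana", "cherry"]
```

Modern Python largely obsoletes the idiom with the `key=` argument to `sort`, which performs the decoration internally.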

To reduce the need for such workarounds and improve execution performance in general, an obvious next step in dynamic language evolution is uniting the language and its implementation on a single platform which can aggressively optimize both. New dynamic language platforms do this in one of two ways. Some write the language implementation and/or its libraries in a common language, and build a dynamic language-specific runtime to execute it: for example, Rubinius [12] executes Ruby on a Smalltalk-style VM; PyPy [15] implements a translation infrastructure that can transform a Python implementation written in a subset of Python into an interpreter or naïve JIT, which is then compiled for one of several platforms. While these systems are written targeting dynamic language semantics, they must also incorporate dynamic optimization techniques and native code interfaces, which represents a nontrivial duplication of effort.

The other option is building on an existing, mature platform: the Microsoft Common Language Runtime (CLR) or JVM. On these platforms, dynamic language code is translated into CLR/JVM bytecode, which references implementation components written in a platform-native language (C#/Java). Dynamic language implementations must generate code and maintain data structures to express the languages’ rich semantics in terms of more limited CLR/JVM functionality, even for such basic operations as function invocation. This semantic adaptation layer acts as a barrier to optimization by the CLR/JVM, hindering performance.

One example of Python’s flexible semantics is method dispatch, which in Jython involves call frame object instantiation and one or more hash table lookups (new PyFrame(...) and frame.getglobal calls, respectively, as shown in Figure 1). Dispatch must also simulate Python’s multiple inheritance of classes on Java’s single inheritance model [18]. Each of these operations is significantly more costly than dynamic dispatch in Java.

The CLR/JVM platforms could be extended with dynamic language-specific primitives to reduce the cost of adaptation. But selecting the right set of primitives is challenging, as each dynamic language has its own unique semantics. We have observed that the majority of a typical dynamic language program doesn’t exploit its language’s full semantic flexibility. In fact, existing CLR/JVM platform primitives can directly encode common uses of most dynamic language features, significantly improving performance. However, without a way to identify which operations can be so encoded without observing them during execution—which would require hooks into the CLR/JVM internals—dynamic language implementations cannot unconditionally generate fast code for the common case.

While adapting the CLR/JVM to accommodate individual dynamic languages’ semantics would be inadvisable, some general-purpose VM additions can simplify dynamic language implementation. For example, some CLR-based dynamic languages are hosted on a layer called the Dynamic Language Runtime (DLR) [4]. The DLR implements a common dynamic language type system overlayed upon, but not interoperable with, the underlying CLR type system, and code generation machinery facilitating a primitive form of runtime specialization. A DynamicSite facility [8] builds runtime code replacement on two CLR features: delegates (akin to function pointers) and Lightweight Code Generation (low-overhead runtime code generation).

Such micro-scale specialization effectively reduces dynamic dispatch overhead but does not substantially eliminate the need for a dynamic language-specific feedback-directed optimization infrastructure. In contrast, our framework for optimistic specialization exposes optimization opportunity to the VM, without requiring that the dynamic language implementation react to execution feedback. The dynamic language-targeted facilities of the CLR and corresponding forthcoming JVM features (e.g., invokedynamic, anonymous classes and method hotswapping [16, 17]) are in fact largely complementary to our work, allowing it to be extended to permit runtime adaptation to misspeculation.

3. A FRAMEWORK FOR COMMON CASE SPECULATION

We now present our framework for common case speculative optimization and two specific Jython optimizations we implemented on this framework.

3.1 Transactional memory support

Our framework employs a relatively simple, best-effort HTM implementation with support for explicit aborts and redirecting control flow upon an abort. We assume a collapsed transaction nesting model, where beginning a transaction essentially increments a counter, ending a transaction decrements it, and transaction state commits when the counter reaches zero.

The TM system provides software with a guarantee of atomicity and isolation. An atomic region either commits successfully or rolls back, undoing any changes made since the region began. The HTM makes its best effort to execute an atomic region given constraints such as fixed buffer sizes. If outstanding transactional data exceeds the capacity of the hardware to buffer it, or in other unsupported circumstances (e.g., I/O and system calls), the atomic region is aborted [10]. The specific HTM primitives we use are:

Begin transaction (abort PC). Take a checkpoint and begin associating register and memory accesses with the transaction. The abort PC argument is saved to be used when redirecting control flow in the event the region aborts.

Commit transaction. End the region and atomically commit changes.

Abort transaction. Discard changes by reverting to the saved checkpoint, then jump to the abort PC.
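The three primitives and the collapsed nesting model can be sketched in software as follows. This is an illustrative model only: memory is a dict, the checkpoint a shallow copy, and the class and method names are hypothetical (a real HTM checkpoints registers and buffers cache lines).

```python
# Software sketch of the HTM primitives with collapsed ("flattened")
# nesting: only the outermost begin takes a checkpoint, and state
# commits when the nesting counter returns to zero.
class FlatNestedHTM:
    def __init__(self, memory):
        self.memory = memory
        self.depth = 0            # collapsed nesting counter
        self.checkpoint = None
        self.abort_pc = None

    def begin(self, abort_pc=None):
        if self.depth == 0:       # outermost begin: checkpoint + save abort PC
            self.checkpoint = dict(self.memory)
            self.abort_pc = abort_pc
        self.depth += 1

    def commit(self):
        self.depth -= 1
        if self.depth == 0:       # counter reaches zero: changes are permanent
            self.checkpoint = None

    def abort(self):
        self.memory.clear()       # revert to the saved checkpoint...
        self.memory.update(self.checkpoint)
        self.depth = 0
        return self.abort_pc      # ...and transfer control to the abort PC

mem = {"x": 1}
htm = FlatNestedHTM(mem)
htm.begin(abort_pc="recovery")
htm.begin()                       # nested begin: just bumps the counter
mem["x"] = 99
target = htm.abort()              # discards the write
# mem["x"] == 1 again; target == "recovery"
```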

3.2 JVM speculation interface

Our proposed JVM support for explicit speculation constructs resembles existing structured exception mechanisms: the speculative code is analogous to the try block, assumption violation to throwing an exception, and nonspeculative code to an exception catch. As a result, we expose an interface for explicit speculation in Java source and JVM bytecode with a simple extension to the Java exception model, as shown in Figure 2. The changes do not affect the Java bytecode interface or compiler, which simplifies the implementation and maintains compatibility with existing tools. Instead, changes were confined to the Java class library (to add a new exception class, methods for introspecting abort state and region invalidation) and the JVM itself (to output instructions recognized by the HTM and implement atomic region invalidation).

A speculative atomic region (1) and its fallback code (3), which encompasses both the recovery code and original, unspecialized code, are wrapped in a standard JVM exception block and a new exception class. Throwing an exception of this class triggers the JVM to emit a HTM abort instruction (2). In fact, exceptions of any class raised within an atomic region will cause the region to abort. Transforming any thrown exception into an abort is advantageous not only because it reduces code size by allowing exception handlers to be eliminated from speculative regions, but because existing JVM exception behavior (e.g., converting a processor exception on a null pointer dereference into a Java exception) can be a more efficient substitute for explicitly checked assertions, eliminating the corresponding checks from the speculative path.
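The try/catch shape of the interface can be sketched in Python terms (SpeculationError, run_speculatively and the two paths are hypothetical stand-ins, not the paper's Java API):

```python
# Python analog of the exception-based speculation interface: the
# speculative region is a try block, a violated assumption raises,
# and the handler runs recovery plus the nonspeculative code.
class SpeculationError(Exception):
    pass

def run_speculatively(fast_path, slow_path):
    try:
        return fast_path()        # (1) speculative atomic region
    except Exception:             # (2) any raised exception aborts the region
        return slow_path()        # (3) fallback: recovery + nonspeculative code

def fast():
    raise SpeculationError("assumption violated")

result = run_speculatively(fast, lambda: "nonspeculative result")
# result == "nonspeculative result"
```

Note that the handler catches any exception, mirroring the paper's point that all exceptions raised inside an atomic region abort it.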

A critical advantage provided by speculative atomic regions is in expanding the scope with which the JVM can perform optimizations. Not only can the specialized code employ simpler data structures, repetitive control flow checking for a particular condition can be replaced by a single guard. Just as with JVM-delimited regions [10], the JVM can eliminate such redundant checks in regions of explicit speculation.

Speculative regions may be generated under the assumption that certain global state remains unchanged (e.g., that a function is not redefined). The speculation client code must keep track of these regions with their corresponding assumptions, and emit code which invalidates these regions when necessary.1 It would of course be possible for the code to check inside each speculative region whether any of the global state upon which it depends has changed, but it is more efficient to use a code bottleneck (a) which can invalidate, or unspecialize, a region while it is not currently executing. To perform unspecialization, the beginning of the try block is replaced with an unconditional branch to nonspeculative code, potentially pending generation of replacement speculative code that handles the new situation. In a catch block which is executing because the region has been unspecialized, the exception’s value is null and the recovery code does not execute.
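The bookkeeping described above can be sketched as an assumption-to-region map; the class and names below are illustrative, not the paper's implementation (which indexes regions through the method exception table):

```python
# Sketch of the invalidation "bottleneck": each speculative region
# registers the assumptions it depends on, and invalidating one
# assumption unspecializes every dependent region.
class RegionTable:
    def __init__(self):
        self.regions = {}          # region id -> {"valid": bool}
        self.dependents = {}       # assumption -> [region ids]

    def register(self, region_id, assumptions):
        self.regions[region_id] = {"valid": True}
        for a in assumptions:
            self.dependents.setdefault(a, []).append(region_id)

    def invalidate_assumption(self, assumption):
        # In the JVM this is where the start of the try block would be
        # patched with a branch to the nonspeculative code.
        for rid in self.dependents.pop(assumption, []):
            self.regions[rid]["valid"] = False

table = RegionTable()
table.register("f$1", ["global g unchanged", "global x not deleted"])
table.invalidate_assumption("global x not deleted")
# table.regions["f$1"]["valid"] is now False
```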

3.3 Dynamic language specialization

With our framework, speculative regions are delineated in advance of execution—the language implementation doesn’t need to collect and process runtime feedback to discover where speculation will be successful. The dynamic language compiler also need not include explicit compensation code which can recover from an exceptional condition encountered at any point in execution of a speculative region. When the atomic region aborts, it reverts any changes it made and redirects control flow to a recovery code block, which reacts to the reason for the abort before resuming with the corresponding nonspeculative code as if speculation had never been attempted. If an atomic region aborts because of a condition that should result in a dynamic language exception, the nonspeculative code executed afterward accumulates the necessary exception state.

1 Our implementation exposes references to atomic regions as an index into a method’s exception table, though we plan to implement a lighter-weight, opaque identifier in the future which is better suited for lightweight runtime code generation and the JVM security model.

[Figure 2 content: Java source containing a try block (the atomic region, labeled L1) with the speculative implementation and a precondition check that throws SpeculationException, and a catch block holding the recovery code (run only when the exception is non-null) and the nonspeculative implementation. The JVM lowers this to begin transaction (with an abort PC), an explicit abort transaction on a failed precondition, and commit transaction for the transactional memory system; invalidating region L1 replaces its entry with an unconditional branch to the nonspeculative path.]

Figure 2: Java language support for HTM-supported speculative specialization.

Our Jython modifications consist of three parts: a general-purpose mechanism which forks code generation into nonspeculative and speculative versions, a mechanism for keeping track of assumptions of global state (as discussed in the previous section) and the individual optimizations themselves. The optimizations’ general form is shown in Figure 3. Common case semantics are statically determined, speculative regions are created by extracting the corresponding implementation, assertions are inserted to ensure the common case scenario is indeed in effect and recovery code is constructed by testing for and reacting to the violation of each assertion in the aborted atomic region, then propagating the violation to perform unspecialization of other affected regions if needed.

One of our optimizations involves a simple modification tothe Jython library code; the other includes both compilerand library changes.

3.3.1 Dictionary (HashMap) synchronization

The Python dictionary data structure provides an unordered mapping between keys and values. Dictionaries are frequently accessed in user code, serve as the default storage for object attributes (e.g., instance variables) and form the basis of Python namespaces in which both code and global data reside. Fast dictionary lookups are essential to Python performance, as nearly every Python function call, method invocation and global variable reference involves one or more dictionary lookups.
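The pervasiveness of dictionaries is easy to demonstrate from Python itself (a generic illustration, not from the paper):

```python
# Instance attributes are backed by a per-object dictionary: the
# assignment in __init__ is effectively self.__dict__["x"] = x.
class Point:
    def __init__(self, x):
        self.x = x

p = Point(3)
attrs = p.__dict__
# attrs == {"x": 3}; reading p.x performs a lookup in this dictionary
```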

Both CPython [6] and Jython’s dictionary implementations include specializations tailored to particular use cases. For example, the keys of object dictionaries (object.__dict__) are nearly always strings representing attribute names. Jython’s PyStringMap optimizes string-keyed lookup by requiring that the keys be interned when stored, so a string hash computation need not be performed during lookup, and by permitting Java strings to be directly used as keys, rather than being wrapped in an adapter implementing Python string semantics.

[Figure 4 content: bar chart of speedup (0.8–1.4) versus a synchronized HashMap for HashMap and ConcurrentHashMap variants on the pystone, richards and pyparsing benchmarks.]

Figure 4: Relaxing correctness and synchronization constraints on Python dictionaries. Results obtained with HotSpot; consult section 4 for configuration details.

Python dictionary accesses must be performed atomically. Jython enforces this constraint by using a synchronized HashMap (hash table) data structure for both general-purpose PyDictionary and special-purpose PyStringMap dictionaries. A single lock protects against simultaneous access at a cost of 8–27% in overall runtime (Figure 4).

Java 1.5 introduced a ConcurrentHashMap implementation [7] which optimizes lock usage. While replacing the synchronized HashMap with a ConcurrentHashMap improves performance overall, the implementation has two drawbacks. First, even on a single-core machine it is slower: on pystone, the naïve single-lock model performs better. Second and more seriously, ConcurrentHashMap does not conform to Python semantics; in particular, while a Python dictionary’s contents remain unmodified, repeated attempts to iterate through it must return key-value pairs in the same order. With ConcurrentHashMap, the iteration order may change under some access patterns even though the dictionary contents do not.

[Figure 3 content: Jython generated code containing an atomic region (the specialized common case path, bracketed by begin and commit, with guards/assertions) alongside the original unspecialized code and the Jython library functions both call. On abort, recovery code extracts abort information, transforms data structures and unspecializes related regions.]

Figure 3: Transactional memory-aided speculative specialization of Jython code.

By using hardware atomic regions for speculative lock elision (SLE) [13], uncontended access to a synchronized HashMap can perform as well as the unsynchronized version. Synchronization overhead can be eliminated entirely when accesses are already subsumed by a larger atomic region, as will usually be the case.2
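The SLE pattern can be sketched as follows; TransactionAborted, with_lock_elision and run_transactionally are hypothetical stand-ins for the hardware path, not real APIs:

```python
# Sketch of speculative lock elision: run the critical section
# transactionally without acquiring the lock; if the transaction
# aborts (e.g., due to a conflict), fall back to taking the lock.
import threading

class TransactionAborted(Exception):
    pass

def with_lock_elision(lock, critical_section, run_transactionally):
    try:
        # Fast path: the lock is only read, never written, so
        # uncontended threads need not serialize on it.
        return run_transactionally(critical_section)
    except TransactionAborted:
        with lock:  # slow path: conventional lock acquisition
            return critical_section()

# Usage with a stand-in "HTM" that always succeeds:
counter = {"n": 0}
def bump():
    counter["n"] += 1
    return counter["n"]

lock = threading.Lock()
result = with_lock_elision(lock, bump, lambda f: f())
# result == 1
```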

3.3.2 Caching globals

We next chose to address the overhead associated with accessing Python module-level globals (typically defined at the top level of a source file) and “builtins” (basic type names, constants, functions and exceptions such as int and None, which would be keywords in many other languages). While local variable references are resolved at compilation time (typically as an array access), global and builtin name references are looked up in a dictionary at runtime [19].3

Jython compiles each module (usually corresponding to a source file) to a Java class. It converts the Python code,

    def f():
        g(x)

representing an invocation of a module function g from another function f with a module global variable x as a parameter, into Java as:

2 The results in Figure 4 were obtained on a single-core machine; the speedup obtained using SLE is even greater on a multicore machine, where the JVM’s uniprocessor optimizations do not apply.
3 Since globals shadow builtins, a builtin resolution involves one unsuccessful (global) and one successful (builtin) hash probe.

    private static PyObject f$1(PyFrame frame) {
        frame.getglobal("g").__call__(frame.getglobal("x"));
    }

While the externally visible Python namespace representation must remain a dictionary for compatibility, its internal representation can be speculatively optimized for performance by exploiting the characteristic access pattern of dictionaries used for instance variable and namespace storage. Early in their lifetime, these dictionaries are populated with a set of name-value pairs. Thereafter, the values may change but the set of names is unlikely to do so.

To take advantage of this behavior, we modified the Jython compiler to keep track of global and builtin references as they are compiled, emit corresponding static variable declarations, and cache the global and builtin values in these static variables during module loading. With this optimization, the code becomes:

    private static PyObject global$g, global$x;
    private static PyObject f$1(PyFrame frame) {
        try {
            global$g.__call__(global$x);
        } catch (SpeculationException e) {
            frame.getglobal("g").__call__(frame.getglobal("x"));
        }
    }

We additionally subclass PyStringMap with a version that redirects reads and writes of the module dictionary’s g and x keys to the corresponding static fields. This does slow down access through this dictionary by unspecialized code (specialized code uses the fields directly), but since such accesses are both infrequent and dominated by a hash table lookup, this is a reasonable tradeoff.
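The redirecting-dictionary idea can be sketched in Python; CachingNamespace and its cell representation are illustrative analogs of the PyStringMap subclass and the Java static fields, not the paper's code:

```python
# Illustrative analog of the redirecting namespace: reads and writes
# of cached names go to per-name cells (standing in for static
# fields); every other key uses ordinary dictionary storage.
class CachingNamespace(dict):
    def __init__(self, cells):
        super().__init__()
        self.cells = cells  # name -> single-element list (a "static field")

    def __setitem__(self, key, value):
        if key in self.cells:
            self.cells[key][0] = value  # specialized code reads this cell
        else:
            super().__setitem__(key, value)

    def __getitem__(self, key):
        if key in self.cells:
            return self.cells[key][0]
        return super().__getitem__(key)

cells = {"x": [None]}
ns = CachingNamespace(cells)
ns["x"] = 42   # redirected: visible to "specialized" readers of the cell
ns["y"] = 1    # ordinary dictionary storage
# cells["x"][0] == 42 and ns["x"] == 42
```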

Similarly, attempts to delete x must be trapped. In this case, we do not simply redirect the deletion but invalidate the atomic region in f$1, because there is no way to represent the semantics of Python deletion with a Java static field. If, for example, we used the Java null value to represent a deleted variable, dereferencing a null global$g would generate a NullPointerException, but passing null to g would convert it into a Python None object instead of producing the Python NameError required by Python semantics when an undefined variable is accessed.

3.3.3 Frame specialization

Python allows extensive introspection of executing program state, primarily for the purposes of debugging and exception handling [20]. While these facilities differ little from those of languages, including Java itself, for which dynamic compilation replaced interpretation, Jython cannot rely on the JVM to reconstruct Python execution state when it is requested, so it must maintain the state itself during execution (VM extensions may offer alternate approaches, e.g. [3]). This entails a significant amount of wasted work in the common case, when this state is never accessed.

Jython maintains the Python execution stack as a linked list of frame (PyFrame) objects, one of which is passed to the Java method generated from each Python function or method. Frame objects expose local variables including parameters, global variables, exception information and a trace hook for debuggers [21]; a simplified version of the PyFrame operations performed on a Python function call appears in Figure 1. An additional attribute represents the current Python bytecode index, though this is infrequently accessed and Jython does not maintain it.

While we have not yet implemented any automatic frame specializations in the Jython compiler, some of our early results from frame elimination are presented in section 5.1.

4. EXPERIMENTAL METHOD

Our goal in these experiments was to quickly explore the performance potential of dynamic language runtime optimizations enabled by transactional execution. We executed three Python benchmarks, pystone, richards and pyparsing, on a real machine while simulating the effects of a HTM.

The first two of these benchmarks are commonly used to compare Python implementations: pystone [1] consists of integer array computation written in a procedural style; richards [22] is an object-oriented simulation of an operating system. Both are straightforwardly written but non-idiomatic Python, ported from Ada and Java, respectively; we expected them to uniformly exhibit common case behavior. In contrast, pyparsing [9] uses Python-specific features to implement a domain-specific language for recursive descent parsers; we chose it as an example of code which is both potentially performance-critical and exploits the language’s semantics.

Speedups gained from common case speculation rely on invalidation being an infrequent occurrence. While none of the benchmarks’ code violated any specialization preconditions, pyparsing uses exceptions as a form of control flow; Python exceptions inside specialized code cause the containing atomic region to abort. Since pystone and richards did not raise exceptions, no conflict-related transactional aborts would occur. Aborts would instead stem from context switches, I/O and system calls. Of the benchmarks, only pyparsing performs any I/O from Python code, and the functions containing I/O already executed unspecialized because they indirectly raise exceptions.

[Figure 5 content: a plot of valid atomic regions over execution time; the count drops during JIT warmup and specialization, then levels off for the remainder of execution.]

Figure 5: Typical atomic region invalidation profile.

Because the atomic regions in our benchmarks thus divide cleanly into “always abort” and “never abort” groups, we can approximate the steady state performance—i.e., performance after no more invalidations occur—of a single-threaded workload on a HTM-enabled JVM with a real machine without HTM, akin to measuring timings after JIT warmup. We accomplished this by modifying the compiler’s output as if the “always abort” regions had been permanently invalidated, so that the original code always executed.

As we had no capability to roll back execution in these real-machine experiments, we configured the JVM exception handlers associated with atomic regions to abort the process if triggered. With an actual HTM, the exception handlers would execute recovery code instead; subsequent executions would take the nontransactional fallback path. Recovery and nonspeculative code was still present in the running system, such that any performance loss attributable to code size growth should be accounted for by these results.

For the optimizations presented thus far, specialization is performed once, in the Jython compiler. As code which violates specialization assumptions is encountered during execution, the corresponding regions are permanently invalidated. Many performance-critical dynamic language applications are long-running servers, whose invalidation profiles should resemble Figure 5. Most invalidations would occur early in execution, as code is first encountered and the JIT is still actively compiling.

Just as “server” JITs’ aggressive compilation strategies slow startup, early invalidation-related slowdowns are likely to be an acceptable tradeoff for higher sustained performance. In practice, HTM rollback latency would be unlikely to matter: the hardware rollback time would be overshadowed by recovery and fallback code execution. It would be possible to construct pathological cases in which region invalidation occurred frequently throughout execution, thus making the recovery path performance-critical; however, for ordinary programs (and our benchmarks) this is extremely unlikely.

We selected the two JVMs which execute Jython the fastest: BEA JRockit (build R27.3.1-1-85830-1.6.0_01-20070716-1248-linux-ia32) and Sun HotSpot Server (build 1.6.0_02-b05). The JVM extensions we propose are not specific to any particular implementation; with the exception of honoring invalidation requests in recovery code, the JVM is permitted to execute any combination of specialized and unspecialized code for optimal performance.

Experiments were run on an Intel Core 2 Duo E6700 with Linux kernel 2.6.9-55.0.12.EL, using a single core in 32-bit mode. We used Jython’s “newcompiler” and modern branch (r4025). To account for JVM warmup and execution variation, Jython speedups are based on the minimum runtime of ten benchmark executions per session over five JVM execution sessions.
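The minimum-of-runs methodology can be sketched generically as follows; the harness and toy workload are illustrative stand-ins, not the paper's benchmark driver (which runs each session in a fresh JVM):

```python
# Take the minimum over repeated runs, across several sessions, to
# approximate peak (post-warmup) performance while discarding warmup
# and scheduling noise.
import time

def best_runtime(workload, runs_per_session=10, sessions=5):
    best = float("inf")
    for _ in range(sessions):          # each session: a fresh VM in the paper
        for _ in range(runs_per_session):
            start = time.perf_counter()
            workload()
            best = min(best, time.perf_counter() - start)
    return best

elapsed = best_runtime(lambda: sum(range(1000)))
# elapsed is the fastest observed run, in seconds
```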

For comparison, we include results from the CPython 2.4.3 interpreter and Psyco 1.5.1. Psyco, a just-in-time specializer for CPython [14], executes these benchmarks faster than any other Python implementation, at the expense of imperfect language compatibility and potentially unbounded memory use. As Psyco uses a simple template-based code generator, the Psyco results represent a lower bound on the performance of a Python-specific just-in-time compiler.

One important and unexplored question regarding these optimizations is whether atomic region boundaries, representing explicit speculation's scope, can be established during code generation (for example, at all function boundaries and loop back edges) or must be adjusted at runtime for optimal performance. In order to answer this question, we plan to measure transaction footprint and abort behavior with a full-system HTM simulation. Our previous work with dynamic language runtimes in simulated HTM environments, including a transactional Psyco executing the same benchmarks we present here, has not identified transactional cache overflows as a common issue. While it is possible that Java's memory access patterns may differ enough to affect the transaction footprint, existing work using hardware atomic regions for specialization in Java suggests that careful tuning of region sizes is not essential [10].

5. RESULTS

We express Jython performance with and without speculative optimizations as speedups relative to interpreted CPython execution, as this is the baseline which would make the most sense to Python users. Performance differs substantially between the tested JVMs, whose just-in-time compilers were independently implemented; HotSpot always outperforms JRockit except in pyparsing.

Figure 6 includes the automated speculative optimizations we implemented in Jython, as described in Section 3.3: speculative lock elision for dictionary access (HashMap) and elimination of hash table lookups in global and builtin access (cache_globals). Performance improvement varies according to the nature of each benchmark, which we now discuss in turn.

pystone contains primarily computation and function dispatch. Jython's baseline pystone performance already exceeds CPython's because numerical operations are encoded efficiently as their Java equivalents. Python function calls' flexible semantics are costly to implement in Jython, as they are in CPython, but because Python functions are declared as module-level globals, caching these values eliminates hash table lookups from each function invocation.
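The caching idea can be sketched as follows. The GlobalCache helper below is hypothetical (Jython's actual generated code differs): the module-level binding is looked up once, and subsequent loads reuse the cached reference until the binding site invalidates the cache.

```java
import java.util.Map;

class GlobalCache {
    private final Map<String, Object> globals;  // module dictionary
    private final String name;
    private Object cached;
    private boolean valid = false;

    GlobalCache(Map<String, Object> globals, String name) {
        this.globals = globals;
        this.name = name;
    }

    // Fast path: once cached, no hash table lookup per call.
    Object load() {
        if (!valid) {
            cached = globals.get(name);
            valid = true;
        }
        return cached;
    }

    // Invoked when the global is rebound; the next load() refills the cache.
    void invalidate() { valid = false; }
}
```

In the speculative setting, the invalidation would be triggered by the recovery machinery when code rebinds a cached global, which is the rarely exercised case the paper falls back on.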

richards benefits comparatively less from global caching because most of its dispatch overhead is in the form of instance method invocations, which involve dictionary access. In addition, richards uses the dictionary storage mechanism for the instance variables of its primary data type. Elimination of synchronization overhead from these frequent dictionary accesses is responsible for the majority of the performance improvement. With a few more simple optimizations Jython can even outperform Psyco on the richards benchmark, as discussed in the following section.
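The synchronization that the HashMap optimization removes can be seen in simplified form below. This is an illustrative stand-in: real Jython dictionaries are PyStringMaps, and the unsynchronized path is safe only inside a hardware atomic region, which plain Java cannot express.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class AttrDict {
    // Nonspeculative representation: every access pays for a lock.
    private final Map<String, Object> locked =
        Collections.synchronizedMap(new HashMap<>());
    // Speculative representation: plain HashMap, atomicity provided
    // by the surrounding hardware transaction instead of a lock.
    private final Map<String, Object> plain = new HashMap<>();

    Object getLocked(String key)      { return locked.get(key); }
    Object getSpeculative(String key) { return plain.get(key); }

    void put(String key, Object value) {
        locked.put(key, value);
        plain.put(key, value);
    }
}
```

Both paths return the same values; the difference is purely the per-access synchronization cost, which dominates richards' frequent attribute dictionary traffic.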

The pyparsing results demonstrate the potentially large baseline performance gap between Jython and interpreted execution. While pyparsing includes both object-oriented dispatch and frequent dictionary access, execution is dominated by complex control flow. In particular, pyparsing uses Python exceptions to implement backtracking. Python exceptions in Jython incur a double penalty: in stock Jython, the complex Python exception state is re-created at the beginning of a Jython exception handler, regardless of whether or not it is used. pyparsing's exception handlers do not examine the exception state, thus this behavior causes needless object allocation. The majority of the methods in pyparsing do indeed either raise an exception or invoke a method which does, so speculative versions are rarely executed and little benefit is achieved.
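The allocation penalty described above comes from eagerly materializing exception state that most pyparsing handlers never read. A deferred variant, sketched below with illustrative names (not Jython's actual classes), allocates only on first inspection:

```java
class LazyExcState {
    static int allocations = 0;  // instrumentation for the sketch

    static class PyExcInfo {
        PyExcInfo() { allocations++; }  // stands in for the costly re-creation
    }

    private PyExcInfo state;  // created on demand only

    // Handlers that never inspect the exception state never pay for it.
    PyExcInfo get() {
        if (state == null) state = new PyExcInfo();
        return state;
    }
}
```

For a handler that merely catches and continues, as pyparsing's backtracking handlers do, the allocation count stays at zero.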

5.1 More richards optimizations

To explore the performance potential of Jython with more aggressive speculative optimization, we manually applied several additional transformations to the richards benchmark, in addition to the HashMap and cache_globals optimizations discussed in the previous section. These changes target the primary bottleneck in richards: accesses to objects of class TaskState (representing processes in an OS scheduler simulation) and its subclasses. Figure 7 graphs the performance improvement.

Jython does not generate Java classes for user-defined Python classes such as TaskState. Instead, instance methods of these classes are defined as static methods of a class corresponding to the module in which the class is defined. For example, TaskState.isTaskHolding(), which returns its task_holding attribute, compiles to:

    public class richards {
        private static PyObject isTaskHolding$22(PyFrame frame) {
            return frame.getlocal(0).__getattr__("task_holding");
        }
        ...

Figure 6: Jython and Psyco speedups over CPython on three Python benchmarks. [Figure: bars for Jython/JRockit, Jython/HotSpot and Psyco on pystone, richards and pyparsing; series unmodified, HashMap, HashMap+cache_globals; y-axis speedup vs. CPython from 0.0 to 2.0, one bar labeled 6.00.]

Figure 7: Hand-optimized Jython and Psyco speedups over CPython on the richards benchmark. [Figure: bars for Jython/JRockit, Jython/HotSpot and Psyco; series task-class, virt-frames, virt-methods; y-axis speedup vs. CPython from 0.8 to 2.0.]

task-class. Optimize access to TaskState attributes by storing them in Java instance variables rather than in a PyStringMap. Attempts to access the dictionary directly are redirected to the instance variables, as with the cache_globals optimization (Section 3.3.2); recovery is only necessary when the default attribute access behavior is overridden. If the cast to TaskState fails, a Java exception is raised which triggers a rollback; the original code will then generate the Pythonically correct TypeError.

    public class TaskState extends PyObjectDerived {
        PyObject task_holding;
        /* rest of class */
    }
    private static PyObject isTaskHolding$22(PyFrame frame) {
        try {
            return ((TaskState)frame.getlocal(0)).task_holding;
        } catch (SpeculationException e) {
            /* recovery/nonspeculative code */
        }
    }

virt-frames. For TaskState methods implemented as static methods, eliminate the intermediate PyFrame argument used to encapsulate the method's parameters and local variables. An atomic region surrounds the call site rather than the callee. Recovery is needed if a method is redefined; any atomic region containing code which directly references the static method must be invalidated. The PyFrame version of the method remains for compatibility with separately compiled code.

    private static PyObject isTaskHolding$22(TaskState t) {
        return t.task_holding;
    }
    private static PyObject isTaskHolding$22(PyFrame frame) {
        /* nonspeculative code */
    }

virt-methods. Move the TaskState method implementations into the TaskState class. While one might expect invoking a static method with a class's instance as its first argument would be as fast as an instance method on that class, the latter was faster with both tested JVMs.

    public class TaskState extends PyObjectDerived {
        /* rest of class */
        public PyObject isTaskHolding() {
            return task_holding;
        }
    }

6. CONCLUSION

In this paper, we have demonstrated that hardware transactional memory systems can facilitate the development of high-performance dynamic language implementations by exposing a simple interface for expressing common case speculation. A low-overhead hardware checkpoint/rollback mechanism can offer a largely transparent form of runtime adaptation, while the simplicity of common case code is an effective target for existing dynamic optimization systems.

7. ACKNOWLEDGMENTS

The authors wish to thank Paul T. McGuire for providing the Pyparsing Verilog parser benchmark and Jim Baker, Philip Jenvey and Tobias Ivarsson for assistance with Jython. Pierre Salverda and the anonymous reviewers offered valuable suggestions.

References

[1] Pystone benchmark. URL http://svn.python.org/projects/python/trunk/Lib/test/pystone.py.

[2] J. Baker et al. The Jython Project. URL http://jython.org/.

[3] J. Clements. Portable and high-level access to the stack with Continuation Marks. PhD thesis, Northeastern University, 2006.

[4] J. Hugunin. A Dynamic Language Runtime (DLR). URL http://blogs.msdn.com/hugunin/archive/2007/04/30/a-dynamic-language-runtime-dlr.aspx.

[5] J. Hugunin et al. IronPython: a fast Python implementation for .NET and ECMA CLI. URL http://www.codeplex.com/IronPython.

[6] A. Kuchling. Beautiful Code, chapter 18: Python's Dictionary Implementation: Being All Things to All People. O'Reilly, 2007.

[7] D. Lea et al. JSR 166: Concurrency Utilities. URL http://jcp.org/en/jsr/detail?id=166.

[8] M. Maly. Building a DLR Language - Dynamic Behaviors 2. URL http://blogs.msdn.com/mmaly/archive/2008/01/19/building-a-dlr-language-dynamic-behaviors-2.aspx.

[9] P. McGuire. Pyparsing: a general parsing module for Python. URL http://pyparsing.wikispaces.com/.

[10] N. Neelakantam, R. Rajwar, S. Srinivas, U. Srinivasan, and C. Zilles. Hardware atomicity for reliable software speculation. In Proceedings of the 34th Annual International Symposium on Computer Architecture, June 2007.

[11] C. Nutter et al. JRuby: a Java powered Ruby implementation. URL http://jruby.codehaus.org/.

[12] E. Phoenix et al. Rubinius: The Ruby virtual machine. URL http://rubini.us/.

[13] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In Proceedings of the 34th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2001.

[14] A. Rigo. Representation-based just-in-time specialization and the Psyco prototype for Python. In PEPM '04, August 2004.

[15] A. Rigo and S. Pedroni. PyPy's approach to virtual machine construction. In OOPSLA Dynamic Languages Symposium, Portland, Oregon, October 2006.

[16] J. Rose. Anonymous classes in the VM. URL http://blogs.sun.com/jrose/entry/anonymous_classes_in_the_vm.

[17] J. Rose et al. JSR 292: Supporting Dynamically Typed Languages on the Java Platform. URL http://jcp.org/en/jsr/detail?id=292.

[18] G. van Rossum. Unifying types and classes in Python 2.2, April 2002. URL http://www.python.org/download/releases/2.2.3/descrintro/.

[19] G. van Rossum et al. Python reference manual: Naming and binding. URL http://docs.python.org/ref/naming.html.

[20] G. van Rossum et al. Python library reference: The interpreter stack. URL http://docs.python.org/lib/inspect-stack.html.

[21] G. van Rossum et al. Python reference manual: The standard type hierarchy. URL http://docs.python.org/ref/types.html.

[22] M. Wolczko. Benchmarking Java with Richards and DeltaBlue. URL http://research.sun.com/people/mario/java_benchmarking/.