coms 6998 final paper


Compiling Python to C: An Introduction to RPython

[my name redacted] for Alfred Aho

Advanced Topics in Programming Languages and Compilers

Columbia University



Part 1 - A Brief Introduction to Python

Python is a bytecode-interpreted language, initially designed by Guido van Rossum in 1991 and presented to the alt.sources discussion board as

a language meant to interface with the Amoeba operating system. Since

then, it has grown to be one of the most popular languages in common

use today, featuring extensive library support and a massive developer

community. (Authors, 2011)

Python programs can either be compiled and executed as bytecode, or

interpreted in a console shell. For the purposes of this paper, we will only

consider the portion of the toolchain that deals with compilation and

bytecode interpretation.

The Python Language

A Demonstration Program

Here is a complete Python program consisting of an implementation of

Euclid's algorithm, followed by a short driver.

    def euclid(a, b):
        if a < b:
            return euclid(b, a)
        while b != 0:
            t = b
            b = a % b
            a = t
        return a

    # short driver (Python 2 print statement)
    if __name__ == '__main__':
        print euclid(549, 129)

While language grammar is not the topic of this paper, a few points stand out because they demonstrate the philosophy of simplicity that runs

throughout the reference CPython implementation.

Firstly, there are no visible line delimiters. In the place of semicolons, we

find newline characters dividing the program. Secondly, there are no

braces to mark blocks. Instead, a standard indentation scheme is used, in



this case four spaces. Finally, the parentheses typically seen around loop

and conditional checks in C and friends are absent.

Together, these points make for a language that is meant to be simple to

write and simple to comprehend.

Notable Language Semantics

In Python, almost everything happens at runtime. All decisions about

name binding and object creation occur when the program is run, and as

a result allow for some very interesting semantics.

No Declarations

Unlike C, C++, Java, and a great many other languages, we are not

required to declare our variables. Instead, we simply assign a name to a

value, and use it as we see fit. If the name is assigned again after it has been

created, it is simply rebound to the new value.

In fact, all variable assignments occur in this way. When a variable is not

present in the current namespace, the interpreter creates a new name

and assigns to it whatever value is being assigned. A name can also be

deleted using the del keyword, removing it from the current namespace.

In a sense, the state of a Python program can be thought of as the set of

name to object mappings, in which case a program is effectively a means

of ensuring that each desired name is mapped to the appropriate object.
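The binding rules above can be observed directly. The following is a minimal sketch, not part of the original program: the name counter is illustrative, and the namespace is modeled with an explicit dictionary handed to exec so that the name to object mapping can be inspected.

```python
# A namespace modeled as a plain dictionary (illustrative only).
namespace = {}

# Assigning to an unknown name creates the binding.
exec("counter = 1", namespace)
created = namespace["counter"]

# Assigning again rebinds the same name, even to an object of another type.
exec("counter = 'one'", namespace)
rebound = namespace["counter"]

# del removes the name; a later lookup fails at runtime with NameError.
exec("del counter", namespace)
try:
    exec("counter", namespace)
    lookup_failed = False
except NameError:
    lookup_failed = True

print(created, rebound, lookup_failed)
```

No declaration appears anywhere: the dictionary gains, changes, and loses the key purely as a consequence of executing statements.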

No Types

Just as we never declared any variables, we also never specify the type of

any variables. The euclid function takes two parameters, but we do not

specify their types at compile time. Instead, the types are implied by the

operations we perform on the variables.

For instance, the euclid function clearly lends itself to numerical types,

since it features operations that are typically associated with numbers.

We see comparison, comparison with zero, and the modulo operation.

The precise type of the numerical arguments is not known. We could pass

in ints, longs, or even integer-valued floats, and the operations would

succeed.



This type-freedom goes further, however. If we were to pass in instances

of a class that was jury-rigged to implement all of those operations, the

algorithm would run. In fact, any variable that supports these operations

will happily pass through this algorithm.

The crucial point is that the entire Python type system relies on runtime

discovery of what operations are available. If the above function were

passed a list, it would happily accept the variable, push the function

object onto the runtime stack, and begin executing it. Only when it

attempts to perform the modulo operation would it discover that the

object passed is not valid and raise an exception. (Incidentally, while one

might expect the comparisons to be the first operations to fail, Python

actually implements comparisons between lists.)
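Both behaviors can be sketched in a few lines. Boxed is a hypothetical class invented for this illustration; it is not a number, but it supplies the three operations the function uses. The sketch is written so that it also runs on current Python 3 interpreters.

```python
def euclid(a, b):
    if a < b:
        return euclid(b, a)
    while b != 0:
        t = b
        b = a % b
        a = t
    return a

class Boxed(object):
    """Hypothetical wrapper that supplies <, != and % without being a number."""
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value

    def __ne__(self, other):
        # The loop compares against the plain integer 0, so accept both kinds.
        other_value = other.value if isinstance(other, Boxed) else other
        return self.value != other_value

    def __mod__(self, other):
        return Boxed(self.value % other.value)

# Any object supporting the three operations passes through the algorithm.
result = euclid(Boxed(549), Boxed(129))

# A list survives the comparisons and only fails at the modulo operation.
try:
    euclid([1], [2])
    list_error = None
except TypeError as exc:
    list_error = exc

print(result.value, type(list_error).__name__)
```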

This runtime discovery of object capabilities will become very important in a few sections, when we discuss Object Spaces.

Everything is an Object

Upon close inspection of the program, a reader might object to the

statement of no declarations. After all, what is the def euclid(a, b):

doing if not declaring a function?

While the def syntax does borrow from languages that declare their

functions, in Python it is actually a statement whose side effect is the creation of a function object containing the relevant code, and the

binding of the name euclid to that object in the containing namespace.

This is a crucial distinction, because it means that function objects can be

passed around as arguments, assigned to variables, and even deleted.

For instance, the following code can be appended to the above example

to cause all subsequent calls to euclid to be executed recursively rather

than iteratively:

    def euclid_rec(a, b):
        if a < b:
            return euclid_rec(b, a)
        if b == 0:
            return a
        else:
            return euclid_rec(b, a % b)

    # rebind the name euclid to the recursive function object
    euclid = euclid_rec

    # performs recursive implementation
    print euclid(100, 20)

In a similar way, class declarations are in fact statements that result in

the creation and naming of a class object:

    class hello:
        def __init__(self):
            self.message = 'Hello World'

        def print_message(self):
            print self.message

These semantics of classes-and-functions-as-objects make for some

interesting and powerful capabilities. Functions can return custom-made

classes for specific purposes. For instance, operating system-specific

initialization typically uses this mechanism to construct file and system-

handling interfaces composed of the methods supported by the running

system. For instance, a file handling module might attempt to load a

Windows-specific filesystem interface, fail, and load a Linux-specific one

instead.
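A function that returns a custom-made class might look like the following sketch. The names make_file_interface and FileInterface are hypothetical and exist only for this illustration; real platform-specific modules are considerably more involved.

```python
import os

def make_file_interface(platform_name):
    """Hypothetical factory: build and return a class suited to platform_name."""
    if platform_name == 'nt':
        separator = '\\'
    else:
        separator = '/'

    # The class statement runs each time the function is called,
    # so every call creates a brand-new class object.
    class FileInterface(object):
        sep = separator

        def join(self, *parts):
            return self.sep.join(parts)

    return FileInterface

# os.name is 'nt' on Windows and 'posix' on Linux.
FileInterface = make_file_interface(os.name)
print(FileInterface().join('usr', 'lib'))
```

The caller never needs to know which variant it received, only that the returned class supports join.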

Everything is a Namespace

In addition to being passed around, objects support the same namespace

operations as the global namespace. They support name binding,

rebinding, and deletion. For instance, in the above class declaration demonstration, the __init__ method is attaching the message member

to that instance's namespace. In turn, each of the def statements is

nothing more than the familiar creation of a function object with a

binding to the enclosing namespace.
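As a small sketch of these namespace operations on an instance and on its class (the attribute names language and shout are illustrative, and the class is renamed Hello here):

```python
class Hello(object):
    def __init__(self):
        self.message = 'Hello World'

greeter = Hello()
greeter.language = 'en'        # bind a new name in the instance namespace
greeter.message = 'Goodbye'    # rebind an existing name
del greeter.language           # delete a name

# A function bound in the class namespace becomes a method of every instance,
# including instances that already exist.
Hello.shout = lambda self: self.message.upper()

print(vars(greeter))
print(greeter.shout())
```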

As such, object members can be modified at will. One common design

pattern that utilizes this capability involves replacing an object's handlers

with hooks that wrap them. For instance, logging of critical function

calls can be implemented as follows:

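    # log and event_handler are assumed to be defined elsewhere
    def log_function(f):
        # return a wrapper that logs, then forwards the call to f
        def wrapper(*args, **kwargs):
            log('About to call ' + str(f))
            return f(*args, **kwargs)
        return wrapper

    handler = event_handler()
    handler.f = log_function(handler.f)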



The handler object will now log all calls to the f function. To disable this

behavior, the f function can be replaced with its original value.

Built-ins

There is a notable exception to this namespace convention, however. Built-in objects do not support the same namespace operations as other

classes. These are objects that are implemented at the interpreter level,

and as a result have limited functionality by design.
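The difference is easy to probe. In this sketch, Plain and try_bind are hypothetical helpers; the exact exception type and message for built-ins may differ between interpreters, so only the exception's class name is reported.

```python
class Plain(object):
    pass

def try_bind(target):
    """Try to bind a new attribute on target; return the error name, or None."""
    try:
        target.extra = 1
        return None
    except (AttributeError, TypeError) as exc:
        return type(exc).__name__

results = {
    'user instance': try_bind(Plain()),
    'int instance': try_bind(5),
    'list instance': try_bind([]),
    'int type': try_bind(int),
}
print(results)
```

Only the instance of the user-defined class accepts the new name; the built-in instances and the built-in type reject it.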

Conclusion

The dynamism of Python's object system makes it very flexible. Since

objects are, at their heart, collections of name to object mappings, they

can be manipulated at runtime to perform any functionality desired. This

allows for some very interesting usage patterns, but does pose some problems with translation, as we will soon see.

Part 2 - Python Interpreter Semantics

Having seen the flexibility of Python's objects, we might wonder how

these constructs are implemented. We have seen that there are some

limitations to the object model, namely the rigidity of the built-in objects and methods, and that the ban on modifying operations on these objects

is an implementation detail of the interpreter and virtual machine.

The CPython Interpreter

As we make our way toward a discussion of the features and techniques

used to implement RPython, let us first discuss notable implementation

details of the CPython interpreter. The features and design patterns we

find here will again reveal themselves when we arrive at our destination.

Interpreter and VM Basics

After a Python program has been compiled to bytecodes by the compiler,

the output is passed to the virtual machine. Control flow is handled by

the interpreter, which is in essence a large switch statement that grabs a

bytecode from the program and chooses the appropriate handler. These

handlers manipulate the state of the VM itself.
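The dispatch loop can be sketched as a toy stack machine. This is an illustration under simplifying assumptions, not CPython's implementation: instructions are (opcode, argument) pairs, jump targets are list indexes, locals are looked up by name rather than by integer offset, and COMPARE_NE is an invented opcode.

```python
def run(code, consts, local_vars):
    """Interpret a list of (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0
    while True:
        opcode, arg = code[pc]
        pc += 1
        # The chain of tests below plays the role of the switch statement.
        if opcode == 'LOAD_FAST':
            stack.append(local_vars[arg])
        elif opcode == 'LOAD_CONST':
            stack.append(consts[arg])
        elif opcode == 'STORE_FAST':
            local_vars[arg] = stack.pop()
        elif opcode == 'BINARY_MODULO':
            right = stack.pop()
            left = stack.pop()
            stack.append(left % right)
        elif opcode == 'COMPARE_NE':
            right = stack.pop()
            left = stack.pop()
            stack.append(left != right)
        elif opcode == 'POP_JUMP_IF_FALSE':
            if not stack.pop():
                pc = arg
        elif opcode == 'JUMP_ABSOLUTE':
            pc = arg
        elif opcode == 'RETURN_VALUE':
            return stack.pop()
        else:
            raise ValueError('unknown opcode: %r' % (opcode,))

# The loop of Euclid's algorithm, hand-assembled for the toy machine.
program = [
    ('LOAD_FAST', 'b'),         # 0: while b != 0
    ('LOAD_CONST', 0),          # 1
    ('COMPARE_NE', None),       # 2
    ('POP_JUMP_IF_FALSE', 13),  # 3
    ('LOAD_FAST', 'b'),         # 4: t = b
    ('STORE_FAST', 't'),        # 5
    ('LOAD_FAST', 'a'),         # 6: b = a % b
    ('LOAD_FAST', 'b'),         # 7
    ('BINARY_MODULO', None),    # 8
    ('STORE_FAST', 'b'),        # 9
    ('LOAD_FAST', 't'),         # 10: a = t
    ('STORE_FAST', 'a'),        # 11
    ('JUMP_ABSOLUTE', 0),       # 12
    ('LOAD_FAST', 'a'),         # 13: return a
    ('RETURN_VALUE', None),     # 14
]

result = run(program, [0], {'a': 549, 'b': 129})
print(result)
```

Each handler touches only the stack, the locals, and the program counter, which together make up the state of this toy VM.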



Separation of Control Flow and Object Semantics

At the language level, there is a sort of implicit distinction between the

control flow of the program and the operations of the objects themselves.

Let us take a closer look at the implementation of Euclid's algorithm introduced in the previous section:

    def euclid(a, b):
        if a < b:
            return euclid(b, a)
        while b != 0:
            t = b
            b = a % b
            a = t
        return a

Let us forget what we intuitively know about numerical operations, and

remember that any object can be made to support these operations

(regardless of whether the operation makes sense or not). When we look

closely at this function, we find that we can treat the objects themselves

as black boxes whose operations are unknown to us. For each operation,

suppose we give the objects the benefit of the doubt and assume that

they will respond favorably to our attempts to perform operations.

Once we make this assumption, we are left with nothing more than naked

control flow. Let us strip away the syntax surrounding binary operations and reveal what the semantics of the language actually see:

    def euclid(a, b):
        if a.operation1(b):
            return euclid.__call__(b, a)
        # suppose __zero is an object with value zero
        while b.operation3(__zero):
            t = b
            b = a.operation4(b)
            a = t
        return a

This semantic understanding perfectly captures the runtime binding

semantics of the language. Each of the blandly named operation methods

belongs to an object, whose type we do not know until the last possible

moment. The interpreter is responsible for maintaining control flow,



implementing assignment by manipulating the namespace, and

instructing objects to perform operations.

The objects themselves, on the other hand, are responsible for either

performing those operations, or issuing an exception in the event that

they cannot. It is worth noting that the exceptions raised by the objects when they do not support an operation are no different from exceptions

thrown from user code. As such, they can be caught and handled by the

interpreter.
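The same division can be made visible in ordinary Python by routing each operation through the standard operator module, which simply asks the operands to perform it. In this sketch, euclid_explicit, Unsupported, classify, and raise_user_error are illustrative names.

```python
import operator

def euclid_explicit(a, b):
    # Only control flow lives here; each operation is a request to the objects.
    if operator.lt(a, b):
        return euclid_explicit(b, a)
    while operator.ne(b, 0):
        t = b
        b = operator.mod(a, b)
        a = t
    return a

class Unsupported(Exception):
    """A user-defined exception, for comparison."""

def raise_user_error():
    raise Unsupported('raised from user code')

def classify(thunk):
    """Run thunk; return its result, or the name of the exception it raised."""
    try:
        return thunk()
    except (TypeError, Unsupported) as exc:
        return type(exc).__name__

outcomes = [
    classify(lambda: euclid_explicit(549, 129)),
    classify(lambda: euclid_explicit([1], [2])),
    classify(raise_user_error),
]
print(outcomes)
```

The exception raised by the lists when asked for a modulo is caught by exactly the same try/except machinery as the one raised from user code.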

Example with Disassembly

To demonstrate the ignorance with which the interpreter approaches

operations, here is a disassembly of the original euclid function as

provided by the CPython implementation. The particular disassembly

varies between Python versions and implementations, but the theme of the division between flow control and object semantics remains.

In the disassembly below, the leftmost numbers correspond to

the line in the original Python code that generated the opcodes, and the

opcodes are listed in uppercase. Names are compiled down to numeric

handles, which appear to the right of the opcodes. The parenthesized

annotations indicate which name in the original program the numeric

identifiers correspond to.

The originating lines are included before each block for clarity, although a raw disassembly would not contain this information. Additionally, the

>> indicates an instruction to which control may jump. These can be

considered the delimiters of basic blocks.

Also note that instructions that take an argument occupy three bytes, a

one-byte opcode followed by a two-byte argument, which is why most

instruction addresses increase by 3 rather than by the 4 commonly seen in

register-based machines. Instructions without an argument, such as

BINARY_MODULO, occupy a single byte.

if a < b:
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       25

return euclid(b, a)
  3          12 LOAD_GLOBAL              0 (euclid)
             15 LOAD_FAST                1 (b)
             18 LOAD_FAST                0 (a)
             21 CALL_FUNCTION            2
             24 RETURN_VALUE

while b != 0:
  4     >>   25 SETUP_LOOP              38 (to 66)
        >>   28 LOAD_FAST                1 (b)
             31 LOAD_CONST               1 (0)
             34 COMPARE_OP               3 (!=)
             37 POP_JUMP_IF_FALSE       65

t = b
  5          40 LOAD_FAST                1 (b)
             43 STORE_FAST               2 (t)

b = a % b
  6          46 LOAD_FAST                0 (a)
             49 LOAD_FAST                1 (b)
             52 BINARY_MODULO
             53 STORE_FAST               1 (b)

a = t
  7          56 LOAD_FAST                2 (t)
             59 STORE_FAST               0 (a)
             62 JUMP_ABSOLUTE           28
        >>   65 POP_BLOCK

return a
  8     >>   66 LOAD_FAST                0 (a)
             69 RETURN_VALUE

This disassembly clearly illustrates the ignorance of the interpreter to the

objects' internals. The parameters and all objects created in the function are included in the function's stack frame as a list of references indexed

by an integer offset. The interpreter does not know their types, and

neither does it care. The only scenario that could perturb it is an

uncaught exception, which is handled by quitting with an error.

For instance, note that the LOAD_FAST opcode takes as a parameter an

offset into the object array of the function frame. Once those objects are

pushed onto the stack, the COMPARE_OP opcode is responsible for instructing the objects on the stack to be compared to one another, again

without any awareness of type.
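A disassembly of this kind can be reproduced with the standard dis module. As noted above, the opcodes and addresses printed depend on the interpreter version, so this sketch shows no expected listing. It only relies on dis.dis and on the code object's co_varnames table, which holds the names that LOAD_FAST and STORE_FAST refer to by position.

```python
import dis

def euclid(a, b):
    if a < b:
        return euclid(b, a)
    while b != 0:
        t = b
        b = a % b
        a = t
    return a

# Print the bytecode of the live function object.
dis.dis(euclid)

# The local-variable table: position 0 is a, position 1 is b, position 2 is t.
print(euclid.__code__.co_varnames)
```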

This division yields the notion of the Object Space. An object space can

be thought of as an object-level implementation of the application

interface that is presented to the interpreter. This split between



interpreter space and object space is a crucial one, because various

object spaces can be used for various purposes, allowing the interpreter

to drive either an actual execution, or a more abstract implementation, as

we will see when we reach RPython's translation framework.
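As a rough sketch of the idea, the function below contains only control flow and delegates every object operation to a space object passed in as a parameter. The class and method names (ConcreteSpace, CountingSpace, lt, ne, mod, is_true, wrap) are chosen for this illustration and are not claimed to match PyPy's actual interface.

```python
class ConcreteSpace(object):
    """Hypothetical object space that really performs each operation."""
    def wrap(self, value):
        return value

    def is_true(self, w_obj):
        return bool(w_obj)

    def lt(self, w_a, w_b):
        return w_a < w_b

    def ne(self, w_a, w_b):
        return w_a != w_b

    def mod(self, w_a, w_b):
        return w_a % w_b

class CountingSpace(ConcreteSpace):
    """Same interface, different purpose: count the modulo requests."""
    def __init__(self):
        self.requests = 0

    def mod(self, w_a, w_b):
        self.requests += 1
        return ConcreteSpace.mod(self, w_a, w_b)

def euclid_on(space, a, b):
    """Control flow only; every object operation is delegated to space."""
    if space.is_true(space.lt(a, b)):
        return euclid_on(space, b, a)
    while space.is_true(space.ne(b, space.wrap(0))):
        t = b
        b = space.mod(a, b)
        a = t
    return a

print(euclid_on(ConcreteSpace(), 549, 129))

counting = CountingSpace()
euclid_on(counting, 549, 129)
print(counting.requests)
```

Swapping the space changes what the run produces without touching the control flow at all.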

Garbage Collection

Python is a garbage collected language, meaning that the runtime must

devote resources to maintaining an awareness of the liveness of its

objects. Up until version 2.0 of the CPython interpreter, this task was

handled by using reference counting. However, reference counting

suffered from a critical weakness in the form of an inability to detect

reference cycles. For instance, consider the following code:

    lst = []          # lst count is 1
    lst.append(lst)   # lst count is 2
    del lst           # lst count is 1 - no deletion

In this snippet, lst is a list containing a reference to itself. When the del

statement is executed, the reference count is decremented to one, so it is not

garbage collected. However, there is now no name that points to the

object, either directly or indirectly, which means the object is now

garbage.

On the face of it, it would be natural to simply implement a traditional

garbage collector, such as mark and sweep. These approaches work by finding the root of the object reference graph, traversing the graph and

marking the found objects as alive, and garbage collecting the rest.

However, CPython supports extension modules written in C, which means

that determining the root of the object graph is not possible for those

extensions, since there is no C interface for reporting objects created by

C-language extensions. As a result, the CPython garbage collection

scheme is a combination of reference counting and a periodic cycle

detector. (Schemenauer, 2000)
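The interplay of the two mechanisms can be observed with the standard gc and weakref modules. This sketch assumes CPython; other implementations reclaim objects on a different schedule. Node is a hypothetical class, used because a weak reference cannot be taken to a plain list.

```python
import gc
import weakref

class Node(object):
    pass

# Keep the periodic cycle detector from running in the middle of the experiment.
gc.disable()

node = Node()
node.self_ref = node          # the object now refers to itself
probe = weakref.ref(node)     # a weak reference does not keep the object alive

del node                      # the count drops, but the cycle keeps it above zero
alive_after_del = probe() is not None

gc.collect()                  # run the cycle detector explicitly
alive_after_collect = probe() is not None

gc.enable()
print(alive_after_del, alive_after_collect)
```

Reference counting alone leaves the object in memory after the del; only the cycle detector recognizes it as garbage.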

Part 3 - Enter RPython



Now that we have discussed the features of the CPython implementation,

let us turn our focus toward the RPython compiler.

RPython is a strict subset of the Python language. The goal of the project

is to develop a dialect of Python that can support whole-program static analysis. This effort was launched in order to develop a toolkit for the

construction of virtual machines for dynamic languages, such as Python.

With this toolkit, a developer could specify his program in a high level

language, namely RPython, and have it compiled down to some lower-

level language for fast execution.

While RPython currently supports backends for the Java Virtual Machine

and Common Language Runtime, the C backend is the most stable and

well-developed. Because of this and the universality and approachability

of C, we will restrict ourselves to discussion of the C backend for the purposes of this paper.

Before we describe the proper translation toolchain, let us first consider

the obstacles to translating Python to C. (Rigo, Hudson, & Pedroni)

Dynamic Features Make Static Analysis Difficult

In general, it is impossible to prove almost anything about a Python

program. As we have seen, any construct that seems to resemble a

feature of a language designed to support static analysis in fact generates a dynamically changing object. For instance, functions are

actually objects that contain code, and any object can be made to behave

as a function by adding a __call__ method to its class. In order to

compile to C, we would have to be able to prove a class's members,

which can change at any time.
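A short sketch shows how an existing object can gain and lose the ability to be called while the program runs. Greeter and call_greeter are illustrative names.

```python
class Greeter(object):
    def __init__(self, greeting):
        self.greeting = greeting

def call_greeter(self, name):
    return self.greeting + ', ' + name

greeter = Greeter('Hello')
callable_before = callable(greeter)

# Binding __call__ on the class at runtime makes existing instances callable.
Greeter.__call__ = call_greeter
callable_after = callable(greeter)
message = greeter('world')

# Deleting the name takes the capability away again.
del Greeter.__call__
callable_at_end = callable(greeter)

print(callable_before, callable_after, message, callable_at_end)
```

Whether greeter can be used as a function therefore depends on when the question is asked, which is exactly what a static analysis cannot know in general.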

There is more trouble with classes, however. C is a statically typed

language, meaning that the type of every expression in the

program must be known at compile time. Meanwhile, Python is a

language where the only thing that is known at compile time is control flow.

What Python calls classes are actually namespaces which contain

references to objects. Any of these references can be mutated at runtime,

possibly as a result of some exponential computation. This, however,

could be remedied by use of type inference. The real trouble lies not so



much in the exact contents of every class, but rather in the number of

potential classes.

The Number of Types is Unbounded

Since classes can be created at runtime, consider the following function:

    def make_types(num):
        ret = []
        for i in range(0, num):
            name = str(i)
            # type(name, bases, namespace) builds a new class at runtime
            ret.append(type('class_' + name,
                            (object,),
                            {'a_' + name: i}))
        return ret

This function creates some number of distinct classes using the type

function, and returns them in a list. Any method calling this function

would have at its disposal any number of types from which it could

choose.

If we are performing static analysis with the intention of translating to C, then we need to generate struct declarations for every structure our code

will use. However, this is clearly impossible in the case of code involving

this function.

Python functions enjoy the same sort of dynamic creation as classes. In C,

the prototype of a function defines the type of function pointers that may

point to the function. Furthermore, the addition of syntax for calling

objects via the __call__ method introduces new complexity.

On the face of it, a function pointer could be kept in the object. However, since any function can be assigned to the __call__ method, the

prototype of the function pointer must be either inferred, which poses a

significant challenge, or null, which introduces the risk of a translated

program getting issued a SEGFAULT on improper function invocation

rather than exiting gracefully.



However, if we restrict our code to only generate a provably-bounded

number of classes, we can identify the classes that would be created by

using data flow analysis. For instance, the PyPy interpreter must create

wrappers for various objects in the interpreter. Instead of manually

specifying a wrapper class for every object, the code instead defines a

function that generates a class for that object. The crucial point is that the number of these objects is bounded, and therefore the

class-generating function is also entered a bounded number of times, ensuring

that the translator does not spin through the function forever.

In addition to not creating new classes, existing classes cannot have their

contents modified after startup. In other words, the class's namespace is

considered to be constant after a certain point. Otherwise, each

successive alteration would have to be captured and modeled by a new

type in C, assuming the results of the alterations are decidable in the first

place.

A Special Note on Functions

In Python, any variable can point to any function. If a function call with an

improper number of arguments is attempted, the runtime would detect it

and issue an error. This is not the case in C, however. Only function pointers of the appropriate type can point to a function, and an error of

mismatched arguments must be discovered at compile time, not at

runtime.

To solve this, RPython places a restriction on the use of function objects.

Functions are first class objects, but variables that hold them must only

hold functions that are deemed similar enough by the type inferencer.

The documentation does not currently define "similar enough", although

it does promise that the toolchain will emit explicit errors and not

obscure crashes.
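The following sketch shows the situation the restriction is meant to exclude: in full Python, one variable may hold functions with incompatible signatures, and the mismatch surfaces only when the call is attempted. The function names are illustrative.

```python
def add(a, b):
    return a + b

def negate(a):
    return -a

def apply_to_pair(operation):
    """Call operation with two arguments; arity is checked only here, at call time."""
    try:
        return operation(1, 2)
    except TypeError as exc:
        return 'runtime error: ' + type(exc).__name__

operation = add
first = apply_to_pair(operation)

operation = negate            # same variable, incompatible signature
second = apply_to_pair(operation)

print(first, second)
```

A C compiler would have to reject the second assignment outright, since no single function pointer type covers both functions.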

Globals

Global definitions correspond roughly to C's static declarations. As such,

their type and value must be known at compile time. A simple

restriction is placed on globals, namely that they are considered constant,

and cannot be modified after they are defined.



The Translation Path

Now that we have described the limitations placed on the language, we

can delve into the process of the translation itself. The RPython compiler

is written in Python, and is interpreted by a standard interpreter. For the purposes of this paper, we assume that this interpreter is CPython.

For reference, here is a graphic representing the RPython translation path

(Krekel & Bolz, 2005):

Parsing

The RPython compiler is interesting in that it does not perform parsing

on the Python text file. Instead, on receiving an input file, it uses the

interpreter's compilation functionality to interpret the input file.

Eventually, the interpreter reaches the ready point, which consists of a

call into the entry point of the RPython compiler.

That call inspects the objects produced by the interpretation of the

initialization portion of the input code. This consists of code objects in

the form of functions, which can be disassembled using the built-in dis module, and live Python objects, whose members can be retrieved by

Python's native introspection features.

In essence, the standard Python interpreter is used as a preprocessor for the

RPython language. The RPython compiler's input is not the program

itself, but rather the partially-executed state of the program, as

generated by the CPython interpreter. This state takes the form of class



and function objects in memory, as well as any global variables whose

values are to be compiled to static declarations.
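To make this concrete, the sketch below runs some initialization code and then examines the resulting live objects through ordinary introspection. inspect_live_objects is a hypothetical stand-in for the first step of a translator and is not RPython's actual entry point; SQUARES, Shape, and entry_point are likewise invented for the example.

```python
# Initialization code: arbitrary Python runs here, before any analysis.
SQUARES = tuple(i * i for i in range(5))

class Shape(object):
    sides = 4

    def area(self, width):
        return width * width

def entry_point(n):
    return SQUARES[n] + Shape().area(n)

def inspect_live_objects(func, cls):
    """Hypothetical first step of a translator: examine objects, not source text."""
    code = func.__code__
    return {
        'arguments': code.co_varnames[:code.co_argcount],
        'bytecode_length': len(code.co_code),
        'class_members': sorted(name for name in vars(cls)
                                if not name.startswith('__')),
        'global_value': func.__globals__['SQUARES'],
    }

report = inspect_live_objects(entry_point, Shape)
print(report)
```

The generator expression that built SQUARES is never seen by the inspection step; only its finished value is, which is what allows initialization code to use the full language.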

The Flow Object Space

At this point, the program's state constitutes an intermediate representation amenable to abstract interpretation. The RPython compiler

is part of the PyPy Python interpreter project, and shares some code with

the proper PyPy interpreter. In particular, it borrows the interpreter to

handle interpretation of the newly-minted Python bytecodes.

However, while PyPy uses a concrete object space to implement the full

spectrum of interaction between the interpreter and Python objects,

RPython uses a much simpler space called the Flow Object Space. This is

an object space that contains placeholder objects instead of fully featured

objects, and yet still receives the relevant requests for operations from the interpreter.

The aim of the flow object space is to generate flow graphs for the

program by way of interaction with the abstract interpreter. The abstract

interpreter goes over the entire program bytecode by bytecode, and

sends off requests for operations to the flow object space. The flow

object space, rather than servicing these requests, records the operations,

gradually building up a flow graph of the operations of the program from

the live code objects.
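The recording idea can be sketched for branch-free code. Variable and RecordingSpace are hypothetical names for this illustration, and the sketch makes no attempt to model branching or to reproduce RPython's actual flow object space.

```python
class Variable(object):
    """Placeholder standing in for a value that is unknown during analysis."""
    count = 0

    def __init__(self):
        Variable.count += 1
        self.name = 'v%d' % Variable.count

    def __repr__(self):
        return self.name

class RecordingSpace(object):
    """Hypothetical flow-style space: records operations instead of performing them."""
    def __init__(self):
        self.operations = []

    def _record(self, opname, *args):
        result = Variable()
        self.operations.append((opname, args, result))
        return result

    def mod(self, a, b):
        return self._record('mod', a, b)

    def add(self, a, b):
        return self._record('add', a, b)

def straight_line(space, a, b):
    """Branch-free code written against the space interface."""
    remainder = space.mod(a, b)
    return space.add(remainder, b)

space = RecordingSpace()
arg_a, arg_b = Variable(), Variable()
result = straight_line(space, arg_a, arg_b)

for opname, args, res in space.operations:
    print('%s = %s%r' % (res, opname, args))
```

No arithmetic is ever carried out; what remains after the run is a list of operations linked by placeholder variables, which is the raw material of a flow graph.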

One might expect that branching is a problem with this scheme. After all,

if the interpreter sees a branch, it will only choose one direction in which

to go. The RPython documentation claims that the interpreter is tricked

into interpreting two sides of a branch at once, without going into detail.

With such a paucity of description, let us take the documentation at its

word.

Type Inference

Once the control flow graph of the entire program is available, the type inferencer, called the annotator, can pass over the entire program and

infer the type of each variable. While the details of this type inferencer are

far beyond the scope of this paper, it suffices to say that the inferencer

works by forward propagation, starting with the types of the input arguments of

the entry point function as a base case.



The annotator begins with specific types for each variable, and gradually

works up to the most general. The annotation lattices are shown in [big].

Variables can change their type, but at no point may the types

diverge. In other words, after every branch is merged at a join point, the

types of each variable must be the same. For instance, the following code

is forbidden: (Rigo, Hudson, & Pedroni)

    if a == 1:
        b = 10
    else:
        b = 'a string'
    # b has conflicting types here
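The join-point rule can be sketched with a toy merge function. The lattice here is an assumption made for illustration (bool generalizes to int and nothing else merges); it is not RPython's actual annotation lattice, and the names merge, merge_branches, and AnnotationError are hypothetical.

```python
# Toy lattice, for illustration only: 'bool' generalizes to 'int'.
GENERALIZES_TO = {'bool': 'int'}

class AnnotationError(Exception):
    pass

def merge(type_a, type_b):
    """Return the most specific type covering both, or raise if none exists."""
    if type_a == type_b:
        return type_a
    if GENERALIZES_TO.get(type_a) == type_b:
        return type_b
    if GENERALIZES_TO.get(type_b) == type_a:
        return type_a
    raise AnnotationError('cannot merge %s with %s' % (type_a, type_b))

def merge_branches(then_types, else_types):
    """Merge the variable-to-type maps of two branches at a join point."""
    merged = {}
    for name in sorted(set(then_types) | set(else_types)):
        if name in then_types and name in else_types:
            merged[name] = merge(then_types[name], else_types[name])
        else:
            merged[name] = then_types.get(name, else_types.get(name))
    return merged

# Compatible branches generalize to a single type.
ok = merge_branches({'b': 'int'}, {'b': 'bool'})

# The forbidden example: an integer on one branch, a string on the other.
try:
    merge_branches({'b': 'int'}, {'b': 'str'})
    conflict = None
except AnnotationError as exc:
    conflict = str(exc)

print(ok, conflict)
```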

Specialization

From this point forward, the program undergoes direct compilation down to the target environment. From the flow graph, a low level flow graph is

generated, conforming to either a low-level, C-like type system for the C

backend called lltypesystem, or an object-oriented type system for the

JVM and CLI backends, called ootypesystem. Given this lower level flow

graph, the appropriate code generator can generate target code.

Conclusion



The RPython language was developed to serve as a framework for the

specification of dynamic language virtual machines. The project itself is

called PyPy, as is the flagship Python interpreter.

The PyPy interpreter is written entirely in RPython, and can be either

interpreted by standard CPython, or translated to C for release. As of this writing, the PyPy interpreter is considered to be the fastest Python

implementation available today, boasting speed increases over the

reference CPython implementation in excess of ten times for some tests.

(Authors, PyPy Speed, 2011) In addition, it is very compliant, including

almost all features of the CPython implementation.

The use of RPython as an implementation language and framework allows

the PyPy project to be written in a high-level language with concise

features, but be compiled to a low-level language for fast execution. The

high level specification has allowed for a very flexible architecture. For instance, while CPython's garbage collection scheme consists of manually

written reference counts, PyPy's scheme can be chosen at compile time as

a flag.

In addition to Python, the VM specification framework is flexible enough

to allow specification of other languages. For instance, JS-PyPy is a

Javascript interpreter written in RPython and compiled using the RPython

compiler and VM toolkit. (Santagada)

Compilation of Python to C is a common question among Python beginners, and while such a translation is made impossible by the

semantics of the general Python language, RPython shows that simple

restrictions can be placed on the language to make the translation

possible.



Bibliography

Authors, P. (2011, December 10). PyPy Homepage. Retrieved December 10, 2011, from General Python FAQ: http://docs.python.org/faq/general#why-was-python-created-in-the-first-place

Authors, P. (2011, December 10). PyPy Speed. Retrieved December 10, 2011, from PyPy Speed: http://speed.pypy.org/

Krekel, H., & Bolz, C. F. (2005, December 28). PyPy - The new Python implementation on the block. Retrieved December 8, 2011, from PyPy Homepage: http://codespeak.net/pypy/extradoc/talk/22c3/hpk-tech.html

Rigo, A., Hudson, M., & Pedroni, S. Compiling Dynamic Language Implementations. European Commission within the Sixth Framework Programme.

Santagada, L. (n.d.). PyPy Homepage. Retrieved December 8, 2011, from JS-PyPy: PyPy's Javascript interpreter: http://codespeak.net/svn/pypy/lang/javascript/trunk/js/javascript-interpreter.txt

Schemenauer, N. (2000, December 6). Arctrix. Retrieved December 8, 2011, from Garbage Collection for Python: http://arctrix.com/nas/python/gc/