1 - Introduction to Compilers (Introduction to Compilers.pdf)


<ul><li><p>Introduction to Compilers</p><p>A.G. - Material from The Essence of Compilers by Robin Hunter (Prentice Hall)</p></li><li><p>Compiler</p><p>THE COMPILATION PROCESS </p><p>The transformation of source code to object code is usually referred to as the compilation process and is the process performed by the compiler. A language compiler may also be referred to as an implementation of the language. The object code produced by a compiler may be in the form of machine code for some machine (computer) or assembly code, or possibly some intermediate code, to be further transformed (by other tools) into assembly code or machine code for some machine. Alternatively, the intermediate code may be directly executed by means of an interpreter. </p><p>This text is principally about the compilation process, but the opportunity will be taken to discuss other applications of textual analysis as well. As far as compilation is concerned, we will spend more time on analysis than on synthesis, because of the greater generality and applicability of the ideas involved in the analysis stage of compilation, compared with the relatively ad hoc and machine-dependent issues involved in synthesis. </p><p>Compiler technology has advanced considerably since the early days of computers and it is now possible to automate the production of compilers to a large extent, using widely available tools to produce analysers at least, though the automatic production of code generators is less well advanced. A common theme in the chapters to come will be the extent to which compiler production can be automated, and we will make considerable use of the analyser generator tools Lex and YACC. </p><p>Since the language normally used to implement compilers using Lex and YACC is C, we will normally think of C as the implementation language (the language in which the compiler is written) and describe algorithms in C. 
In order to avoid having to switch from one language to another too often, we will also tend to think of the language being implemented as C. However, we will also use other languages to illustrate specific points, as appropriate. </p><p>1.3 The compilation process </p><p>The compilation process is essentially a transformation from one language to another: from the source code, which the programmer writes or which is generated automatically from some higher-level representation, to the object code, which is executed on the machine, possibly after some further transformation. The situation is shown diagrammatically in Figure 1.1. </p><p>As has been mentioned, the compiler will also involve a third language, the implementation language. This may be the same language as the source code or the object code, but need not be. </p><p>[Figure 1.1: the compilation process transforms source code into object code.] </p></li><li><p>T-diagram</p><p>If possible, the compiler should be written in a language that is good for writing compilers in, either because of the language's intrinsic merits, such as lack of error-proneness, or because of its availability and compatibility with development tools. </p><p>It is convenient, as we will see, to represent the three languages involved in an implementation by means of a T-diagram showing each of the languages in a different arm of the T. Figure 1.2 represents a compiler that is written in C and translates Java into Bytecode (the language interpreted by the Java Virtual Machine). </p><p>Figure 1.3, on the other hand, represents a Pascal compiler written in M-code and producing M-code. This example illustrates the fact that an operational compiler will normally be written in, and produce code for, the machine on which it will run. In some circumstances, however, a piece of software, perhaps for an embedded system, will be compiled on a different machine from the one on which it is intended to run. 
In this case there are two machine codes involved: the one for the machine on which the compiler is run (the language in which the compiler will be written), and the one for the machine on which the software will run (the language which will be generated by the compiler). For example, Figure 1.4 might represent the compiler used for compiling software for embedded systems. </p><p>While an executable compiler has to be in the code of the machine on which it is being run, this code will very often not be a suitable one for compiler writing, since it is likely to be at a very low level. The normal way to obtain a program in a low-level language, of course, is to compile it from a high-level language, and compilers themselves are usually written in high-level languages. </p><p>[Figure 1.2: a compiler that is written in C and translates Java into Bytecode (the language interpreted by the Java Virtual Machine).] </p><p>[Figure 1.3: a Pascal compiler, written in M-code, producing M-code.] </p></li><li><p>Building compilers</p><p>How do we write the first compiler? How do we port one compiler from one machine to another?</p></li><li><p>Building compilers</p><p>[Figure 1.4: a compiler, written in H-code, translating C++ into T-code.] </p><p>Compilers written in high-level languages are then compiled by means of a pre-existing compiler into their executable form. Here we have an example of a compiler, the pre-existing one, having compilers both as its input and its output. This can be represented by three adjacent T-diagrams, as shown in Figure 1.5. </p><p>The top left compiler is the compiler as it was originally written in its own language, C++, and the top right compiler is the executable compiler, obtained after it has been compiled by the bottom compiler. T-diagrams may be joined together in this way to show how one compiler may be obtained from another, as long as certain consistency rules are observed. 
For example, the two languages in the corresponding top two positions in the uppermost compilers must be the same, and the two adjacent occurrences of C++ and M-code must each refer to the same language. The bottom occurrence of M-code could in fact be any language, but would dictate the machine on which the compiler is compiled. </p><p>T-diagrams may be used to illustrate how a compiler may be ported from one machine to another. Given a compiler that runs on machine A, the implementation language at least will need to be changed in order for it to run on machine B. Rather than attempting to translate the machine code of one machine into the machine code of another, it will probably be simpler to go back to the original version of the compiler written in a high-level language and compile this into machine code for B. </p><p>In many cases it will also be necessary to change the code output by the compiler, presumably to be the code of machine B. This is another matter. </p><p>[Figure 1.5: three adjacent T-diagrams. Example of a compiler having compilers both as its input and its output.]</p><p>A more sophisticated compiler for the full language can be built on top of a simple compiler for a subset of the language.</p><p>This can be repeated as necessary.</p><p>How do we write the first compiler?</p></li><li><p>Building compilers</p><p>How do we port one compiler from one machine to another?</p><p>[Figure 1.8: using a universal intermediate language (UIL), m languages can be implemented on n machines.] </p><p>As far as code optimisation is concerned, the need for it is varied. If very efficient code is required, then extensive optimisation will be performed by the compiler. However, in many environments, the execution speed of the software is not critical and little optimisation will be required. 
It turns out that some optimisations are cheap to perform and are often included in compilers, whereas others, especially global as opposed to local forms of optimisation, are expensive in terms of both time and space at compile time, and are rarely performed. Many compilers provide a flag that the user can use to indicate whether extensive (and expensive) optimisation should be performed or not. </p><p>Every constant and variable appearing in the program must have storage space allocated for its value during the storage allocation phase. This storage space may be one of the following types: </p><p>Static storage, if its lifetime is the lifetime of the program and the space for its value, once allocated, cannot later be released. </p><p>Dynamic storage, if its lifetime is a particular block or function or procedure in which it is allocated, so that it may be released when the block or function or procedure in which it is allocated is left. </p><p>Global storage, if its lifetime is unknown at compile time and it has to be allocated and deallocated at run time. The efficient control of such storage usually implies run-time overheads. </p><p>After space is allocated by the storage allocation phase, an address, containing as much as is known about its location at compile time, is passed to the code generator for its use. </p><p>The synthesis stage of compilation, unlike the analysis stage, is not so well suited to automation, and tools to support its production are not so widely available. The early idea of a compiler-compiler was a piece of software whose input was the specification of a language and a machine, and whose output was a compiler for the language on the machine. </p><p>UIL = Universal Intermediate Language</p><p>A UIL would considerably simplify the problem, but defining it proved to be elusive. 
We can try though ...</p></li><li><p>Structure of a compiler</p><p>Changing the code output by the compiler may or may not be simple, depending, among other things, on how self-contained the code production aspects of the compiler are. </p><p>1.4 Stages, phases and passes </p><p>A well-written compiler is highly modular in design, and should present a good example of a well-structured program. Logically, the compilation process is divided into stages, which are in turn divided into phases. Physically, the compiler is divided into passes. We will describe these terms in more detail. </p><p>As we have seen, the principal (and often the only) stages that are represented in a compiler are analysis, in which the source code is analysed to determine its structure and meaning, and synthesis, in which the object code is built or synthesised. In addition, however, there may be a pre-processor stage in which source files are included, macros expanded and so on. This stage is usually fairly straightforward and is mainly relevant to the languages C and C++. We will not consider it in detail. </p><p>Figure 1.6 shows the typical phases of the compilation process. The analysis stage is usually assumed to consist of three distinct phases: </p><p>1. Lexical analysis. 2. Syntax analysis. 3. Semantic analysis. </p><p>[Figure 1.6: the typical phases of compilation. Analysis: lexical analysis, syntax analysis and semantic analysis. Synthesis: machine-independent optimisation, storage allocation, machine code generation and optimisation of machine code.] </p></li><li><p>Lexical analysis</p><p>USE OF TOOLS </p><p>A lexical analyser generator takes as its input the lexical definition of a language, and produces as output a lexical analyser (a program in C for example) for the language. This is illustrated in Figure 1.9. </p><p>A syntax analyser generator takes as its input the syntactical definition of a language, and produces as output a syntax analyser (a program in C for example) for the language. This is illustrated in Figure 1.10. 
</p><p>Parser generators have been developed which support most popular parsing methods, but probably the most widely known is the Unix-based YACC. </p><p>[Figure 1.9: a lexical analysis generator takes the lexical structure of a language and produces a lexical analyser, which transforms a character sequence into a symbol sequence.] </p><p>[Figure 1.10: a parser generator takes a syntactic definition and produces a parser, which transforms a symbol sequence into a parse tree.] </p></li><li><p>Syntax analysis</p><p>The same generator pattern applies to syntax analysis: a parser generator takes as its input the syntactic definition of a language, and produces as output a parser for the language, which transforms the symbol sequence into a parse tree (Figure 1.10).</p></li></ul>
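The generator pattern in Figures 1.9 and 1.10 can be made concrete with a small Lex specification. This is a minimal sketch, not taken from the book: the token codes NUMBER and IDENT, and the rules themselves, are assumptions chosen for illustration; in a real compiler these codes would be shared with the YACC-generated parser through a common header.

```lex
%{
/* Hypothetical lexical definition: integers, identifiers and
   single-character symbols. NUMBER and IDENT are illustrative
   token codes, not part of any standard. */
#define NUMBER 257
#define IDENT  258
%}

%%
[0-9]+                 { return NUMBER; }   /* integer literal      */
[a-zA-Z][a-zA-Z0-9]*   { return IDENT; }    /* identifier           */
[ \t\n]+               { /* skip whitespace */ }
.                      { return yytext[0]; } /* e.g. '+', '(', ';'  */
%%

int yywrap(void) { return 1; }
```

Running Lex (or flex) on this file produces a C function, yylex(), that returns the code of the next symbol each time it is called: exactly the character-sequence-to-symbol-sequence transformation of Figure 1.9. A YACC-generated parser obtains its symbol sequence by calling this yylex().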
