
Course Script
INF5110: Compiler construction
Spring 2020

Martin Steffen


Contents

1 Introduction
  1.1 Introduction
  1.2 Compiler architecture & phases
  1.3 Bootstrapping and cross-compilation

2 Scanning
  2.1 Introduction
  2.2 Regular expressions
  2.3 DFA
  2.4 Implementation of DFAs
  2.5 NFA
  2.6 From regular expressions to NFAs (Thompson's construction)
  2.7 Determinization
  2.8 Minimization
  2.9 Scanner implementations and scanner generation tools

3 Grammars
  3.1 Introduction
  3.2 Context-free grammars and BNF notation
  3.3 Ambiguity
  3.4 Syntax of a "Tiny" language
  3.5 Chomsky hierarchy

4 Parsing
  4.1 Introduction to parsing
  4.2 Top-down parsing
  4.3 First and follow sets
  4.4 Massaging grammars
  4.5 LL-parsing (mostly LL(1))
  4.6 Error handling
  4.7 Bottom-up parsing

5 Semantic analysis
  5.1 Introduction
  5.2 Attribute grammars

6 Symbol tables
  6.1 Introduction
  6.2 Symbol table design and interface
  6.3 Implementing symbol tables
  6.4 Block-structure, scoping, binding, name-space organization
  6.5 Symbol tables as attributes in an AG

7 Types and type checking
  7.1 Introduction
  7.2 Various types and their representation
  7.3 Equality of types
  7.4 Type checking

8 Run-time environments
  8.1 Intro
  8.2 Different layouts
  8.3 Static layout
  8.4 Stack-based runtime environments
  8.5 Stack-based RTE with nested procedures
  8.6 Functions as parameters
  8.7 Parameter passing
  8.8 Virtual methods in OO
  8.9 Garbage collection

9 Intermediate code generation
  9.1 Intro
  9.2 Intermediate code
  9.3 Three address (intermediate) code
  9.4 P-code
  9.5 Generating P-code
  9.6 Generation of three address code
  9.7 Basic: From P-code to 3A-Code and back: static simulation & macro expansion
  9.8 More complex data types
  9.9 Control statements and logical expressions

10 Code generation
  10.1 Intro
  10.2 2AC and costs of instructions
  10.3 Basic blocks and control-flow graphs
  10.4 Code generation algo
  10.5 Global analysis


1 Introduction

Learning Targets of this Chapter

The chapter gives an overview over the different phases of a compiler and their tasks. It also mentions organizational things related to the course.

Chapter contents

1.1 Introduction
1.2 Compiler architecture & phases
1.3 Bootstrapping and cross-compilation

1.1 Introduction

This is the script version of the slides shown in the lecture. It contains basically all the slides in the order presented (except that overlays that are unveiled gradually during the lecture are not reproduced in that step-by-step manner). Normally I try not to overload the slides with written information and rely on speaking and telling a story (where the slides are used as guidance). Such additional information, however, is presented in this script version, so the document can be seen as an annotated version of the slides. Many explanations given during the lecture are written down here, but the document also covers background information, hints to additional sources, and bibliographic references. Some of the links or other information in the PDF version are clickable hyperlinks.

Course info

Sources

Different from some previous semesters, one recommended book for the course is Cooper and Torczon [6], besides, as in previous years, Louden [9]. We will not be able to cover the whole book anyway (nor the full Louden [9] book). In addition, the slides will draw on other sources as well. Especially in the first chapters, for the so-called front-end, the material is so "standard" and established that it almost does not matter which book to take.

As far as the exam is concerned: Traditionally, it has always been a written exam, and it's "open book". This influences the style of the exam questions. In particular, there will be no focus on things one has "read" in one or the other pensum book; after all, one can bring along as many books as one can carry and look it up. Instead, the exam will require doing certain constructions (analyzing a grammar, writing regular expressions, etc.), so, besides reading background information, the best preparation is doing the exercises as well as working through previous exams.

For spring 2020, the exam form has changed to oral (and pass/fail); that changes the rules of the game, and the above remarks don't apply to 2020.

Course material from:

A master-level compiler construction lecture has been given for quite some time at IFI. The slides are inspired by earlier editions of the lecture, and some graphics have just been clipped in and not (yet) been ported. The following list contains people designing and/or giving the lecture over the years, though more people have probably been involved as well.

• Martin Steffen ([email protected])
• Stein Krogdahl ([email protected])
• Birger Møller-Pedersen ([email protected])
• Eyvind Wærstad Axelsen ([email protected])

Course’s web-page

http://www.uio.no/studier/emner/matnat/ifi/INF5110

• overview over the course, pensum (watch for updates)
• various announcements ("beskjeder"), etc.

Course material and plan

• based roughly on [6] and [9], but other sources will play a role as well. A classic is "the dragon book" [2]; we might use part of the code generation material from there

• see also the errata list at http://www.cs.sjsu.edu/~louden/cmptext/
• approx. 3 hours teaching per week (+ exercises)
• mandatory assignments (= "obligs")

  – O1 published mid-February, deadline mid-March
  – O2 published beginning of April, deadline beginning of May

• group work of up to 3 people recommended. Please inform us about such planned group collaboration

• slides: see updates on the net

Exam

(originally planned: 12th June, 09:00, 4 hours, written, open-book) now: oral exam, 11th and 12th June.


Motivation: What is CC good for?

• not everyone is actually building a full-blown compiler, but
  – fundamental concepts and techniques in CC
  – most, if not basically all, software reads, processes/transforms and outputs "data" ⇒ often involves techniques central to CC
  – understanding compilers ⇒ deeper understanding of programming language(s)
  – new languages (domain specific, graphical, new language paradigms and constructs . . . )
• ⇒ CC & their principles will never be "out-of-fashion".

Full employment for compiler writers

There is also something known as full employment theorems (FET), for instance for compiler writers. That result is basically a consequence of the fact that the properties of programs (in a full-scale programming language) are in general undecidable. "In general" means: for all programs; for a particular program or some restricted class of programs, semantical properties may well be decidable.

The most well-known undecidable question is the so-called halting problem: can one decide generally whether a program terminates or not (and the answer is: provably no). But that's only one particular and well-known instance of the fact that (basically) all non-trivial semantic properties of programs are undecidable (that's Rice's theorem). That puts some limitations on what compilers can do and what not. Still, compilation of general programming languages is of course possible, and it's also possible to prove the compilation generally correct: a compiler is just one particular program itself, though maybe a complicated one. What is not possible is to generally decide a property about all programs (like whether they halt or not).

What limitations does that imply for compilers? The limitations concern in particular optimizations. An important part of compilers is to "optimize" the resulting code (machine code or otherwise). That means to improve the program's performance without changing its meaning otherwise (improvements like using less memory or running faster, etc.). The full employment theorem does not refer to the fact that targets for optimization are often conflicting (there may often be a trade-off between space efficiency and speed). The full employment theorem rests on the fact that it's provably undecidable how much memory a program uses or how fast it is (which is unsurprising, since basically all such questions are undecidable). Without being able to (generally) determine such performance indicators, it should be clear that a fully optimizing compiler is unobtainable. "Fully optimizing" is a technical term in that context, and when speaking about optimizing compilers or optimization in a compiler, one means: do some effort to get better performance than you would get without that effort (and the improvement could be always or on average). An "optimal" compiler is not possible anyway, but efforts to improve the compilation results are an important part of any compiler.

More specifically, the FET for compiler writers is often phrased in a slightly refined manner, namely:


It can be proven that for each "optimizing compiler" there is another one that beats it (which is therefore "more optimal").

Since it’s a mathematical fact that there’s always room for improvement for any compilerno matter how “optimized” already, compiler writers will never be out of work (even inthe unlikely event that no new programming languages or hardwares would be developedin the future. . . ).

The proof of that fact is rather simple (if one assumes the undecidability of the halting problem as given, whose proof is more involved). However, the proof is not constructive in that it does not give a concrete construction of how to actually optimize a given compiler. Well, of course, if that could be automated, then compiler writers would again face unemployment . . .

1.2 Compiler architecture & phases

What is important in the architecture is its "layered" structure, consisting of phases. It is basically a "pipeline" of transformations, with a sequence of characters as input (the source code) and a sequence of bits or bytes as ultimate output at the very end. Conceptually, each phase analyzes, enriches, transforms, etc., and afterwards hands the result over to the next phase.

This section gives just a taste of the general, typical phases of a full-scale compiler. Of course, there may be compilers in the broad sense that don't realize all phases. For instance, if one chooses to consider a source-to-source transformation as a compiler (known, not surprisingly, as an S2S or source-to-source compiler), there would be no machine code generation (unless, of course, it's a machine code to machine code transformation . . . ). Also, domain-specific languages may be unconventional compared to classical general-purpose languages and may consequently have an unconventional architecture. Also, the phases in a compiler may be more fine-grained, i.e., some of the phases from the picture may be sub-divided further. Still, the picture gives a fairly standard view on the architecture of a typical compiler for a typical programming language, and similar pictures can be found in all textbooks.

Each phase can be seen as one particular module of the compiler with a clearly defined interface. The phases of the compiler will naturally be used to structure the lecture into chapters or sections, proceeding "top-down" during the semester. In the introduction here, we briefly mention some of the phases and their functionality.


Figure 1.1: Structure of a typical compiler

Architecture of a typical compiler

Anatomy of a compiler


Pre-processor

• either separate program or integrated into compiler
• nowadays: C-style preprocessing sometimes seen as "hack" grafted on top of a compiler
• examples (see next slide):
  – file inclusion
  – macro definition and expansion
  – conditional code/compilation: Note: #if is not the same as the if programming-language construct.
• problem: often messes up the line numbers (among other things)

The C preprocessor is called a "hack" on the slides. C preprocessing is still considered a useful hack, otherwise it would not be around . . . But it does not naturally encourage elegant and well-structured code, just fixes for some situations. The C-style preprocessor has been criticized variously, as it can easily lead to brittle, confusing, and hard-to-maintain code. By definition, the preprocessor does its work before the real compiler kicks in: it massages the source code before it hands it over to the compiler. The compiler is a complicated program, and it involves complicated phases that try to "make sense" of the input source code string. It classifies and segments the input, cuts it into pieces, builds up intermediate representations like graphs and trees, which may be enriched by "semantical information". However, it does so not on the original source code but on the code after the preprocessor has made its rearrangements. Already simple debugging and error localization questions like "in which line did the error occur" may be tricky, as the compiler can make its analyses and checks only on the massaged input; it never even sees the "original" code.

Another aspect concerns file inclusion using #include. It is the single most primitive way of "composing" a program split into separate pieces into one program. Basically, instead of copy-and-pasting some code contained in a file literally, one simply "imports" it via the preprocessor. It's easy, understandable (and thereby useful), completely transparent even for a beginner, and is a trivial mechanism as far as compiler technology is concerned. If used in a disciplined way, it's helpful, but it's not really a decent modularization concept (or: it "modularizes" the program on the "character string" level but not on any more decent, programming-language level).

The lecture overall will not talk about preprocessing but focuses on the compiler itself.

C-style preprocessor examples

#include <filename>

Listing 1.1: file inclusion

#vardef #a = 5; #c = #a+1
...
#if (#a < #b)
...
#else
...
#endif

Listing 1.2: Conditional compilation

Also languages like TeX, LaTeX, etc. support conditional compilation (e.g., \if <condition> ... \else ... \fi in TeX). As a side remark: the sources for these slides and this script make quite some use of conditional compilation when compiling from the source code to the target code, for instance PDF: some text shows up only in the script version but not the slides version, pictures are scaled differently on the slides compared to the script . . .

C-style preprocessor: macros

#macrodef hentdata(#1,#2)
--- #1----#2---(#1)---
#enddef

...
#hentdata(kari, per)

Listing 1.3: Macros

The expansion of the macro call yields:

--- kari----per---(kari)---

Note: the code is not really C; it's used to illustrate macros similar to what can be done in C. For real C, see https://gcc.gnu.org/onlinedocs/cpp/Macros.html. Conditional compilation is done with #if, #ifdef, #ifndef, #else, #elif, and #endif. Definitions are done with #define.
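For comparison, here is a small example in actual C preprocessor syntax. It is only a sketch to illustrate the directives just mentioned; the macro names and the condition are made up for this illustration.

#include <stdio.h>

/* object-like macro definition */
#define BUFSIZE 512

/* function-like macro: expanded purely textually, before compilation proper */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* conditional compilation: resolved by the preprocessor,
   not at run-time like an ordinary if statement           */
#if BUFSIZE > 256
#define LARGE_BUFFER 1
#else
#define LARGE_BUFFER 0
#endif

int main(void) {
    printf("max: %d, large buffer: %d\n", MAX(3, 4), LARGE_BUFFER);
    return 0;
}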

Scanner (lexer . . . )

• input: “the program text” ( = string, char stream, or similar)• task

– divide and classify into tokens, and– remove blanks, newlines, comments . . .

• theory: finite state automata, regular languages

Lexer and scanner are synonymous. The task of the lexer is what is called lexical analysis (hence the name). That's distinguished from syntactic analysis, which comes afterwards and is done by the parser. The lecture will cover both phases to quite some extent, in particular parsing.


Scanner: illustration

a[index]␣=␣4␣+␣2

lexeme    token class           value
a         identifier "a"        2
[         left bracket
index     identifier "index"    21
]         right bracket
=         assignment
4         number "4"            4
+         plus sign
2         number "2"            2

Excerpt of the corresponding name table:

...
2     "a"
...
21    "index"
...

The terminology of tokens, token classes, lexemes, etc. will be made more clear in the chapters about lexing and parsing.

The input code snippet is supposed to be a sequence of characters (or a string). The blanks (space characters, or white spaces) are specially marked. The table afterwards shows the individual pieces of that string. Those pieces are called lexemes. Note that the white spaces are ignored; there is no white-space lexeme (which is typical when doing scanning). That does not mean that white space is completely "meaningless" in the sense that one could add and remove white space arbitrarily. That's common for most programming languages nowadays (and most written languages based on a lettered alphabet, like Western languages). We will see in the chapter about lexing that there were, for instance, versions of Fortran which treated white space as completely meaningless, in the sense that it was treated as if it were not there at all. Here, in the example, as in basically all programming languages, white space is not completely meaningless: it serves as a form of separator. For instance, the string index in the example counts as one lexeme, one unit of the overall string, which is classified as identifier in the table (the so-called token class). Note: if it had been written with one white space inside, as in in dex, then the scanner would have returned two identifiers. Presumably that would make the overall string syntactically wrong, but that's a question for the parser to decide, not the lexer. Note also that index, without the white space, is marked as one identifier, not as two or maybe 5 individual ones. That implies that the lexer tries to find the longest stretch of characters that can be interpreted (for instance here) as an identifier (uninterrupted by white space or other characters that are disallowed for identifiers). All that sounds obvious (because one is so much used to it), but, as mentioned, there are different ways to interpret white space (as meaningless, as a separator, or one may even interpret indentation, i.e., a sequence of white spaces or tabs, to have some grouping meaning beyond looking nice for the programmer). Rules governing the lexical aspects of the language cover all that: what are the allowed characters for identifiers, what is the overall allowed reservoir of characters (called the alphabet), what counts as white space (blanks, tabs, newlines, carriage returns, others?), what's a comment?
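To make the "longest stretch" idea concrete, here is a minimal hand-written sketch in C (not the scanner generator used later in the course; the function name and the identifier rule are made up for this illustration). It skips white space and then collects the longest run of identifier characters:

#include <ctype.h>
#include <stdio.h>

/* Scan one identifier starting at position *pos in input, using the
   "longest match" rule: keep reading as long as the characters are
   legal identifier characters. Returns the length of the lexeme.    */
static int scan_identifier(const char *input, int *pos, char *lexeme) {
    int len = 0;
    while (isspace((unsigned char)input[*pos]))   /* white space: only a separator */
        (*pos)++;
    while (isalnum((unsigned char)input[*pos]) || input[*pos] == '_') {
        lexeme[len++] = input[*pos];
        (*pos)++;
    }
    lexeme[len] = '\0';
    return len;
}

int main(void) {
    char lexeme[64];
    int pos = 0;
    scan_identifier("   index", &pos, lexeme);   /* one lexeme: "index"          */
    printf("%s\n", lexeme);
    pos = 0;
    scan_identifier("   in dex", &pos, lexeme);  /* first lexeme only: "in"      */
    printf("%s\n", lexeme);
    return 0;
}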


One may ask: what exactly are the lexical aspects of a language? A non-helpful and tautological answer is: those aspects that are dealt with by the lexer. A better answer is: those aspects that can be captured by regular expressions. Lexer generator tools (like lex and similar ones) can be seen as tools which allow one to specify the lexical aspects of a language by regular expressions, and they use that specification to generate a lexer or scanner program, basically realizing a finite state automaton that performs the lexing task. What cannot be covered by regular expressions resp. finite state automata is handed over to the next phase(s), the next one being the parser, which is responsible for syntactic aspects. Those are aspects that can be covered by a more expressive formalism, known as context-free grammars.

The parser is the phase after the lexer. It is responsible for checking the syntactic aspects of the language and hands over to the next phases an intermediate representation that captures the syntax of a syntactically correct program. This representation is called the "syntax tree". Actually, there are two kinds of trees involved when parsing a program, more precisely when parsing the token stream of a lexically correct program generated by the lexer. The two forms of syntax trees are known as concrete syntax tree or parse tree on the one hand and abstract syntax tree on the other. We will discuss these extensively in the corresponding parts of the lecture.

a[index] = 4 + 2: parse tree/syntax tree

expr
  assign-expr
    expr
      subscript-expr
        expr
          identifier "a"
        [
        expr
          identifier "index"
        ]
    =
    expr
      additive-expr
        expr
          number "4"
        +
        expr
          number "2"

a[index] = 4 + 2: abstract syntax tree

assign-expr
  subscript-expr
    identifier "a"
    identifier "index"
  additive-expr
    number "4"
    number "2"

The trees here are mainly for illustration. It's not meant as "this is how the abstract syntax tree looks" for the example. In general, abstract syntax trees are less verbose than parse trees. The latter are sometimes also called concrete syntax trees. The parse tree(s) for a given word are fixed by the grammar. One should more precisely say "context-free grammar", as there are also more expressive grammars, but without further qualification, the word "grammar" often just means context-free grammar. The abstract syntax tree is to some extent a matter of design. Of course, the grammar is also a matter of design, but once the grammar is fixed, the form of the parse trees is fixed as well. What is typical in the illustrative example is: an abstract syntax tree would not bother to add nodes representing brackets (or parentheses, etc.), so those are omitted. In general, ASTs are more compact, omitting superfluous information without omitting relevant information.

When saying the grammar fixes the form of the parse trees, it is not meant that, for a given sequence of tokens, there is exactly one parse tree. That kind of "fixing" is not meant; what is fixed is the general format of allowed parse trees. A grammar where for each input token stream there is at most one parse tree is called unambiguous; some grammars are and some are not. At any rate, ambiguous grammars are unwelcome, and parsers typically realize unambiguous grammars. Parser generators (like yacc and similar), when fed with an ambiguous grammar as specification, will indicate so-called conflicts. Those are points where the parser has different options as a reaction to an input, which is not a good thing. The parser would typically make some form of decision (like taking just the first option and ignoring the alternatives), but it's not a good sign. It typically indicates trouble with the grammar. To avoid misconceptions: an ambiguous grammar will lead to conflicts in such tools, but the other way around is not true: a parser may indicate conflicting situations even if the grammar is unambiguous. The reason is that parsers typically are not expressive enough to cover all kinds of context-free grammars, not even all unambiguous ones. They focus on (different) more restricted classes of context-free grammars. We will encounter different conflicts in the corresponding chapter.
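As a standard illustration (not necessarily the grammar used later in the script): the context-free grammar

exp → exp + exp | exp * exp | number

is ambiguous, since a token stream like number + number * number has two different parse trees, one grouping the addition below the multiplication and one the other way around. A parser generator fed with this grammar would report conflicts; one either rewrites the grammar (e.g., introducing separate nonterminals for terms and factors) or, in tools like yacc, resolves the conflicts with precedence declarations.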

Semantical analysis

Semantical analysis deals with properties more complex than the language's syntax. There are very many ingredients to be dealt with beyond syntax, which means the part that comes after parsing is often big and complicated and covers different things. Also, the underlying principles and theories are less "uniform"; it's more that various different concepts come into play. One typical phase that often comes directly after parsing and thus works directly with the AST is type checking. It can be understood as "decorating" the AST with type information, as illustrated in the following pictures. It may not be concretely implemented in a way that one adds information directly into the AST structure. Often, the semantic analysis phase especially works with a structure called the symbol table, which maintains information about syntactic entities for easy consultation during the analysis.

(One typical) Result of semantic analysis

• one standard, general outcome of semantic analysis: "annotated" or "decorated" AST
• additional info (non context-free):
  – bindings for declarations
  – (static) type information

assign-expr : ?
  subscript-expr : int
    identifier "a" : array of int
    identifier "index" : int
  additive-expr : int
    number "4" : int
    number "2" : int

• here: identifiers looked up wrt. declaration
• 4, 2: due to their form, basic types.
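To give a rough idea of how such type annotations can be computed, here is a minimal sketch in C. The node kinds, type names, and the function check are made up for this illustration and are not the representation used later in the course; in particular, identifier types would normally come from a symbol table rather than being stored in the node.

#include <stdio.h>

typedef enum { T_INT, T_ARRAY_OF_INT, T_ERROR } Type;
typedef enum { NUMBER, IDENTIFIER, SUBSCRIPT, ADDITIVE, ASSIGN } Kind;

typedef struct Node {
    Kind kind;
    struct Node *left, *right;  /* children, if any                              */
    Type declared;              /* identifiers: type as given by the declaration */
    Type type;                  /* the annotation computed by check() below      */
} Node;

/* Recursively decorate the (abstract syntax) tree with types. */
Type check(Node *n) {
    switch (n->kind) {
    case NUMBER:     n->type = T_INT;       break;   /* basic type, due to its form    */
    case IDENTIFIER: n->type = n->declared; break;   /* looked up wrt. its declaration */
    case SUBSCRIPT:                                  /* a[index]                       */
        n->type = (check(n->left) == T_ARRAY_OF_INT && check(n->right) == T_INT)
                  ? T_INT : T_ERROR;
        break;
    case ADDITIVE:                                   /* e1 + e2                        */
        n->type = (check(n->left) == T_INT && check(n->right) == T_INT)
                  ? T_INT : T_ERROR;
        break;
    case ASSIGN:                                     /* lhs = rhs                      */
        n->type = (check(n->left) == check(n->right) && n->left->type != T_ERROR)
                  ? n->left->type : T_ERROR;
        break;
    }
    return n->type;
}

int main(void) {
    /* build the AST for  a[index] = 4 + 2  by hand */
    Node a     = { IDENTIFIER, 0, 0, T_ARRAY_OF_INT, T_ERROR };
    Node index = { IDENTIFIER, 0, 0, T_INT,          T_ERROR };
    Node four  = { NUMBER,     0, 0, T_ERROR,        T_ERROR };
    Node two   = { NUMBER,     0, 0, T_ERROR,        T_ERROR };
    Node sub   = { SUBSCRIPT,  &a,    &index, T_ERROR, T_ERROR };
    Node add   = { ADDITIVE,   &four, &two,   T_ERROR, T_ERROR };
    Node asgn  = { ASSIGN,     &sub,  &add,   T_ERROR, T_ERROR };
    printf("%d\n", check(&asgn));   /* prints 0, i.e., T_INT */
    return 0;
}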

Non-negotiable is, of course, to generate correct code, i.e., code that correctly and for all programs reflects the language's intended semantics. In particular, it needs to realize all the fancy programming abstractions the language may offer. Even variables are abstractions; they may "feel" like directly changing the "memory" of the machine one runs the program on, but they typically already offer quite some level of abstraction. Ultimately, from the perspective of the compiler and machine code, one has to operate with addresses, and perhaps the value is stored temporarily in registers. Not only do variables have symbolic names chosen by the programmer, they are also organized in scopes; they may be local or global, etc. Variables may be formal parameters of a procedure. All those are very convenient abstractions, which need to be realized (by the compiler) by managing the memory properly. Each variable access must ultimately be translated into perhaps a sequence of machine instructions, which ultimately access the current corresponding location which holds the value of the variable. All that is invisible to the programmer, who thinks in terms of variables and has an intuitive feeling for scopes and the locality of a variable, like: "x is a variable local to procedure p". Of course, if p is called multiple times, perhaps recursively, there are multiple instances of x to be managed at run-time. The corresponding arrangements realized by the compiler are called the run-time environment. Also, parameter passing needs to be arranged by the compiler, since at the lowest level of machine code there are no such things as variables or "passing them"; it's just sequences of cleverly designed machine instructions that realize parameter passing, scoping, etc.

So, the non-negotiable correctness requirement for a compiler basically means to maintain those abstractions: the programmer thinks in terms of parameter passing, where the formal parameters are "replaced" by the actual parameters, but this is broken down into perhaps many individual, very small steps, perhaps even shuffling around values in registers etc., which behave, when thinking at a higher level of abstraction, like parameter passing.

Besides correctness of the generated code, there is the question of how efficiently the generated code maintains the abstractions. Optimization addresses efficiency, of course without compromising correctness. Optimization can be done in various phases of the compiler, and also repeatedly. We don't go too much into such issues in this introduction. The examples on the slide illustrate different versions of a code snippet, some presumably more efficient than others (thus "optimized"). The word "optimizing" is anyway a bit of a misnomer, as a compiler that guarantees genuinely optimal code is unobtainable (even if one could agree on criteria to measure the quality). Besides that, there are influences outside the control of the compiler which influence the efficiency of the result. The examples shown here are at the level of source code, but often similar "optimizations" are done (also) on lower levels, for instance at a so-called intermediate code level or at machine code level (or both). The improvements illustrated on the slides here can be made systematic with techniques called data-flow analyses. We don't do too much there, but we will cover one important such analysis called liveness analysis.

Optimization at source-code level

AST after constant folding:

assign-expr
  subscript-expr
    identifier "a"
    identifier "index"
  number "6"

Three source-level variants of the same assignment:

1   t = 4+2;
    a[index] = t;

2   t = 6;
    a[index] = t;

3   a[index] = 6;

The code examples show 3 different "variants" of semantically the same program. The optimizations are not very radical or complicated, but doing corresponding steps in more complex situations can be challenging. For instance, in the steps here, it's not always so trivial to figure out that a value or variable is actually constant (in the example it's obvious).

The lecture will not dive too much into optimizations. The ones illustrated here are known as constant folding and constant propagation. Optimizations can be done (and actually are done) in various phases of the compiler. Here we said optimization at "source-code level", and what is typically meant by that is optimization on the abstract syntax tree (presumably the AST after type checking and some semantic analysis). The AST is considered so close to the actual input that one still considers it as "source code", and no one seriously tries to optimize code at the input-string level. If the compiler "massages" the input, it's mostly not seen as optimization; it's rather (re-)formatting. There are indeed formatting tools that assist the user in keeping the program in a certain "standardized" format (standard indentation, newlines placed appropriately, etc.).

Concerning optimization, what is also typical is that there are many different optimizations building upon each other. First, optimization A is done; then, taking the result, optimization B, etc. Sometimes even doing A again, and then B again, etc.
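As a small illustration of what constant folding and constant propagation amount to, here is a minimal sketch in C over a made-up, three-address-like straight-line representation. The instruction format and the second instruction are invented for this example; the real intermediate representations are discussed in Chapter 9.

#include <stdio.h>
#include <string.h>

/* A toy straight-line program: each instruction is  dst = lhs op rhs,
   where an operand is either a literal constant or the name of an
   earlier destination.                                               */
typedef struct { const char *dst, *lhs, *rhs; char op; } Instr;

/* Is the operand's value known at compile time? If yes, store it in *out. */
static int known_value(const char *operand, const char *names[], int vals[],
                       int n, int *out) {
    for (int i = 0; i < n; i++)
        if (strcmp(operand, names[i]) == 0) { *out = vals[i]; return 1; }
    return sscanf(operand, "%d", out) == 1;   /* operand written as a literal */
}

int main(void) {
    /* t = 4 + 2;  u = t * 3;   (the first line is variant 1 from the slide) */
    Instr prog[] = { { "t", "4", "2", '+' }, { "u", "t", "3", '*' } };
    const char *names[8]; int vals[8]; int nknown = 0;

    for (int i = 0; i < 2; i++) {
        int a, b;
        if (known_value(prog[i].lhs, names, vals, nknown, &a) &&
            known_value(prog[i].rhs, names, vals, nknown, &b)) {
            /* both operands known: fold the operation at compile time and
               remember the result, so that later uses can be propagated   */
            int r = (prog[i].op == '+') ? a + b : a * b;
            names[nknown] = prog[i].dst; vals[nknown] = r; nknown++;
            printf("%s = %d\n", prog[i].dst, r);
        } else {
            printf("%s = %s %c %s\n",
                   prog[i].dst, prog[i].lhs, prog[i].op, prog[i].rhs);
        }
    }
    return 0;   /* output:  t = 6   and   u = 18  */
}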

Code generation & optimization

MOV  R0, index   ;; value of index -> R0
MUL  R0, 2       ;; double value of R0
MOV  R1, &a      ;; address of a -> R1
ADD  R1, R0      ;; add R0 to R1
MOV  *R1, 6      ;; const 6 -> address in R1

MOV  R0, index   ;; value of index -> R0
SHL  R0          ;; double value in R0
MOV  &a[R0], 6   ;; const 6 -> address a+R0

• many optimizations possible
• potentially difficult to automate [1], based on a formal description of language and machine
• platform dependent

For now it's not too important what the code snippets do. It should be said, though, that it's not a priori always clear in which way a transformation such as the one shown is an improvement. One transformation that most probably is an improvement is the "shift left" for doubling. Another one is that the program is shorter. Program size is something that one might like to "optimize" in itself. Also: ultimately, each machine operation needs to be loaded into the processor (and that costs time in itself). Note, however, that it's generally not the case that "one assembler line costs one unit of time". Especially the last line in the second program could cost more than other, simpler operations. In general, operations on registers are quite a bit faster than those referring to main memory. In order to make a meaningful statement about the effect of a program transformation, one would need a "cost model" taking register access vs. memory access and other aspects into account.

[1] Not that one has much of a choice. Difficult or not, no one wants to optimize generated machine code by hand . . .


Anatomy of a compiler (2)

Misc. notions

• front-end vs. back-end, analysis vs. synthesis
• separate compilation
• how to handle errors?
• "data" handling and management at run-time (static, stack, heap), garbage collection?
• language can be compiled in one pass?
  – E.g. C and Pascal: declarations must precede use
  – no longer too crucial, enough memory available
• compiler assisting tools and infrastructure, e.g.
  – debuggers
  – profiling
  – project management, editors
  – build support
  – . . .

Compiler vs. interpreter

compilation

• classical: source ⇒ machine code for given machine
• different "forms" of machine code (for 1 machine):

– executable ⇔ relocatable ⇔ textual assembler code


full interpretation

• directly executed from program code/syntax tree
• often for command languages, interacting with the OS, etc.
• speed typically 10–100 times slower than compilation

compilation to intermediate code which is interpreted

• used in e.g. Java, Smalltalk, . . .
• intermediate code: designed for efficient execution (byte code in Java)
• executed on a simple interpreter (JVM in Java)
• typically 3–30 times slower than direct compilation
• in Java: byte code ⇒ machine code in a just-in-time manner (JIT)

More recent compiler technologies

• Memory has become cheap (thus comparatively large)
  – keep whole program in main memory while compiling
• OO has become rather popular
  – special challenges & optimizations
• Java
  – "compiler" generates byte code
  – part of the program can be dynamically loaded during run-time
• concurrency, multi-core
• virtualization
• graphical languages (UML, etc.), "meta-models" besides grammars

1.3 Bootstrapping and cross-compilation

Let's just glance over this section; we will not discuss it much in class. It's not part of the pensum for the written exam (and also not for the oral), but it may be interesting. Bootstrapping refers to a process of "building something out of nothing", like in the tale of the guy who used his own bootstraps to pull himself out of a swamp. Of course one has to "start somewhere": dragging oneself out of the swamp by one's own bootstraps, without some place to stand on, is possible in some funny tale only. Bootstrapping is also the origin of the term "to boot", which refers to firing up a computer system by starting its OS. That's a multi-stage process which gradually "escalates" from the hardware, the master boot record, the boot loader, etc., until the whole OS is up and running.

For writing a compiler, one faces (or maybe historically faced) the task: how can I write a compiler from scratch? Well, one can of course implement the whole thing in assembler; the hardware certainly has some instruction set, and one can use that to implement the desired compiler. That's a tough call; one would rather avoid using assembler (except perhaps for carefully selected tiny subtasks) and make use of a high-level language, with all its abstractions and other infrastructure, like libraries, editors, configuration and version management, etc., and perhaps even textbooks, tutorials, and training.

If such a language is not around, well, that's the chicken-and-egg problem of bootstrapping a compiler: if one had a compiler executable, one could (more) easily write the compiler program and compile that source code to an executable compiler.

Nowadays, the problem is perhaps not so pressing insofar as there are enough high-level languages around. Assuming one is happy with C as a high-level language and intends to invent C++, one can write the first C++ compiler in C, of course, and that's easier, at least compared to writing it in assembler.

But there had been a time, before the era of the PC and the mass market for electronic computers, when ordering a computer meant ordering a cabinet of hardware, with an instruction set (and no internet to quickly download something useful). Perhaps the hardware came with some operating system, but maybe it was rather rudimentary compared to the modern situation, barely able to process "jobs" (by reading punchcards) and to control other peripherals.

In such a situation, perhaps there is no compiler for a high-level language available. That's where bootstrapping for compilers comes in: instead of writing the production-quality compiler from scratch in assembler (which is too tough), and instead of writing the compiler for the newly designed language in the language itself (which makes no sense), one goes gradually. One starts with a simple version of some relevant aspects of the planned language, not optimized, etc., until that rudimentary compiler exists. One then starts writing, in that new language, better or more comprehensive versions of the compiler, etc., until one has a decent, stable version that is strong enough to compile itself without much reliance on assembler as "source code".

Historically, the development of the language C went hand-in-hand with the development of Unix, insofar as it was a larger "bootstrapping problem": how to develop a (at that time) modern operating system together with a compiler that can compile C programs and can compile the operating system itself, on which the C programs then run . . . Note that the fact that the OS is written in a high-level language (for the most part) is enormously important, as it allows portability. If any OS had to be written totally from scratch, there would be no portability across different hardware platforms. And, as with compilers and languages, there had been a time when that was basically the norm.

The compilation process is here illustrated with so-called T-diagrams, which are a "graphical" representation of the compilation process, mentioning the input language, the output language, and the language in which the compiler itself is written as the three arms of the "T".

Compiling from source to target on host

“tombstone diagrams” (or T-diagrams). . . .


Two ways to compose “T-diagrams”


Using an "old" language and its compiler to write a compiler for a "new" one

Pulling oneself up on one’s own bootstraps

bootstrap (verb, trans.): to promote or develop . . . with little or no assistance

— Merriam-Webster


There is no magic here. The first thing is: the "Q&D" (quick-and-dirty) compiler in the diagram is said to be in machine code. If we want to run that compiler as an executable (as opposed to it being interpreted, which is ok too), of course we need machine code, but it does not mean that we have to write that Q&D compiler in machine code. Of course we can use the approach explained before, namely to use an existing language with an existing compiler to create that machine-code version of the Q&D compiler.

Furthermore: when talking about the efficiency of a compiler, we mean (at least here) exactly that: it's the compilation process itself which is inefficient! As far as efficiency goes, on the one hand the compilation process can be efficient or not, and on the other hand the generated code can be (on average and given competent programmers) efficient or not. Both aspects are not independent, though: to generate very efficient code, a compiler might use many and aggressive optimizations. Those may produce efficient code but cost time to do. At the first stage, we don't care how long it takes to compile, and also not how efficient the code it produces is! Note that the code it produces is a compiler; it's actually a second version of the "same" compiler, namely from the new language A to H, running on H. We don't care how efficient that generated code, i.e., that compiler, is, because we use it just in the next step, to generate the final version of the compiler (or perhaps one step further towards the final compiler).

Bootstrapping 2


Porting & cross compilation

The situation is that K is a new "platform" and we want to get a compiler for our new language A for K (assuming we already have one for the old platform H). It means not only that we want to compile onto K, but also, of course, that the compiler has to run on K. These are two requirements: (1) a compiler to K and (2) a compiler that runs on K. That leads to two stages.

In a first stage, we "rewrite" our compiler for A, targeted towards H, to target the new platform K. If structured properly, this will "only" require porting or re-targeting the so-called back-end from the old platform to the new platform. If we have done that, we can use our executable compiler on H to generate code for the new platform K. That's known as cross-compilation: use platform H to generate code for platform K.

But now that we have a (so-called cross-)compiler from A to K, running on the old platform H, we use it to compile the retargeted compiler again!


2 Scanning

Learning Targets of this Chapter
1. alphabets, languages
2. regular expressions
3. finite state automata / recognizers
4. connection between the two concepts
5. minimization

The material corresponds roughly to [6, Section 2.1–2.5] or a large part of [9, Chapter 2]. The material is pretty canonical, anyway.

Chapter contents

2.1 Introduction
2.2 Regular expressions
2.3 DFA
2.4 Implementation of DFAs
2.5 NFA
2.6 From regular expressions to NFAs (Thompson's construction)
2.7 Determinization
2.8 Minimization
2.9 Scanner implementations and scanner generation tools

2.1 Introduction

The scanner or lexer is the first phase of a typical compiler (leaving out preprocessing, which is rather seen as something that happens "before" the compiler does its job). What a lexer does is also called lexical analysis: basically chopping up the input string into smaller units (so-called lexemes), classifying them according to the lexical rules of the language one implements, and handing over the results of that chopping-up-and-classification to the parser as a stream of so-called tokens. The theory underlying lexers is that of regular languages. Typically, the lexical aspects of a language are specified using some variant of regular expressions. The lexer program then has to realize that specification, in that it is able to read in the source program (it scans it), check it for compliance with the specification and, at the same time, do the chopping-and-classification task mentioned (it tokenizes the input string). Checking for compliance with regular expressions is done via finite-state machines. Finite-state machines are equivalent to regular expressions insofar as they can describe the same languages. Here, a language is meant as a set of sequences of characters over an alphabet. Regular expressions are declarative in nature (hence more useful for specification), whereas finite state automata are more operational in nature, hence used in implementing a scanner. We discuss how to translate regular expressions to automata. The reverse translation is also possible (and easy), but we don't discuss it, as it's not needed for a compiler. The lex tool actually does just that: the user specifies the lexical aspects of the language to compile, and lex generates from that the lexer for that language, based on the theory of regular expressions and finite state automata. Actually, tools like lex do a bit more, which mostly has to do with support for generating tokens and for interfacing properly with the parser. Parsing will be covered in subsequent chapters.
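To give a concrete first taste of that connection, here is a small, hand-coded sketch in C of a table-driven DFA that accepts identifiers of the form letter(letter|digit)*. It is only an illustration, with made-up state numbers and character classes; it is not the code that lex would generate.

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = inside an identifier (accepting), 2 = error/stuck.
   Input classes: 0 = letter, 1 = digit, 2 = anything else.                  */
static const int delta[3][3] = {
    /* letter digit other */
    {  1,     2,    2 },   /* from start */
    {  1,     1,    2 },   /* from ident */
    {  2,     2,    2 },   /* from error */
};

static int char_class(char c) {
    if (isalpha((unsigned char)c)) return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Returns 1 iff the whole string matches letter(letter|digit)*. */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = delta[state][char_class(*s)];
    return state == 1;       /* state 1 is the only accepting state */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("index"),    /* 1 */
                         is_identifier("abc123"),   /* 1 */
                         is_identifier("123abc"));  /* 0 */
    return 0;
}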

Scanner section overview

What’s a scanner?

• Input: source code.
• Output: sequential stream of tokens

• regular expressions to describe various token classes
• (deterministic/non-deterministic) finite-state automata (FSA, DFA, NFA)
• implementation of FSA
• regular expressions → NFA
• NFA ↔ DFA

We said the input of a scanner is the "source code". That's a bit unspecific. It's often a "character stream" or a "string" (of characters). Practically, the argument of a scanner is often a file name or an input stream or similar. Or the scanner in its basic form takes a character stream, but it "alternatively" also accepts a file name as argument (or even a URL). In that case, of course, the string of the file name is not scanned as source code, but it's used to access the corresponding file, whose content is then read in in the form of a "string" or whatever.

What’s a scanner?

• other names: lexical scanner, lexer, tokenizer

A scanner’s functionality

Part of a compiler that takes the source code as input and translates this stream of characters into a stream of tokens.

• char's typically language independent.
• tokens already language-specific.
• works always "left-to-right", producing one single token after the other, as it scans the input
• it "segments" the char stream into "chunks" while at the same time "classifying" those pieces ⇒ tokens


Characters are typically language-independent, but perhaps the encoding (or its interpretation) may vary, like ASCII, UTF-8, also Windows-vs.-Unix-vs.-Mac newlines, etc. In contrast, tokens are already language dependent, in particular specific for the grammar used to describe the language. There are, however, large commonalities across many languages. Many languages support, for instance, strings and integers, and consequently it's plausible that the grammar will make use of corresponding tokens (perhaps called INT and STRING; the names are arbitrary, like variable names, but it is a good idea to call the token representing strings STRING or similar . . . ). Tokens are not just specific wrt. the language being implemented. They also show up in the implementation, i.e., they are specific to the meta-language used to implement the compiler. In this lecture, we will use a parser and lexer generating tool (a variant of lex and yacc), and the representation of the tokens is also specific to the chosen tool.

The slides also mention that scanning works from left to right. That is not a theoretical necessity, but that's how humans also consume or "scan" a source-code text. At least those humans trained in e.g. Western languages.

Typical responsibilities of a scanner

• segment & classify char stream into tokens
• typically described by "rules" (and regular expressions)
• typical language aspects covered by the scanner
  – describing reserved words or key words
  – describing format of identifiers (= "strings" representing variables, classes . . . )
  – comments (for instance, between // and NEWLINE)
  – white space
    ∗ to segment into tokens, a scanner typically "jumps over" white spaces and afterwards starts to determine a new token
    ∗ not only the "blank" character, also TAB, NEWLINE, etc.
• lexical rules: often (explicit or implicit) priorities
  – identifier or keyword? ⇒ keyword
  – take the longest possible scan that yields a valid token.

“Scanner = regular expressions (+ priorities)”

Rule of thumb

Everything about the source code which is so simple that it can be captured by reg. expressions belongs into the scanner.


How does scanning roughly work?

[Figure: a "finite control" with states q0, q1, q2, q3, . . . , qn and a reading "head" that moves left-to-right over the input  . . . a [ i n d e x ] = 4 + 2 . . . ]

How does scanning roughly work?

• usual invariant in such pictures (by convention): arrow or head points to the first character to be read next (and thus after the last character having been scanned/read last)
• in the scanner program or procedure:
  – analogous invariant, the arrow corresponds to a specific variable
  – contains/points to the next character to be read
  – name of the variable depends on the scanner/scanner tool
• the head in the pic: for illustration, the scanner does not really have a "reading head"

The picture of a reading head may be reminiscent of the typical picture illustrating Turing machines (which is not a coincidence). But a "reading head" is not just a theoretical construct. In the old times, program data may have been stored on and read from magnetic tape. Very deep down, if one still has a magnetic disk as opposed to an SSD, the secondary storage still has "magnetic heads", only that the compiler typically does not scan or parse directly char by char from disk . . .

The bad(?) old times: Fortran

• in the days of the pioneers

• main memory was smaaaaaaaaaall
• compiler technology was not well-developed (or not at all)
• programming was for very few "experts". [1]

[1] There was no computer science as a profession or university curriculum.


• Fortran was considered high-level (wow, a language so complex that you had to compile it . . . )

(Slightly weird) lexical aspects of Fortran

Lexical aspects = those dealt with by a scanner

• whitespace without "meaning":
  I F( X 2. EQ. 0) TH E N     vs.     IF ( X2. EQ.0 ) THEN
• no reserved words!
  IF (IF.EQ.0) THEN THEN=1.0
• general obscurity tolerated:
  DO99I=1,10     vs.     DO99I=1.10

  DO␣99␣I=1,10
  -
  -
  99␣CONTINUE

We have a look at Fortran to get a feeling for "alternative" ways of dealing with the lexical aspects of a language (no longer supported). It's in a way like in the super-old days when there was no "white space" in writing (for instance in ancient Latin) and it entered manuscripts (in Latin or the emerging Western European languages) only slowly. It proved helpful for human readers, of course.

Over time, programming languages also adopted a more "helpful" treatment of lexical aspects. Remember that one core task of scanning is segmenting the input, and white space can help there. If one treats white space as "basically not there" and thus absolutely meaningless, one does human readability a big disfavor: humans are used to white space, since the times of no-white-space texts are long gone. One reason why Fortran initially treated white space like that was perhaps that it may have been the easiest thing to do: if the scanner reads a white space, do nothing and proceed. Or perhaps the motivation was to allow "compact programs": it allowed the expert programmers to write programs without wasting precious memory on "white space" in the source code. Note that in the conventional interpretation of white space nowadays, white space does not exactly represent "nothing" in the sense that one could put it in or leave it out without changing the meaning. White space has no meaning by itself, but it terminates the preceding non-white-space.

That treatment is so conventional that most compilers use more or less the same definition of "white space", though there is typically not only one "white space" character. There are tabs, spaces, and then there are different "end-of-line" representations (carriage return, end-of-line, newline). In a way, things like the "carriage return" CR and the "tabulation command" are anyway hold-overs from the mechanical and electrical typewriter era: at the beginning, the input and output peripheral devices connected to a computer were not just trying to behave like typewriters in software, they actually were a sort of typewriter (and built in many cases by IBM, at any rate . . . ), and one needed some encoding of files to drive them or read input from them: the encoding for the "bell character" might actually result in banging on a small copper bell . . . Standard devices like tty are likewise a remembrance of real hardware teletype (typewriter-style) terminals. Since those codes are part of the ASCII code (from the 60ies), those "characters" or "symbols" are here to stay . . .

There are other "unconventional" ways to deal with white space. For instance, one could make the decision that "indentation" (via "tabbing" or otherwise) has a meaning, as opposed to being just another example of whitespace. Python is an example of a language where indentation is "meaningful" and the lexer (and parser) must be aware of that.

Proper and improper indentation is sometimes also laid down in style guides. That's more like a recommendation for programmers to follow for writing "pretty programs". Improper indentation would make no semantic difference for the compilation (as it would in Python); it's just frowned upon as bad taste. Perhaps the compiler would utter some criticism or warning about the ugliness of the program. A lexer that is to support checking of such stylistic guidelines and proper formatting would have to distinguish between different things commonly just treated as whitespace. That's not surprising, as those tools focus on how nice the program looks (to the user). There are formatting tools (for instance gofmt for Go) that transform a program into a nicely written one that follows the stylistic guidelines.

The lecture will neither be concerned with the stone-age treatment of white space as in old Fortran, nor with the more elaborate ways just discussed.

Fortran scanning: remarks

• Fortran (of course) has evolved from the pioneer days . . .
• no keywords: nowadays mostly seen as bad idea
• treatment of white-space as in Fortran: not done anymore: THEN and TH EN are different things in all languages
• however: both considered "the same":

  if␣b␣then␣..

  if␣␣␣b␣␣␣␣then␣..

• since concepts/tools (and much memory) were missing, Fortran scanner and parser (and compiler) were
  – quite simplistic
  – syntax: designed to "help" the lexer (and other phases)

The treatment of white space is mostly a question of language pragmatics. Pragmatics deals with non-formal questions like “what’s helpful for humans that program in a language”, like what’s “user-friendly syntax”. Lexers/parsers would have no problem using while as a variable, but humans tend to have one. Pragmatics for instance also deals with questions like “how verbose should a language design be”, how much syntactic sugar it should offer, etc. There is mostly no commonly agreed best answer; it also depends on what kind of user one targets and, to a large part, on the personal taste and education or experience of the programmer.

Sometimes, the part of a lexer/parser which removes white space (and comments) is considered a separate component and then called a screener. It’s not a very common terminology, though.

A scanner classifies

• “good” classification: depends also on later phases, may not be clear till later

Rule of thumb

Things being treated equal in the syntactic analysis (= parser, i.e., subsequent phase) should be put into the same category.

• terminology not 100% uniform, but most would agree:

Lexemes and tokens

Lexemes are the “chunks” (pieces) the scanner produces from segmenting the input source code (and typically dropping whitespace). Tokens are the result of classifying those lexemes.

• token = token name × token value

A scanner classifies & does a bit more

• token data structure in OO settings
  – tokens themselves defined by classes (i.e., as instances of a class representing a specific token)
  – token values: as attributes (instance variables)
• often: scanner does slightly more than just classification
  – store names in some table and store a corresponding index as attribute
  – store text constants in some table, and store corresponding index as attribute
  – even: calculate numeric constants and store value as attribute


One possible classification

name/identifier                   abc123
integer constant                  42
real number constant              3.14E3
text constant, string literal     "this is a text constant"
arithmetic op’s                   + - * /
boolean/logical op’s              and or not   (alternatively /\ \/ )
relational symbols                <= < >= > = == !=

all other tokens: { } ( ) [ ] , ; := . etc.; every one in its own group

• this classification: not the only possible (and not necessarily complete)
• note: overlap:
  – "." is here a token, but also part of real number constant
  – "<" is part of "<="

The remark about the “overlap” refers to an aspect of the lexical analysis that was referred to earlier as the scanner having to deal with priorities. If one has some sequence <= (and the classification from the slide), then, without further elaboration, the < part of the string could be seen as representing the relation “less” and the subsequent symbol as equality. Whether or not that makes sense is not for the scanner to decide; the scanner just chops up the string into pieces, and then the subsequent parser may complain that it’s syntactically not allowed to have two relation symbols side by side (or not complain, depending on the grammar).

In that particular situation, language pragmatics would suggest that <= is not chopped up but treated as one chunk, i.e., one lexeme, namely representing the relation less-or-equal.

The same principle also applies to other entries in the classification. For instance, abc123 is intended in most languages as one identifier, not as abc followed by the number 123 (or, even more weirdly, as three identifiers followed by three separate digits).

So, the “priority” is: prefer longer lexemes over shorter ones (with white space as a possible “terminator”, unlike in old Fortran, as discussed earlier).
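This “longest lexeme wins” rule can be sketched in a few lines of code. The following is only an illustration of the priority rule, not how a generated scanner works internally (those use automata, see later); all class and pattern names are invented for the example.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustration of the "longest lexeme" priority, using java.util.regex.
    public class MaximalMunch {
        public static void main(String[] args) {
            Map<String, Pattern> classes = new LinkedHashMap<>();
            classes.put("LEQ",   Pattern.compile("<="));
            classes.put("LT",    Pattern.compile("<"));
            classes.put("IDENT", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));

            String input = "<=abc123";
            int pos = 0;
            while (pos < input.length()) {
                String bestClass = null;
                int bestLen = 0;
                for (Map.Entry<String, Pattern> e : classes.entrySet()) {
                    Matcher m = e.getValue().matcher(input).region(pos, input.length());
                    // lookingAt: the match must start at the beginning of the region
                    if (m.lookingAt() && m.end() - pos > bestLen) {
                        bestClass = e.getKey();
                        bestLen = m.end() - pos;
                    }
                }
                if (bestClass == null) throw new RuntimeException("lexical error at " + pos);
                System.out.println(bestClass + " \"" + input.substring(pos, pos + bestLen) + "\"");
                pos += bestLen;
            }
        }
    }

On the input "<=abc123" this prints LEQ "<=" and IDENT "abc123": the longer lexeme <= beats the shorter <, and abc123 is one identifier, as discussed above.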

One way to represent tokens in C

typedef struct {
    TokenType tokenval;
    char     *stringval;
    int       numval;
} TokenRecord;

If one only wants to store one attribute:


typedef struct {
    TokenType tokenval;
    union {
        char *stringval;
        int   numval;
    } attribute;
} TokenRecord;

The second version makes use of so-called union types in C. The “union type” of C has some deficiencies (as some nowadays would say). We will look briefly at union types and others in a later chapter about type checking. In any case, the concrete C implementation here is not very relevant (maybe not even recommended . . . ).

One can do it analogously in Java using classes. When discussing the Oblig, we will give hints on how to do it and how it’s concretely represented in the version of lex/yacc we suggest to use.
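As a minimal sketch of such an OO token representation (the enum values, class and field names here are invented for illustration; the tool used for the oblig will dictate its own representation):

    // Sketch of a token as a class: a token class (name) plus an optional value.
    enum TokenType { IDENT, INT_CONST, STRING_CONST, LEQ, LT, ASSIGN, LPAREN, RPAREN }

    class Token {
        final TokenType type;   // the token class
        final String lexeme;    // the underlying lexeme (chunk of the input)
        final Object value;     // attribute: e.g. an Integer for INT_CONST, or an
                                // index into a name/string table for IDENT

        Token(TokenType type, String lexeme, Object value) {
            this.type = type;
            this.lexeme = lexeme;
            this.value = value;
        }

        @Override
        public String toString() {
            return type + "(" + lexeme + ")";
        }
    }

    // e.g.: new Token(TokenType.INT_CONST, "42", 42)
    //       new Token(TokenType.IDENT, "abc123", /* symbol-table index */ 17)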

How to define lexical analysis and implement a scanner?

• even for complex languages: lexical analysis (in principle) not hard to do
• “manual” implementation straightforwardly possible
• specification (e.g., of different token classes) may be given in “prose”
• however: there are straightforward formalisms and efficient, rock-solid tools available:
  – easier to specify unambiguously
  – easier to communicate the lexical definitions to others
  – easier to change and maintain
• such tools, often called parser generators, typically generate not just a scanner, but code for the next phase (the parser), as well.

Prose specification

A precise prose specification is not so easy to achieve as one might think. For ASCII source code or input, things are basically under control. But what if dealing with Unicode? Checking “legality” of user input to avoid SQL injections or similar format-string attacks can involve lexical analysis/scanning. If you “specify” in English: “Backslash is a control character and forbidden as user input”, which characters (besides char 92 in ASCII) in Chinese Unicode actually represent other versions of backslash? Note: unclarities about “what’s a backslash” have been used for security attacks. Remember that “the” backslash character in OSs often has a special status; for instance it cannot be part of a file name but is used as separator between file names, denoting a path in the file system. If one can “smuggle in” an unofficial (“Chinese”) backslash into a file name, one can potentially access parts of the file directory tree in some OS which are supposed to be inaccessible. Attacks like that have been used.


Parser generator

The most famous pair of lexer+parser tools is called “compiler compiler” (lex/yacc = “yet another compiler compiler”) since it generates (or “compiles”) an important part of the front end of a compiler, the lexer+parser. lex/yacc originate from C; there are also GNU versions around (called flex and bison). Those kinds of tools are seldom called compiler compilers any longer. Many other languages ship with a corresponding pair of tools. In the lecture, someone from the audience mentioned Alex & Happy (for Haskell); OCaml has ocamllex/ocamlyacc, and similarly for other ML versions and other languages. Java, for some reason, does not ship with such a pair of tools; they exist though, for instance JLex and CUP.

Those tools are all based on the same principles, they work roughly similarly, and they can generate parsers for the same class of languages (language in the theoretical meaning of sets of words over an alphabet): the lexers cover some (extended) form of regular expressions and the parser does some form of bottom-up parsing, known as LALR(1) parsing. This form of parser/lexer generator inspired by lex/yacc is the bread-and-butter, standard version. The overview at Wikipedia of different such tools is pretty long.

Sample prose spec

The following is an excerpt from this year’s oblig concerning the lexical conventions for Compila 20. Actually, the lexical part does not really change over the years, or only in minor aspects. Anyway, it’s an example of a prose specification, and the oblig involves writing a lexer for that. Since the lexer is supposed to be based on a lex-style tool, this means one has to capture the prose text in the regular-language format of the chosen tool (for instance JLex).


2.2 Regular expressions

Regular expressions are a very well-known concept and, with variations, used in different applications: inside compilers, in editors, as system tools and utilities, for specifying search patterns, and many more. Many programming languages, notably “scripting” languages, offer extensive support for working with regular expressions and extended regular expressions. Besides that, they have been studied theoretically and there are also important generalizations (which are outside the scope of the lecture). There are also “practical” variations. In the lecture, we start by focusing on the “classic, vanilla core regular expressions”. They capture the relevant parts. For usability, one often likes to offer extra syntax that makes the use of regular expressions more convenient. The lex-style tools tend to do that. Adding more constructs to a language for convenience, without really extending the expressivity of the language, is sometimes called “syntactic sugar”. Other practical extensions of regular expressions may fall outside that classification: one really likes to add expressivity. Those extensions are sometimes called “extended regular expressions”, and those may still keep central aspects of regular expressions but, being actually more expressive, fall outside the formalism that captures regular languages. We will look at some abbreviations that fall into the “syntactic sugar” category, but won’t venture into genuine extensions (which are technically no longer pure regular expressions, as said). Practically, for the oblig, one has to cope with the concrete syntax and possibilities of the chosen lexer generator, for instance JLex.


General concept: How to generate a scanner?

1. regular expressions to describe the language’s lexical aspects
   • like whitespace, comments, keywords, format of identifiers etc.
   • often: more “user friendly” variants of reg-exprs are supported to specify that phase
2. classify the lexemes to tokens
3. translate the reg-expressions ⇒ NFA
4. turn the NFA into a deterministic FSA (= DFA)
5. the DFA can straightforwardly be implemented

• these steps are done automatically by a “lexer generator”
• lexer generators help also in other user-friendly ways of specifying the lexer: defining priorities, assuring that the longest possible lexeme is tokenized

A lexer generator may even prepare useful error messages if scanning (not scanner generation) fails, i.e., when running the scanner on a lexically illegal program. Of course, if the scanner generation itself fails, meaningful error messages giving reasons for the failure are welcome there as well. A final source of error could be: the scanner generation produces a scanner, which is a Java/C/whatever program, and that one is incorrect, starting from being syntactically incorrect or ill-typed.

The classification in step 2 is actually not directly covered by the classical results stating that reg-expr = DFA = NFA; it’s something extra. The classical constructions presented here are used to recognise (or reject) words. As a “side effect”, in an actual implementation, the “class” of the word needs to be given back as well, i.e., the corresponding token needs to be constructed and handed over (step by step) to the next compiler phase, the parser.

Use of regular expressions

• regular languages: fundamental class of “languages”
• regular expressions: standard way to describe regular languages
• not just used in compilers
• often used for flexible “searching”: simple form of pattern matching
• e.g. input to search engine interfaces
• also supported by many editors and text processing or scripting languages (starting from classical ones like awk or sed)
• but also tools like grep or find (or general “globbing” in shells)

find . -name "*.tex"

• often extended regular expressions, for user-friendliness, not theoretical expressiveness

As for the origin of regular expressions: one starting point is Kleene [8], and there had been earlier works outside “computer science”.

Kleene was a famous mathematician and influence on theoretical computer science. Funnily enough, regular languages came up in the context of neuro/brain science! See the following link for the origin of the terminology. Perhaps in the early years, people liked to draw connections between biology and machines and used metaphors like “electronic brain”, artificial intelligence, etc. Oh, wait, AI we still have (and the word and discipline dates back to the 50ies . . . )

Alphabets and languages

Definition 2.2.1 (Alphabet Σ). Finite set of elements called “letters” or “symbols” or “characters”.

Definition 2.2.2 (Words and languages over Σ). Given alphabet Σ, a word over Σ is a finite sequence of letters from Σ. A language over alphabet Σ is a set of finite words over Σ.

• practical examples of alphabets: ASCII, Norwegian letters (capital and non-capitals), etc.

In this lecture, we avoid the terminology “symbols” for now, as later we deal with e.g. symbol tables, where symbol means something slightly different (at least: at a different level). Sometimes, the Σ is left “implicit” (assumed to be understood from the context).

Remark: Symbols in a symbol table (see later)

In a certain way, symbols in a symbol table can be seen as similar to symbols in the way they are handled by automata or regular expressions now. They are simply “atomic” (not further dividable) members of what one calls an alphabet. On the other hand, in practical terms inside a compiler, the symbols here in the scanner chapter live on a different level compared to symbols encountered in later sections, for instance when discussing symbol tables. Typically here, they are characters, i.e., the alphabet is a so-called character set, like for instance ASCII. The lexer, as stated, segments and classifies the sequence of characters and hands over the result of that process to the parser. The result is a sequence of tokens, which is what the parser has to deal with later. It’s on that parser level that the pieces (notably the identifiers) can be treated as atomic pieces of some language, and what is known as the symbol table typically operates on symbols at that level, not at the level of individual characters.

Languages

• note: Σ is finite, and words are of finite length
• languages: in general infinite sets of words
• simple examples: Assume Σ = {a, b}
• words as finite “sequences” of letters
  – ε: the empty word (= empty sequence)
  – ab means “first a then b”
• sample languages over Σ are
  1. {} (also written as ∅), the empty set
  2. {a, b, ab}: language with 3 finite words
  3. {ε} (≠ ∅)
  4. {ε, a, aa, aaa, . . .}: infinite language, all words using only a’s
  5. {ε, a, ab, aba, abab, . . .}: alternating a’s and b’s
  6. {ab, bbab, aaaaa, bbabbabab, aabb, . . .}: ?????

Remark 1 (Words and strings). In terms of a real implementation: often, the letters are of type character (like type char or char32 . . . ); words then are “sequences” (say arrays) of characters, which may or may not be identical to elements of type string, depending on the language for implementing the compiler. In a more conceptual part like here we do not write words in “string notation” (like "ab"), since we are dealing abstractly with sequences of letters, which, as said, may not actually be strings in the implementation. Also in the more conceptual parts, it’s often good enough to handle alphabets with 2 letters only, like Σ = {a, b} (with one letter, it gets unrealistically trivial and results may not carry over to many-letter alphabets). But 2 letters are often enough to illustrate the concepts; after all, computers use only 2 different bits as well . . . .

Finite and infinite words

There are important applications dealing with infinite words as well, or also with infinite alphabets. For traditional scanners, one is mostly happy with finite Σ’s and especially sees no use in scanning infinite “words”. Of course, some character sets, while not actually infinite, are large or extendable (like Unicode or UTF-8).

Sample alphabets

Often we operate for illustration on alphabets of size 2, like {a, b}. One-letter alphabets are uninteresting, let alone 0-letter alphabets. 3-letter alphabets may not add much as far as “theoretical” questions are concerned. That may be compared with the fact that computers ultimately operate on words over two different “bits”.

How to describe languages

• language mostly here in the abstract sense just defined
• the “dot-dot-dot” (. . . ) is not a good way to describe to a computer (and to many humans) what is meant (what was meant in the last example?)
• enumerating explicitly all allowed words for an infinite language does not work either

Needed

A finite way of describing infinite languages (which is hopefully efficiently implementable & easily readable)

Is it a priori to be expected that all infinite languages can even be captured in a finite manner?


• small metaphor:

  2.727272727 . . .        3.1415926 . . .        (2.1)

Remark 2 (Programming languages as “languages”). When seen syntactically as all possible strings that can be compiled to well-formed byte-code, Java etc. is also a language in the sense we are currently discussing, namely a set of words over Unicode. But when speaking of the “Java language” or other programming languages, one typically has also other aspects in mind (like what a program does when it is executed), which is not covered by thinking of Java as an infinite set of strings.

Remark 3 (Rational and irrational numbers). The illustration on the slides with the two numbers is partly meant as just that: an illustration drawn from a field you may know. The first number from equation (2.1) is a rational number. It corresponds to the fraction

30/11 .        (2.2)

That fraction is actually an acceptable finite representation for the “endless” notation 2.72727272... using “. . . ”. As one may remember, it may pass as a decent definition of rational numbers that they are exactly those which can be represented finitely as fractions of two integers, like the one from equation (2.2). We may also remember that it is characteristic of the “endless” notation as in equation (2.1) that, for rational numbers, it’s periodic. Some may have learnt the notation

2.72   (with an overline over the periodic digits 72)        (2.3)

for finitely representing numbers with a periodic digit expansion (which are exactly the rationals). The second number, of course, is π, one of the most famous numbers which do not belong to the rationals but to the “rest” of the reals which are not rational (and hence called irrational). Thus it’s one example of a “number” which cannot be represented by a fraction, resp. in the periodic way as in equation (2.3).

Well, fractions may not work out for π (and other irrationals), but still, one may ask whether π can otherwise be represented finitely. That, however, depends on what one actually accepts as a “finite representation”. If one accepts a finite description that describes how to construct ever closer approximations to π, then there is a finite representation of π. That construction basically is very old (Archimedes); it corresponds to the limits one learns in analysis, and there are computer algorithms that spit out digits of π for as long as you want (of course they can spit them all out only if you had infinite time). But the code of the algorithm that does that is finite.

The bottom line is: it’s possible to describe infinite “constructions” in a finite manner, but what exactly can be captured depends on what precisely is allowed in the description formalism. If only fractions of natural numbers are allowed, one can describe the rationals but not more.

A final word on the analogy to regular languages. The set of rationals (in, let’s say, decimal notation) can be seen as a language over the alphabet {0, 1, . . . , 9, .}, i.e., the decimal digits and the “decimal point”. It’s, however, a language containing infinite words, such as 2.727272727 . . .. The notation 2.72 with overlined period is a finite expression but denotes the mentioned infinite word (which is a decimal representation of a rational number). Thus, coming back to regular languages resp. regular expressions, that notation is similar to the Kleene star, but not the same. If we write 2.(72)∗, we mean the language of finite words

{2, 2.72, 2.727272, . . .} .

In the same way as one may conveniently define rational numbers (when represented in the alphabet of the decimals) as those which can be written using periodic expressions (using for instance the overline), regular languages over an alphabet are simply those sets of finite words that can be written by regular expressions (see later). Actually, there are deeper connections between regular languages and rational numbers, but that’s not the topic of compiler construction. Suffice it to say that it’s not a coincidence that regular languages are also called rational languages (but not in this course).

Regular expressions

Definition 2.2.3 (Regular expressions). A regular expression is one of the following

1. a basic regular expression of the form a (with a ∈ Σ), or ε, or ∅
2. an expression of the form r | s, where r and s are regular expressions.
3. an expression of the form r s, where r and s are regular expressions.
4. an expression of the form r∗, where r is a regular expression.

Precedence (from high to low): ∗, concatenation, |. By “concatenation”, the third point in the enumeration is meant. It is written or represented without an explicit concatenation operator, just as juxtaposition: ab is the concatenation of the characters a and b, and likewise for concatenating whole words: w1 w2.
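As a sketch, the inductive definition can be mirrored one-to-one by a small data type for the abstract syntax of regular expressions (the class names are invented for illustration; no particular tool is implied):

    // Abstract syntax of regular expressions, one class per case of Definition 2.2.3.
    abstract class Regex { }

    class EmptySet extends Regex { }                        // ∅
    class Epsilon  extends Regex { }                        // ε
    class Letter   extends Regex {                          // a, with a ∈ Σ
        final char a;
        Letter(char a) { this.a = a; }
    }
    class Alt      extends Regex {                          // r | s
        final Regex r, s;
        Alt(Regex r, Regex s) { this.r = r; this.s = s; }
    }
    class Concat   extends Regex {                          // r s
        final Regex r, s;
        Concat(Regex r, Regex s) { this.r = r; this.s = s; }
    }
    class Star     extends Regex {                          // r∗
        final Regex r;
        Star(Regex r) { this.r = r; }
    }

    // Example: b(ab)∗ as a tree -- note that no parentheses are needed here,
    // the nesting of the constructors does the grouping:
    //   new Concat(new Letter('b'),
    //              new Star(new Concat(new Letter('a'), new Letter('b'))));

That constructed trees need no parentheses is exactly the abstract-syntax point of view discussed further below.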

Regular expressions

In Cooper and Torczon [6], ∅ is not part of the regular expressions. For completeness’ sake it’s included here, even if it does not play a practically important role.

In other textbooks, the notation + instead of | for “alternative” or “choice” is also a known convention. The | seems more popular in texts concentrating on grammars. Later, we will encounter context-free grammars (which can be understood as a generalization of regular expressions) and the |-symbol is consistent with the notation for alternatives in the definition of rules or productions in such grammars. One motivation for using + elsewhere is that one might wish to express “parallel” composition of languages, and a conventional symbol for parallel is |. We will not encounter parallel composition of languages in this course. Also, regular expressions using lots of parentheses and | seem slightly less readable for humans than using +.

Regular expressions are a language in themselves, so they have a syntax and a semantics. One could write a lexer (and parser) for the regular-expression notation itself. Obviously, tools like parser generators do have such a lexer/parser, because their input language consists of regular expressions (and context-free grammars, besides syntax to describe further things). One can see regular expressions as a domain-specific language for tools like (f)lex (and other purposes).


A “grammatical” definition

Later introduced as (notation for) context-free grammars:

r → a
r → ε
r → ∅
r → r | r
r → r r
r → r∗
        (2.4)

We will see enough context-free grammars of the form given on this slide and the following ones. They will be central to parsing, and their definition and format will be explained in detail there. Here, it’s a “preview”: we use the context-free grammar notation (known as EBNF) to describe one particular notation, namely the notation known as regular expressions.

Same again

Notational conventions

Later, for CF grammars, we use capital letters to denote “variables” of the grammars (then called non-terminals). If we like to be consistent with that convention in the parsing chapters and use capitals for non-terminals, the grammar for regular expressions looks as follows:

Grammar

R → a
R → ε
R → ∅
R → R | R
R → R R
R → R∗
        (2.5)

Symbols, meta-symbols, meta-meta-symbols . . .

• regexprs: notation or “language” to describe “languages” over a given alphabet Σ (i.e. subsets of Σ∗)
• language being described ⇔ language used to describe the language
  ⇒ language ⇔ meta-language
• here:
  – regular expressions: notation to describe regular languages
  – English resp. context-free notation: notation to describe regular expressions (a notation itself)
• for now: carefully use notational or typographic conventions for precision


To be careful: we will later (when dealing with parsers) distinguish between context-free languages on the one hand and notations to denote context-free languages on the other.

In the same manner here: we now don’t want to confuse regular languages as a concept with the particular notations (specifically, regular expressions) used to write them down.

Notational conventions

• notational conventions by typographic means (i.e., different fonts etc.)
• you need good eyes, but: there is a difference between the bold symbols (regular-expression syntax) and the non-bold ones (meta-level):
  – a and a
  – ε and ε
  – ∅ and ∅
  – | and | (especially hard to see :-))
  – . . .
• later (when we have gotten used to it) we may take a more “relaxed” attitude towards it, assuming things are clear, as do many textbooks.

Remark 4 (Regular expression syntax). We are rather careful with notations and meta-notations, especially at the beginning. Note: in compiler implementations, the distinction between language and meta-language etc. is very real (even if not made by typographic means as in the script here or in textbooks . . . ): the programming language being implemented need not be the programming language used to implement that language (the latter would be the “meta-language”). For example in the oblig: the language to implement is called “Compila”, and the language used in the implementation will (for most) be Java. Both languages have concepts like “types”, “expressions”, “statements”, which are often quite similar. For instance, both languages support an integer type at the user level. But one is the integer type in Compila, the other the integers at the meta-level.

Later, there will be a number of examples using regular expressions. There is a slight “ambiguity” about the way regular expressions are described (in these slides, and elsewhere). It may remain unnoticed (so it’s unclear if I should point it out here). On the other hand, the lecture is, among other things, about scanning and parsing of syntax, therefore it may be a good idea to reflect on the syntax of regular expressions itself.

In the examples shown later, we will use regular expressions with parentheses, like for instance in b(ab)∗. One question is: are the parentheses ( and ) part of the definition of regular expressions or not? That depends a bit. In the presentation here one typically would not care; one tells the readers that parentheses will be used for disambiguation, and leaves it at that (in the same way one would not bother to tell the reader that it’s fine to use “space” between different expressions, i.e., that a | b is the same expression as a  |  b). Another way of saying that is that textbooks, intended for human readers, give the definition of regular expressions as abstract syntax as opposed to concrete syntax. Those two concepts will play a prominent role later in the grammar and parsing sections and will become clearer then. Anyway, it’s thereby assumed that the reader can interpret parentheses as a grouping mechanism, as is common elsewhere as well, and they are left out of the definition so as not to clutter it.


Of course, computers and programs (i.e., in particular scanners or lexers) are not as good as humans at being educated in “commonly understood” conventions (such as the instruction to the reader that “parentheses are not really part of the regular expressions but can be added for disambiguation”). Abstract syntax corresponds to describing the output of a parser (which are abstract syntax trees). In that view, regular expressions (as all notation represented by abstract syntax) denote trees. Since trees in texts are more difficult (and space-consuming) to write, one simply uses the usual linear notation like the b(ab)∗ from above, with parentheses and “conventions” like precedences, to disambiguate the expression. Note that a tree representation represents the grouping of sub-expressions in its structure, so for grouping purposes, parentheses are not needed in abstract syntax.

Of course, if one wants to implement a lexer or to use one of the available ones, one has to deal with the particular concrete syntax of the particular scanner. There, of course, characters like ′(′ and ′)′ (or tokens like LPAREN or RPAREN) will typically occur.

To sum up the discussion, using concepts which will be discussed in more depth later, one may say: whether parentheses are considered part of the syntax of regular expressions or not depends on whether the definition is meant to describe concrete syntax trees or abstract syntax trees!

See also Remark 5 later, which discusses further “ambiguities” in this context.

Same again once more

R → a | ε | ∅              basic reg. expr.
  | R | R | R R | R∗       compound reg. expr.
                                   (2.6)

Note:

• symbol | : (bold) as symbol of regular expressions
• symbol | : (normal, non-bold) meta-symbol of the CF grammar notation
• the meta-notation used here for CF grammars will be the subject of later chapters
• this time: parentheses “added” to the syntax.

This is just a more “condensed” representation of the grammar we have seen before. We will see many examples later when discussing context-free grammars. In particular, note the two “different” versions of the | symbol: one as a syntactic element of regular expressions, one as a symbol used in the context-free grammar on the meta-level, used to describe the syntax of regular expressions. Though these levels are clearly separated, the intended meaning of the symbol is kind of the same: it represents “or”.

Semantics (meaning) of regular expressions

Definition 2.2.4 (Regular expression). Given an alphabet Σ. The meaning of a reg-expr r (written L(r)) over Σ is given by equation (2.7).


L(∅)      = {}                                      empty language
L(ε)      = {ε}                                     empty word
L(a)      = {a}                                     single “letter” from Σ
L(r s)    = {w1w2 | w1 ∈ L(r), w2 ∈ L(s)}           concatenation
L(r | s)  = L(r) ∪ L(s)                             alternative
L(r∗)     = L(r)∗                                   iteration
                                                            (2.7)

• conventional precedences: ∗, concatenation, |.
• Note: left of “=”: reg-expr syntax, right of “=”: semantics/meaning/math²

Explanations

The definition may seem a bit over the top. One could say the meaning of a regular expression is clear enough when described in simple prose. That may actually be the case. But it also means that regular expressions and the meaning of an expression, which is the set of words it describes, are likewise straightforward. Nonetheless, we make the “effort” to define the meaning. First of all, precision does not hurt, within a compiler lecture and outside. In other situations, the question of “what does it mean”, i.e., the question of semantics, becomes more pressing. One can ask the same question about other formalisms later, like the meaning of context-free grammars. Thirdly, in this simple situation, the description of the meaning of a language hopefully makes the different levels clearer: the syntactic level (symbols) and the semantic level resp. the meta-level (math). Of course, “math” is a discipline which has its own symbols and notations. In this particular case of regular expressions, they are pretty close. And of course the description of the semantics using math assumes that the reader is familiar with those notations, so that a definition like L(r | s) = L(r) ∪ L(s) is helpful or more compact than an English description. But of course, it’s just a way of saying “the regular expression symbol | means set union”. Indeed, another motivation is that this form of semantic definition is a form of translation, i.e., “compilation”, in this case from one notational form (regular expressions) to another one (mathematical notation, whose meaning is assumed to be clear). Semantics and translations from one level of abstraction to another are also needed for programming languages themselves, though we don’t go there in this lecture. For instance, in the oblig, the Compila language has to be translated to a lower level. We could have specified the semantics of Compila more formally, though the definition would be much more complicated (and probably use different techniques) than the semantics of regular languages. We could even be more ambitious: not only define the semantics of Compila, but also define the semantics of the language it is compiled to. That would be some form of “byte-code”. After having defined both levels of semantics, one could establish that both semantics do the same. That would be the question of compiler correctness. There are attempts at having a provably (!) correct compiler, or a verifying compiler, though that is exceedingly complex; it’s seen as one of the so-called grand challenges in computer science.

²Sometimes, confusingly, “the same” notation.
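As a small worked example (not from the script itself, but a direct unfolding of the defining equations (2.7)), consider the expression a(b | c):

  L(a(b | c)) = {w1w2 | w1 ∈ L(a), w2 ∈ L(b | c)}        (concatenation)
              = {w1w2 | w1 ∈ {a}, w2 ∈ L(b) ∪ L(c)}       (single letter, alternative)
              = {w1w2 | w1 ∈ {a}, w2 ∈ {b, c}}            (single letters)
              = {ab, ac}

Each step uses exactly one line of equation (2.7); the result is the finite language containing the two words ab and ac.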


Examples

In the following:

• Σ = {a, b, c}.
• we don’t bother to “boldface” the syntax

words with exactly one b        (a | c)∗ b (a | c)∗
words with max. one b           ((a | c)∗) | ((a | c)∗ b (a | c)∗)
                                (a | c)∗ (b | ε) (a | c)∗
words of the form aⁿbaⁿ, i.e., equal number of a’s before and after 1 b
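One can cross-check the first of these rows with a library regex engine; here is a quick sketch using java.util.regex (its | and * have the same core meaning as above, even though the library’s full syntax is much richer):

    import java.util.regex.Pattern;

    // Check "words over {a,b,c} with exactly one b" against (a|c)*b(a|c)*.
    public class ExactlyOneB {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("(a|c)*b(a|c)*");
            String[] inWords  = { "b", "acb", "abcca", "cccbaa" };   // exactly one b
            String[] outWords = { "", "ac", "abb", "bacb" };         // zero or two b's
            for (String w : inWords)  System.out.println(w + " -> " + p.matcher(w).matches()); // true
            for (String w : outWords) System.out.println(w + " -> " + p.matcher(w).matches()); // false
        }
    }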

Another regexpr example

words that do not contain two b’s in a row.

(b (a | c))∗                          not quite there yet
((a | c)∗ | (b (a | c))∗)∗            better, but still not there
   = (simplify)
((a | c) | (b (a | c)))∗
   = (simplify even more)
(a | c | ba | bc)∗
(a | c | ba | bc)∗ (b | ε)            potential b at the end
(notb | b notb)∗ (b | ε)              where notb ≜ a | c

Remark 5 (Regular expressions, disambiguation, and associativity). Note that in the equations in the example, we silently allowed ourselves some “sloppiness” (at least for the nitpicking mind). The slight ambiguity depends on how exactly we interpret the definition of regular expressions. Remember also Remark 4 on page 38, discussing the (non-)status of parentheses in regular expressions. If we think of Definition 2.2.3 on page 36 as describing abstract syntax and a concrete regular expression as representing an abstract syntax tree, then the constructor | for alternatives is a binary constructor. Thus, the regular expression

a | c | ba | bc (2.8)

which occurs in the previous example is ambiguous. What is meant would be one of thefollowing

a | (c | (ba | bc))        (2.9)
(a | c) | (ba | bc)        (2.10)
((a | c) | ba) | bc ,      (2.11)

corresponding to 3 different trees, where occurrences of | are inner nodes with two children each, i.e., sub-trees representing subexpressions. In textbooks, one generally does not want to be bothered by writing all the parentheses. There are typically two ways to disambiguate the situation. One is to state (in the text) that the operator, in this case |, associates to the left (alternatively, that it associates to the right). That would mean that the “sloppy” expression without parentheses is meant to represent either (2.9) or (2.11), but not (2.10). If one really wants (2.10), one needs to indicate that using parentheses. Another way of finding an excuse for the sloppiness is to realize that (in the context of regular expressions) it does not matter which of the three trees (2.9) – (2.11) is actually meant. This is specific to the setting here, where the symbol | is semantically represented by set union ∪ (cf. Definition 2.2.4 on page 39), which is an associative operation on sets. Note that, in principle, one may choose the first option —disambiguation via fixing an associativity— also in situations where the operator is not semantically associative. As illustration, use the ’−’ symbol with the usual intended meaning of “subtraction” or “one number minus another”. Obviously, the expression

5− 3− 1 (2.12)

now can be interpreted in two semantically different ways, one representing the result 1,and the other 3. As said, one could introduce the convention (for instance) that the binaryminus-operator associates to the left. In this case, (2.12) represents (5− 3)− 1.

Whether or not in such a situation one wants symbols to be associative is a judgement call (a matter of language pragmatics). On the one hand, disambiguating may make expressions more readable by allowing one to omit parentheses or other syntactic markers which may make the expression or program look cumbersome. On the other hand, the “light-weight” and “easy-on-the-eye” syntax may trick the unsuspecting programmer into misconceptions about what the program means, if unaware of the rules of associativity and priorities. Disambiguation via associativity rules and priorities is therefore a double-edged sword and should be used carefully. A situation where most would agree associativity is useful and completely unproblematic is the one illustrated for | in regular expressions: it does not matter anyhow semantically. Decisions concerning when to use ambiguous syntax plus rules for how to disambiguate (or forbid it, or warn the user) occur in many situations in the scanning and parsing phases of a compiler.

Now, the discussion concerning the “ambiguity” of the expression (a | c | ba | bc) from equation (2.8) concentrated on the |-construct. A similar discussion could obviously be made concerning concatenation (which actually here is not represented by a readable concatenation operator, but just by juxtaposition (= writing expressions side by side)). In the concrete example from (2.8), no ambiguity wrt. concatenation actually occurs, since expressions like ba are not ambiguous, but for longer sequences of concatenation like abc, the question of whether it means (ab)c or a(bc) arises (and again, it’s not critical, since concatenation is semantically associative).

Note also that one might think that the expression suffers from an ambiguity concerning combinations of operators, for instance, combinations of | and concatenation. For instance, one may wonder if ba | bc could be interpreted as (ba) | (bc), as b(a | (bc)), or as b(a | b)c. However, in Definition 2.2.4 on page 40, we stated precedences or priorities, stating that concatenation has a higher precedence than |, meaning that the correct interpretation is (ba) | (bc). In a textbook the interpretation is “suggested” to the reader by the typesetting ba | bc (the notation would be slightly less “helpful” if one wrote ba|bc . . . and what about the programmer’s version a b|a c?). The situation with precedences is one where different precedences lead to semantically different interpretations.


Even if there is therefore a danger that programmers/readers misinterpret the real meaning (being unaware of precedences or mixing them up in their head), using precedences in the case of regular expressions certainly is helpful. The alternative of being forced to write, for instance,

((a(b(cd))) | (b(a(ad)))) for abcd | baad

is unappealing even to hard-core Lisp-programmers (but who knows ...).

A final note: all this discussion about the status of parentheses or left- or right-associativity in the interpretation of (for instance mathematical) notation is mostly over-the-top for most mathematics or other fields where some kind of formal notation or language is used. There, notation is introduced, perhaps accompanied by sentences like “parentheses or similar will be used when helpful” or “we will allow ourselves to omit parentheses if no confusion may arise”, which means the educated reader is expected to figure it out. Typically, thus, one glosses over too detailed syntactic conventions to proceed to the more interesting and challenging aspects of the subject matter. In such fields one is furthermore sometimes so used to notational traditions (“multiplication binds stronger than addition”), perhaps established for decades or even centuries, that one does not even think about them consciously. For scanner and parser designers, the situation is different; they are requested to come up with the notational (lexical and syntactical) conventions of perhaps a new language, specify them precisely, and implement them efficiently. Not only that: at the same time, one aims at a good balance between explicitness (“Let’s just force the programmer to write all the parentheses and grouping explicitly, then he will get fewer misconceptions of what the program means (and the lexer/parser will be easy to write for me . . . )”) and economy in syntax, leaving many conventions, priorities, etc. implicit without confusing the target programmer.

Additional “user-friendly” notations

r+ = rr∗

r? = r | ε

Special notations for sets of letters:

[0 − 9]    range (for ordered alphabets)
~a         not a (everything except a)
.          all of Σ

naming regular expressions (“regular definitions”)

digit      = [0 − 9]
nat        = digit+
signedNat  = (+ | −) nat
number     = signedNat (”.” nat)? (E signedNat)?
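For illustration only (this is not part of the script’s definitions), the number definition above corresponds roughly to the following java.util.regex pattern; a lexer generator like JLex has its own concrete syntax, but the sugar (+, ?, ranges) plays the same role:

    import java.util.regex.Pattern;

    public class NumberRegex {
        public static void main(String[] args) {
            // digit = [0-9], nat = digit+, signedNat = (+|-) nat, etc.
            // Note: signedNat as written above requires a sign, so we follow that literally.
            Pattern number = Pattern.compile("[+-][0-9]+(\\.[0-9]+)?(E[+-][0-9]+)?");

            System.out.println(number.matcher("+3").matches());        // true
            System.out.println(number.matcher("-3.14E+3").matches());  // true
            System.out.println(number.matcher("3.14").matches());      // false: no leading sign
        }
    }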

The additional syntactic constructs may come in handy when using regular expressions, but they don’t extend the expressiveness of the formalism. That’s pretty obvious from the way the extensions are defined. Note that we don’t explain the meaning or semantics of the new constructs in the same way as for the core constructs (defining L and giving their mathematical interpretation). Instead, we expand the new constructs and express them in terms of the old syntax. They are treated as syntactic sugar, as one says.

Tools, utilities, and libraries working with regular expressions (like lex) typically support sugared versions, though the exact choice of notation for the constructs may vary.

As mentioned, there are also so-called extended regular expressions, where the extensions make the formalism more expressive than the core formalism (so those extensions are not syntactic sugar then).

One could look at the collection of constructors for the syntax of regular languages, including the sugar, and wonder whether there aren’t some missing. For example, we have in the language a form of “or” (disjunction), written |; one could ask why there is no “and” (conjunction, intersection), for instance. That’s indeed an interesting case, insofar as it is an example which is not syntactic sugar on the one hand, but on the other hand does not extend the expressiveness for real. If one had regular expressions containing an “and”, then one can always find a different regular expression with the same meaning, without the “and”. However, the transformation would not be of the same nature as for the syntactic sugar we added: the conjunction cannot just be expanded away; consequently one would not call that addition syntactic sugar. There exist other constructs that are non-sugar but do not add expressiveness (negation or complementation, for example).

Mostly, such constructs like intersection or complementation are not part of the regular expression syntax, even though theoretically one would not leave the class of regular languages (= languages that can be expressed by regular expressions). Why are those then left out? It’s probably a matter of pragmatics. One does not really need them for many things one wants to do with regular expressions, like describing the lexical aspects of a language for a lexer. One wants to classify strings, and one is content by saying “It’s whitespace (which is this or this or this), or it’s a number, or it’s an identifier, or a bracket . . . ”. Given also the fact that adding conjunction or negation or other non-sugar ingredients would make some of the following constructions more complex, there is no real motivation to support conjunction. By the “following constructions” I mean basically the translation of a regular expression into a (non-deterministic) finite state automaton. This translation, called Thompson’s construction, will be covered later in this chapter. A tool like lex does this construction (followed by other steps). The construction is fairly simple, but conjunctions and complementations would drive up the size of the resulting automata. For intersection, for instance, one would need a form of product construction, which is also conceptually more complex than the straightforward, compositional algorithm underlying Thompson’s construction. Actually, it would not be so bad, since if one avoids using conjunction or negation, the size of the result would not blow up; so the reason why regular expressions don’t typically support those more complex operators is that, pragmatically, no one misses them for the task at hand, at least not for lexers.

Ordered alphabet

We have defined an alphabet as a (finite) set of symbols. In practice, alphabets or character sets are not just sets, which are unordered, but are seen as ordered. Each symbol of the alphabet has a “number” associated to it (a binary pattern) which corresponds to its place in the order of the sequence of symbols. One of the simplest and earliest established ordered alphabets in the context of electronic computers is the well-known ASCII alphabet. See Figure 2.1.

Figure 2.1: ASCII reference card

Having the alphabet ordered is one thing, having a “good” order or arrangement is a different one. The reference card shows some welcome properties, for instance, that all lower-case letters are contiguous and in the “expected” order, same for the capital letters. Since the designers of ASCII arranged it in that way, one can support specifying all lower-case letters as [a-z] and capital letters as [A-Z]. What does not work is having all letters as [a-Z], since, in ASCII, the letters are not arranged like that. The capital letters come before the lower-case letters, but also [A-z] would not work as intended, as there is a “gap” of other symbols between the lower-case and the upper-case letters. Isn’t that stupid? Actually not: the arrangement, as made clear in the figure, is such that the operation of turning a lower-case letter into an upper-case letter is a matter of flipping bits. Another rational decision is to place the decimal digits so that they partly “align” with their binary representations. It’s not that 0, 1, etc. are exactly the corresponding bit patterns, but at least parts of the word correspond to the binary pattern. Anyway, details like that don’t matter too much for us, but one has to be aware of the concept of ordered alphabets as such, in order to specify, for example, all letters as [a-zA-Z] (or [A-Za-z]). Many encodings are nowadays extensions or variations of ASCII, and also for those, specifications like [a-z] work. For instance, UTF-8. As a side remark: Ken Thompson (the one from Thompson’s construction) was involved in working out UTF-8, an encoding that includes ASCII insofar as it’s identical with ASCII in its first part. Of course, there are very many variations of UTF (and of the Unicode symbol set, of which UTF is an encoding scheme).
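The bit-flipping mentioned above can be made concrete in a couple of lines; this is just an illustration of the ASCII layout and is valid only for the letters A–Z/a–z:

    public class AsciiCase {
        public static void main(String[] args) {
            // In ASCII, 'a' = 0x61 and 'A' = 0x41: they differ exactly in bit 5 (0x20).
            // Flipping that bit converts between lower and upper case -- for letters only.
            char lower = 'g';
            char upper = (char) (lower ^ 0x20);
            System.out.println(lower + " -> " + upper);                         // g -> G
            System.out.printf("0x%02x vs 0x%02x%n", (int) lower, (int) upper);  // 0x67 vs 0x47
        }
    }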

There are, however, also alternatives to ASCII, not just extensions. One is the Extended Binary Coded Decimal Interchange Code (EBCDIC) (actually, also for EBCDIC there are many variations). EBCDIC is perhaps mostly of historic interest as it is supported mainly by IBM mainframes and larger such computers. I mention EBCDIC here because the encoding has the unfortunate property that, for instance, the capital letters are not contiguous.


For such an encoding, of course, things like [A-Z] make no sense. The encoding would have other negative consequences; for instance, sorting a list of words is more tricky (or at least less efficient). However, EBCDIC still lives on; there exist Unicode encodings based on it (as opposed to based on extensions of ASCII like UTF-8), which consequently are called UTF-EBCDIC. Once some things are standardized, they never die out completely (and actually EBCDIC just inherits properties of punchcards, which existed before the modern electronic computer, into the new era). In that context, it’s not a coincidence that IBM, which was a big name in “punchcard processing equipment”, stuck to aspects of the encoding when it became a big name in electronic computers and mainframes.

2.3 DFA

In this section and the following we introduce the very central notion of finite state automata and cover their close relation to regular expressions. Finite state automata are well-studied and play an important role also beyond their use for lexing. There are many different variations of finite state automata, also under different names. Some of them are mentioned on the slides. Such automata in their classic form are pretty simple objects, basically graphs with labelled edges, where some nodes are singled out as start or initial nodes and some as final or accepting nodes. What makes such “graphs” automata or machines is their operational interpretation: they are seen as mechanisms that “run” or do steps. The nodes of the “graph” are seen as states the machine can be in. The edges are transitions. It’s assumed that “executions” of the machine start in one of the initial states, and when in a final state, the execution ends, or more precisely, can end. The mental picture of some entity, being in some discrete state, starting somewhere, doing steps or transitions one after the other, is of course super-general and very unspecific. Basically all mechanized computing can be thought of operationally that way, going from one state to a next one and so on.

As the name indicates, specific here is that the number of states is finite. That’s a strong restriction. Finite state automata are an important model of computation. It is also a model for hardware circuits, more specifically discrete, “boolean” circuits, not analogue hardware. It’s clear that a binary circuit can be only in a finite number of states, and finite state machines are a good model for describing such hardware. The automata in that case are a bit more elaborate than the ones we use here; in particular, one would use automata that don’t have an unstructured alphabet, but one would conceptually distinguish between input and output (though possibly over the same alphabet). There are different ways one can do that: one can connect the output to the states, or alternatively to the edges. Those two styles of finite-state input/output automata are called Moore machines (= output on the states) resp. Mealy machines (= output on the transitions). The two different models would also require different styles of hardware realization, but those things are not important for us.

For lexing, we are handling automata with an unstructured alphabet, without distinguishing input from output. Such a single-alphabet automaton can be “mentally seen” as if the edges generate the letters (i.e., the letters are the output). With this view, a given automaton generates a language, i.e., the set of all sequences of letters that lead from an initial to an accepting state. Alternatively, one can see the letters on the edges as input; in this view, such machines are seen as recognizers or acceptors. The final states of an automaton are also called accepting states. Anyway, that view of acceptors is also the appropriate one for lexing or scanning. The letters of the alphabet are the characters from the input, the machine moves along and accepts a word (a lexeme of the language being scanned), and the accepting state corresponds to the token class (for instance, an identifier, or a number etc.).

Coming back to the issue of finite state I/O automata we brushed upon: actually, the lexer in the context of a compiler can be seen as involving input and output. The characters are the input, the tokens (token class and token value) are the output, and parsing a file means making iterated use of that arrangement, handing over a token stream to the parser.

We (as basically all compiler books) focus on the classic theory of finite state automata, ignoring, as far as the theory is concerned, the token-output part. This is also the part which connects with the regular expressions from before. Regular expressions specify the lexical aspects of the language, and finite state automata are the execution mechanism to accept the corresponding lexemes. Of course, concretely, tools like lex need to arrange also for the token-output part, but if one has the input part under control, there is not much to understand there.

One aspect that is important is the question of determinism vs. non-determinism. Determinism in computational situations means: there is (at most) one next state. Non-determinism means there is potentially more than one; the future is not determined. For finite state automata, it’s more precisely as follows: given a state and given an (input) symbol, say a, there is (at most) one successor reachable via an a-transition. One can also say: there is at most one a-successor. In other words, the current state and the input determine the next state (if any).

That’s highly desirable in a lexer: the lexer scans one letter after the other, and it’s not supposed to make guesses how to proceed. Doing so would lead to the danger of backtracking: in case the guess turns out to reject the input later down the line, the lexer has to explore alternatives to find out if any of them could lead to accepting the input nonetheless. That’s a horrible way to scan the input.

The good news is: one can avoid that. Intuitively, the way to do it is to replace a non-deterministic automaton by a different, but equivalent one, that conceptually explores all alternatives “at the same time”. The determinization algorithm is known as the powerset construction and is pretty straightforward and pretty natural.

Determinisation of automata-like formalisms

As a side remark concerning the naturalness of the determinization procedure, or a word of caution: it’s true, it’s natural. However, strangely perhaps, it does not work “universally”. For instance, there are other automata-based formalisms that look quite similar. One such is finite-state automata that don’t work on finite words (as we do) but on infinite words. Or finite-state automata that work on trees (either working top-down or bottom-up). We will not encounter those. What we will encounter, though, is a particular form of “infinite state automaton” known as push-down automaton. Those, by having an infinite amount of memory, are more expressive than finite state automata. They are central for parsing (not lexing) of context-free languages. The amount of memory for push-down automata is infinite, but not “random access”, i.e. one can only access the top of a stack (by pushing and popping content), and this restriction fits with context-free languages (in the same way that the finite-state restriction fits with regular languages).

Anyway, for all those automata-like constructions, there are deterministic and non-deterministic variants, in the sense that the respective input determines their reaction or not. However, the powerset construction would not work for those, which means non-deterministic versions of those are strictly more expressive than deterministic ones (with the exception of bottom-up tree automata, where determinism vs. non-determinism does not matter, just as for the finite-state machines we are dealing with). Perhaps also interesting: for Turing machines, which can be seen as machines with finite control and an infinite amount of random-access memory (not just a stack), determinism is again not a restriction.

All that is meant just as a caution not to assume that the powerset construction can be transported “obviously” to other settings . . .

Finite-state automata

• simple “computational” machine
• (variations of) FSAs exist in many flavors and under different names
• other well-known names include finite-state machines, finite labelled transition systems, . . .
• “state-and-transition” representations of programs or behaviors (finite state or else) are wide-spread as well
  – state diagrams
  – Kripke-structures
  – I/O automata
  – Moore & Mealy machines
• the logical behavior of certain classes of electronic circuitry with internal memory (“flip-flops”) is described by finite-state automata.

Historically, the design of electronic circuitry (not yet chip-based, though) was one of theearly very important applications of finite-state machines.

Remark 6 (Finite states). The distinguishing feature of FSA (as opposed to more powerful automata models such as push-down automata, or Turing machines) is that they have “finitely many states”. That sounds clear enough at first sight. But one has to be a bit more careful. First of all, the set of states of the automaton, here called Q, is finite and fixed for a given automaton, all right. But actually, the same is true for pushdown automata and Turing machines! The trick is: look at the illustration of the finite-state automaton earlier, where the automaton had a head. The picture corresponds to an accepting use of an automaton, namely one that is fed by letters on the tape, moving internally from one state to another, as controlled by the different letters (and the automaton’s internal “logic”, i.e., transitions). Compared to the full power of Turing machines, there are two restrictions, things that a finite state automaton cannot do:

• it moves on one direction only (left-to-right)• it is read-only.


All non-finite-state machines have some additional memory they can use (besides q0, . . . , qn ∈ Q). Push-down automata, for example, additionally have a stack; a Turing machine is allowed to write freely (= moving not only to the right, but back to the left as well) on the tape, thus using it as external memory.

FSA

Definition 2.3.1 (FSA). An FSA A over an alphabet Σ is a tuple (Σ, Q, I, F, δ) where

• Q: finite set of states
• I ⊆ Q, F ⊆ Q: initial and final states
• δ ⊆ Q × Σ × Q: transition relation

• final states: also called accepting states
• transition relation: can equivalently be seen as a function δ : Q × Σ → 2^Q: for each state and for each letter, it gives back the set of successor states (which may be empty)
• more suggestive notation: q1 −a→ q2 for (q1, a, q2) ∈ δ
• we also use freely —self-evident, we hope— things like q1 −a→ q2 −b→ q3

The definition given is fairly standard, and whether one sees δ as a relation or as a function is, of course, equivalent. One often uses graphical representations to illustrate such automata; we will encounter numerous examples.

FSA as scanning machine?

• FSAs have slightly unpleasant properties when considering them as describing an actual program (i.e., a scanner procedure/lexer)
• given the "theoretical definition" of acceptance:

The automaton eats one character after the other, and, when reading a letter, it moves to a successor state, if any, of the current state, depending on the character at hand.

• 2 problematic aspects of FSAs
  – non-determinism: what if there is more than one possible successor state?
  – undefinedness: what happens if there is no next state for a given input?
• the 2nd one is easily repaired, the 1st one requires more thought
• [6]: recognizer corresponds to DFA

Non-determinism

We touched upon the issue in the introduction of the chapter already: non-determinism is "problematic". One could try backtracking, but you definitely don't want that in a scanner. And even if you think it's worth a shot: how do you scan a program directly from magnetic tape, as done in the bad old days? Magnetic tapes can be rewound, of course, but winding them back and forth all the time destroys the hardware quickly. How should one scan network traffic, packets etc. on the fly? The network definitely cannot be rewound. Of course, buffering the traffic would be an option, doing the backtracking on the buffered traffic, but maybe the packet-scanning-and-filtering should be done in hardware/firmware, to keep up with today's enormous traffic bandwidth. Hardware-only solutions have no dynamic memory and therefore ultimately are finite-state machines with no extra memory. As hinted at in the introduction: there is a way to turn a non-deterministic finite-state automaton into a deterministic version. We start by first defining the concept of determinism, resp. what constitutes a deterministic automaton.

DFA: deterministic and total automata

Definition 2.3.2 (DFA). A deterministic finite automaton A (DFA for short) over an alphabet Σ is a tuple (Σ, Q, I, F, δ) where

• Q: finite set of states
• I = {i} ⊆ Q, F ⊆ Q: initial and final states
• δ : Q × Σ → Q: transition function

• transition function: special case of the transition relation:
  – deterministic
  – left-total ("complete")

Depending on which text one consults, the definition of DFA slightly differs. It's not a fundamental disagreement, it's more a question of terminology. It concerns the notion of determinism, namely whether being a deterministic automaton includes "totality" of the transition relation/transition function or not. In other words: for each state and each letter a, is there exactly one a-successor, or at most one? One could make the argument that determinism means the latter: at each state, and for each input, the reaction is fixed: one either moves to one particular successor state, or else is "stuck". That corresponds to a definition where δ is a partial function, unlike the definition given, where δ is a total function. So, our definition of DFA means the automaton is deterministic and total. Some would say a deterministic finite-state automaton need not be total (totality being a separate property the automaton enjoys or not). But no one would say that an automaton which has only a partial successor function is non-deterministic.

Actually, it's a terminology question and does not matter much; basically it says: a DFA is a deterministic and total finite-state automaton (but we won't bother to call it DTFA or something). The reason why it does not matter much is that there really is not much of a difference anyway. An automaton with a partial transition function can always be completed into a total one by adding an extra non-accepting state, covering the situations when the partial automaton would otherwise be "stuck". That's so obvious that one need not bother talking about it much. Also later, when showing graphical representations of automata: when talking about DFAs (and when we want to really stress that they are total), we might still leave out the extra state in the figure; it's just assumed that one understands that it's there.

As far as implementations of automata are concerned (for instance for lexing purposes): the "partial transition function" is also not too realistic. If the lexer eats one symbol which, at that point, is illegal, and for which there is no successor state, the lexer (and the overall compiler) should not simply stop or deadlock. It will eat the symbol and inform the surrounding program (the parser, the compiler) that this situation occurred. It indicates a form of error (a lexical error in the input); since we are dealing with a deterministic automaton, there cannot be an alternative reading of the input that would have avoided that the lexer got stuck (or moved to a non-accepting state, or raised an exception etc.). So, turning an automaton into a "total" or "complete" one is a non-issue, but removing non-determinism from an automaton is an issue. We will discuss determinization later.

For a relation, being left-total means that for each pair (q, a) from Q × Σ, δ(q, a) is defined. When talking about functions (not relations), it simply means the function is total, not partial.

Some people call an automaton where δ is not left-total but deterministic as a relation (or, equivalently, where the function δ is not total, but partial) still a deterministic automaton. In that terminology, the DFA as defined here would be deterministic and total.

Meaning of an FSA

The intended meaning of an FSA over an alphabet Σ is the set of all the finite words that the automaton accepts.

Definition 2.3.3 (Accepted words and language of an automaton). A word c1c2 . . . cn with ci ∈ Σ is accepted by an automaton A over Σ, if there exist states q0, q1, . . . , qn from Q such that

q0 −c1→ q1 −c2→ q2 −c3→ . . . qn−1 −cn→ qn ,

and where q0 ∈ I and qn ∈ F. The language of an FSA A, written L(A), is the set of all words that A accepts.
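To make the definition a bit more concrete, here is a small Python sketch of acceptance in the sense of Definition 2.3.3 (for automata without ε-transitions). The representation of δ as a dictionary from (state, letter) pairs to sets of successor states, as well as the little example automaton, are assumptions made only for this illustration.

# Acceptance per Definition 2.3.3: track the set of states reachable
# from the initial states via the letters read so far.
def accepts(delta, initial, final, word):
    current = set(initial)
    for c in word:
        current = {q2 for q in current for q2 in delta.get((q, c), set())}
    return bool(current & set(final))

# Made-up example: words over {a, b} that end in "ab"
delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
print(accepts(delta, {"q0"}, {"q2"}, "aab"))   # True
print(accepts(delta, {"q0"}, {"q2"}, "aba"))   # False

Note that the sketch already copes with non-determinism (several initial states, several successors) by tracking sets of states; this idea will reappear as the subset construction later in the chapter.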

FSA example

The figure shows the conventions we will use throughout. The initial states, in this case only one, are marked by an "incoming" arrow. The accepting states, in this case also only one, are marked by a double ring. The automaton is deterministic, though not complete. We could interpret the automaton as a complete DFA, with one extra non-accepting state.

[Figure: example FSA with states q0, q1, and q2 and transitions labelled a, b, and c]

Example: identifiers

identifier = letter(letter | digit)∗ (2.13)

[Figure: two automata for identifiers — an incomplete DFA with states start and in_id, and a complete version with an additional error state reached on any other input]

The example shows an automaton for identifiers, as they could appear in this or a similar form in the lexer specifications of typical programming languages. They are specified using regular expressions, so it's also an illustration of how a regular expression can be translated into an automaton. The exact construction (which will be presented in three stages) will be covered later in this chapter, but the example is so simple that one can easily come up with a deterministic automaton corresponding to the regular expression. Two versions are shown, one which is incomplete, and a second one with an extra state added (called error, but the names of the states don't matter, they are only there to help the human reader).

Automata for numbers: natural numbers

digit = [0 − 9]
nat = digit+        (2.14)

[Figure: DFA for nat — one digit-transition into an accepting state, which carries a digit self-loop]

One might say it's not really about the natural numbers, it's about a decimal notation of natural numbers (as opposed to other notations, for example Roman numerals). Note also that leading zeroes are allowed here. It would be easy to disallow that. Another remark: we make use of a user-friendly feature supported in many applied versions of regular expressions, a form of syntactic sugar. That's the possibility to use definitions or abbreviations. We give a name to the regular expression [0 − 9], and that abbreviation digit is used for defining nat. That certainly makes regular expressions more readable, and we will continue this way of building larger concepts from simpler ones in the following slides.

Also [0 − 9] and the + symbol are syntactic sugar on top of the core regular expression syntax.

Signed natural numbers

signednat = (+ | −)nat | nat (2.15)

[Figure: DFA for signednat — an optional + or − transition followed by the digit automaton for nat]

Again, the automaton is deterministic (but not total). It's easy enough to come up with this automaton, but the non-deterministic one (on the next slide) is probably more straightforward to come up with. Basically, one informally does two "constructions": one is the "alternative", which simply consists of writing down "two automata", i.e., one automaton which is basically the union of the two automata. In this example, it therefore has two initial states (which is obviously disallowed for deterministic automata). The other implicit construction is "sequential composition".


Signed natural numbers: non-deterministic

[Figure: non-deterministic automaton for signednat with two initial states — one branch reading + or − followed by digits, one branch reading digits directly]

The automaton is non-deterministic due to the fact that there are two initial states.

Fractional numbers

frac = signednat(”.”nat)? (2.16)

[Figure: DFA for frac — the signednat automaton, whose accepting state optionally continues with a "." transition followed by the nat automaton]

Note the "optional" clause in the regular expression and the corresponding fact that the automaton has multiple accepting states. Note that this does not count as non-determinism. If one considers the automaton as some abstract form describing a scanner, one could make the argument that there is non-determinism involved. It's true that, if one gives the automaton a sequence of letters, that determines its end-state (and thus whether the word is accepted or not). However, the lexer's task will also be to segment the input and decide when a word is done (and then tokenized) and when not. In the given automaton, after having reached the first accepting state, one can make the argument that, if there is a dot following, the automaton has to make the decision whether to accept the word or to continue. That sounds like a non-deterministic choice (and actually, seen like that, it would be a non-deterministic choice).

It just means that a deterministic automaton is not in itself a scanner. A scanner deals repeatedly with accepting words (and with segmenting), whereas a finite-state automaton deals only with the question whether a given word (already segmented, so to say) is acceptable or not. The current section deals just with acceptance or rejection of one word, and also the standard, classical definition of determinism for automata is only concerned with that question.

Floats

digit = [0 − 9]
nat = digit+
signednat = (+ | −)nat | nat
frac = signednat(”.”nat)?
float = frac(E signednat)?        (2.17)

• Note: no (explicit) recursion in the definitions
• note also the treatment of digit in the automata.

DFA for floats

[Figure: DFA for float — the frac automaton extended with an optional E followed by the signednat automaton]

DFAs for comments

Pascal-style

[Figure: DFA for Pascal-style comments — {, then a loop on other characters, closed by }]

C, C++, Java

[Figure: DFA for C/C++/Java comments — / and ∗ to enter, a loop reading other characters and ∗'s, and ∗ followed by / to leave]

2.4 Implementation of DFAs

DFAs underlie the implementation of lexers. The notion as such is simple enough, but a concrete lexer has to cover slightly more than the purely theoretical treatment so far. One thing is that the lexer needs to be coupled with the parser, feeding it one token after the other. I.e., in the implementation, the "automaton" has not just sequences of characters as input, but also sequences of tokens as output. Related to that, the lexer is not just the implementation of one DFA, it's a loop that repeatedly invokes the DFA. Another aspect of the regular expressions resp. the DFA is the need for "priorities". We have mentioned the issue when discussing regular expressions: for instance, when confronted with the string <=, that is conventionally scanned as less-or-equal, and not as a < followed by a =.

That aspect of "longest scan" is not really covered by the notion of DFA (nor by non-deterministic automata). That has to do with the way automata "accept" words. They start in their initial state and eat through the input, but when it comes to acceptance in a lexer, it is misleading to think: if you hit an accepting state, accept the word (and return the corresponding token). Sure, if an automaton hits an accepting state, the word seen so far is accepted, resp. belongs to the language the automaton describes. But it's presumably not the only word: there may be another run of the automaton (even if it is a deterministic one and is fed the same word as a prefix) that reaches the accepting state and then continues, perhaps accepting later down the road a longer word which extends the one the automaton could accept right now. Of course, there is no guarantee that such a longer word exists. Anyway, a lexer explores one word only and makes decisions "on the spot" (being deterministic), preferring longer scans over shorter ones. When hitting an accepting state, it checks if it can proceed and still accept the (extended) word. If not, it accepts the word as is. This priority of proceeding as long as possible and favoring longer words over shorter prefixes is not directly covered by the theoretical treatment of FSAs so far, but it's an aspect an implementation has to deal with.
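A common way to realize this preference for longer matches (sometimes called "maximal munch") in an implementation is to remember the input position of the last accepting state encountered and to fall back to it when the automaton gets stuck. The following Python sketch illustrates the idea; the dictionary representation of the (partial) transition function and the helper names are assumptions made only for this illustration, not part of the course code.

# Longest-match scanning on top of a (partial) DFA.
# delta: dict (state, char) -> state;  final: set of accepting states.
def longest_match(delta, initial, final, text, pos):
    state = initial
    last_accept = -1                 # end of the longest accepted prefix so far
    i = pos
    while i < len(text) and (state, text[i]) in delta:
        state = delta[(state, text[i])]
        i += 1
        if state in final:
            last_accept = i          # remember: text[pos:i] is acceptable
    if last_accept < 0:
        raise ValueError("lexical error at position %d" % pos)
    return text[pos:last_accept], last_accept   # lexeme, and where to resume

Characters read beyond last_accept are simply re-examined when the lexer is asked for the next token; that is exactly the "non-advancing" behavior described by the [other] notation introduced below.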

This section also touches on data structures one can use to implement DFAs, resp. scanners. We don't go too deep; basically we sketch how one could use tables to represent automata (which can be realized for instance by two-dimensional arrays or in other ways). Tools like lex allow the compiler writer to ignore such details, as that's what those tools do: generate appropriate data structures representing the DFA, taking care also of the other aspects mentioned, and interfacing with the parser component of a compiler. Later in this section, we introduce a notation labelling edges of the DFA with [ and ], for instance writing [other]. The meaning will be that this is a way to describe the "longest match" discipline. For instance, assume an automaton designed to accept a word defined as a sequence of letters; that could be described as [a − zA − Z]∗, making use of ranges in ordered alphabets. The edge label other will be used to abbreviate all "other" symbols. More technically, in a state with outgoing edges labelled by some symbols, an outgoing edge labelled other represents all symbols not covered by those outgoing edges. It should be self-evident, especially for a deterministic automaton, that there can be only one outgoing edge labelled other. So far, that has nothing yet to do with the point discussed here, namely prioritizing longer matches. To do that, one uses edges annotated with [ and ], as said. Note: it may be unfortunate, but this notation is meant to do something else than the character-range notation [a − zA − Z] used in regular expressions. What is meant, then? Well, a transition labelled [a] means: if a is next, move to the next state without actually consuming or "eating" the a as input. Concretely, we often use transitions labelled [other], moving to an accepting state. That is the way to represent the longest match: we continue eating symbols, like lower- and upper-case letters, but without accepting the string yet. When we hit a symbol other than a letter, we proceed and accept the string, but the very last symbol is not part of the word we just processed.

Implementation of DFA (1)

Remember the DFA for a possible form of identifiers from earlier, in connection with equation (2.13). We had two versions, an incomplete one and a complete one with an extra "error" state.

[Figure: DFA for identifiers with states start, in_id, and finish — a letter transition from start to in_id, letter/digit self-loops on in_id, and a non-advancing [other] transition from in_id to the accepting state finish]

This one is deterministic, but it's not total or complete. The transition function is only partial. The "missing" transitions are often not shown (to make the pictures more compact). It is then implicitly assumed that encountering a character not covered by a transition leads to some extra "error" state (which simply is not shown).

The [ ] around the transition other at the end means that the scanner does not move forward in the input there (but the automaton proceeds to the accepting state). That is something that is not 100% in the "mathematical theory" of FSAs, but it is how the implementation in the scanner will behave. Note also that the accepting state has changed: we have an extra state which we move to by the special kind of transition [other]. As the name implies, "other" means all symbols different from the ones already covered by the other outgoing edges. This is used to realize the longest prefix: the DFA shown does not just accept "some" identifier it spots in the input, i.e., an arbitrary sequence of letters and digits (starting with a letter). No, it takes as many letters and digits as possible, until it encounters a character not fitting the specification, but not earlier. Only at that point does the automaton accept, but without advancing the input, in that this character will have to be scanned and classified as part of the "next chunk" by the "next automaton".

Implementations

The following shows rather "sketchy" pseudo-code for how part of a lexer can be programmed or represented. It's one loop, and it represents how to accept one lexeme. As mentioned at the beginning of this section, the task of a lexer in the context of a compiler is to repeatedly accept one lexeme after the other (or reject the input and stop) and hand over a corresponding stream of tokens. This need for repeated acceptance does not mean that there is another loop around the while-loop shown in the pseudo-code. At least, the mentioned outermost loop is not part of the lexer. The lexer and the parser work hand in hand, and often that's arranged so that the lexer works "on demand" for the parser: the parser invokes the lexer ("give me a new token"), the lexer has of course remembered the position in the input from the last invocation and, starting from there, tries to determine the next lexeme and, if successful, gives back the corresponding token to the parser. The parser determines if, at that point in the parsing process, the token fits the syntactic description of the language, and if so, adds a next piece (at least implicitly) to the parse tree being built, and then asks the lexer for the next token, etc. The pseudo-code of the lexer therefore contains only one loop, the one for accepting one word.

DFA implementation: explicit state representation

state := 1  { start }
while state = 1 or 2
do
  case state of
    1: case input character of
         letter: advance the input;
                 state := 2
         else state := ...  { error or other };
       end case;
    2: case input character of
         letter, digit: advance the input;
                        state := 2;  { actually unnecessary }
         else state := 3;
       end case;
  end case;
end while;
if state = 3 then accept else error;


The state is here represented by an integer variable. The reaction of the automaton is a nested case switch, first on which state one is in, and secondly on which input character is seen. One could of course also do the "case nesting" the other way around, or make one flat case switch, with all combinations of state and input on the same level. We also see that the "error" state of the complete DFA is represented in some form: at any rate, there is an else-case if the previous case(s) don't match. In the following slides, we show how the decision information can be "centralized" in one table. In the table, empty slots represent missing reactions, i.e., the move to an error state.

Table rep. of the DFA

state \ input char | letter | digit | other | accepting
1                  |   2    |       |       |    no
2                  |   2    |   2   |  [3]  |    no
3                  |        |       |       |    yes

added info for

• accepting or not
• "non-advancing" transitions
  – here: 3 can be reached from 2 via such a transition

Table-based implementation

state := 1  { start }
ch := next input character;
while not Accept[state] and not error(state)
do
  newstate := T[state, ch];
  { if Advance[state, ch] then ch := next input character };
  state := newstate
end while;
if Accept[state] then accept;
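To make the table-driven idea concrete, here is a small runnable Python sketch for the identifier DFA of the table above (states 1, 2, 3). The particular data structures (dictionaries for T, Advance, and Accept) and the helper kind are choices made only for this illustration.

# Table-driven scanner sketch for the identifier DFA from the table above.
def kind(ch):
    if ch.isalpha(): return "letter"
    if ch.isdigit(): return "digit"
    return "other"

T       = {(1, "letter"): 2, (2, "letter"): 2, (2, "digit"): 2, (2, "other"): 3}
Advance = {(1, "letter"): True, (2, "letter"): True, (2, "digit"): True,
           (2, "other"): False}      # the [other] transition does not advance
Accept  = {1: False, 2: False, 3: True}

def scan_identifier(text, pos):
    state, start = 1, pos
    while not Accept[state]:
        ch = kind(text[pos]) if pos < len(text) else "other"
        if (state, ch) not in T:     # empty slot in the table = error state
            raise ValueError("lexical error at position %d" % pos)
        if Advance[(state, ch)]:
            pos += 1
        state = T[(state, ch)]
    return text[start:pos], pos      # lexeme, and the position to resume from

print(scan_identifier("x1 = 5", 0))  # ('x1', 2)

In a real lexer this function would be wrapped by the "on demand" interaction with the parser described above, returning one token (token class plus lexeme) per invocation.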

2.5 NFA

Non-deterministic FSA

Actually, we have already covered non-deterministic finite-state automata in Definition 2.3.1 (where we called them just FSAs). Here we kind of repeat the definition, with δ slightly differently, but equivalently, represented. What we add, however, are so-called ε-transitions, which allow the machine to move to a new state without eating a letter. That is a form of "spontaneous" move, not being triggered by the input, which renders the automaton non-deterministic. It will turn out that adding this kind of transition does not matter as far as the expressiveness of NFAs is concerned. Why do we then bother adding them? Well, ε-transitions come in handy in some situations, in particular in the construction we will present afterwards: how to turn a regular expression into an NFA. It's slightly more convenient when one allows such transitions. It's easy to understand why. As a preview of that construction: it will be a compositional construction. To construct the automaton for a compound regular expression, for instance for the sequential composition r1r2, one assumes one has the automata for the component regular expressions r1 and r2, and then one glues them together with ε-transitions, i.e., connects the accepting states of r1 with the initial states of r2. That's pretty easy; the use of those transitions facilitates a straightforward, compositional construction.

Definition 2.5.1 (NFA (with ε-transitions)). A non-deterministic finite-state automaton (NFA for short) A over an alphabet Σ is a tuple (Σ, Q, I, F, δ), where

• Q: finite set of states
• I ⊆ Q, F ⊆ Q: initial and final states
• δ : Q × Σ → 2^Q: transition function

In case one uses the alphabet Σ + {ε}, one speaks about an NFA with ε-transitions.

• in the following: NFA mostly means allowing ε-transitions
• ε: treated differently from the "normal" letters from Σ
• δ can equivalently be interpreted as a relation: δ ⊆ Q × Σ × Q (transition relation labelled by elements from Σ)

The version of NFA presented here includes ε-transitions. Depending on the source, the notion of NFA may or may not include such transitions. It does not matter anyhow, as far as expressiveness is concerned.

Finite state machines

Remark 7 (Terminology (finite-state automata)). There are slight variations in the definition of (deterministic resp. non-deterministic) finite-state automata. For instance, some definitions for non-deterministic automata might not use ε-transitions, i.e., they are defined over Σ, not over Σ + {ε}. Another name for FSAs is finite-state machines. Chapter 2 in [9] builds ε-transitions into the definition of NFA, whereas in Definition 2.5.1, we mention that the NFA is not just non-deterministic, but "also" allows those specific transitions. Of course, ε-transitions lead to non-determinism as well, in that they correspond to "spontaneous" transitions, not triggered and determined by the input. Thus, in the presence of ε-transitions, and starting at a given state, a fixed input may not determine in which state the automaton ends up.

Deterministic or non-deterministic FSAs (and many, many variations and extensions thereof) are widely used, not only for scanning. When discussing scanning, ε-transitions come in handy when translating regular expressions to FSAs; that's why [9] directly builds them in.


Language of an NFA

• remember L(A) (Definition 2.3.3 on page 51)
• applying the definition directly to Σ + {ε}: accepting words "containing" letters ε
• as said: special treatment for ε-transitions/ε-"letters". ε rather represents the absence of an input character/letter.

Definition 2.5.2 (Acceptance with ε-transitions). A word w over alphabet Σ is accepted by an NFA with ε-transitions, if there exists a word w′ which is accepted by the NFA with alphabet Σ + {ε} according to Definition 2.3.3 and where w is w′ with all occurrences of ε removed.

Alternative (but equivalent) intuition

A reads one character after the other (following its transition relation). If in a state with an outgoing ε-transition, A can move to a corresponding successor state without reading an input symbol.

NFA vs. DFA

• NFA: often easier (and smaller) to write down, esp. starting from a regular expression
• non-determinism: not immediately transferable to an algo

[Figure: an NFA with ε-transitions over a and b, and a corresponding DFA accepting the same language]

The example is used as an illustration of an NFA and a corresponding DFA. In this small example, it's straightforward to come up with a deterministic version of the automaton. In a later section, we discuss a systematic way of turning an NFA into a DFA, i.e., an algorithm.

2.6 From regular expressions to NFAs (Thompson’sconstruction)

Before showing the construction itself, we show a few examples, highlighting some regular expressions and corresponding NFAs.


Why non-deterministic FSA?

Task: recognize :=, <=, and = as three different tokens:

[Figure: three separate automata recognizing the tokens := (return ASSIGN), <= (return LE), and = (return EQ)]

FSA (1-2)

[Figure: one automaton recognizing :=, <=, and = by branching on the first character :, <, or = from a shared initial state, returning ASSIGN, LE, or EQ]

What about the following 3 tokens?

[Figure: three separate automata for <= (return LE), <> (return NE), and < (return LT)]

Non-det FSA (2-2)

[Figure: combining the three automata for <=, <>, and < naively yields a non-deterministic automaton — after reading <, it is not determined which branch to take]

Non-det FSA (2-3)

[Figure: a deterministic variant using a non-advancing [other] transition — after <, the next character =, >, or other decides between LE, NE, and LT]

Regular expressions → NFA

• needed: a systematic translation (= algo, best an efficient one)
• conceptually easiest: translate to NFA (with ε-transitions)
  – postpone determinization for a second step
  – (postpone minimization for later, as well)

Compositional construction [11]

Design goal: The NFA of a compound regular expression is given by taking the NFAs of the immediate subexpressions and connecting them appropriately.


Compositionality

• construction slightly³ simpler, if one uses automata with one start and one accepting state
⇒ ample use of ε-transitions

Compositionality

Remark 8 (Compositionality). Compositional concepts (definitions, constructions, analyses, translations, . . . ) are immensely important and pervasive in compiler techniques (and beyond). One example already encountered was the definition of the language of a regular expression (see Definition 2.2.4 on page 39). The design goal of a compositional translation here is the underlying reason to base the construction on non-deterministic machines.

Compositionality is also of practical importance ("component-based software"). In connection with compilers, separate compilation and (static/dynamic) linking (i.e. "composing") of separately compiled "units" of code is a crucial feature of modern programming languages/compilers. Separately compilable units may vary; sometimes they are called modules or similar. Part of the success of C was its support for separate compilation (and tools like make that help organize the (re-)compilation process). For fairness' sake, C was by far not the first major language supporting separate compilation; for instance FORTRAN II allowed that as well, back in 1958.

Btw., Ken Thompson, the guy who first described the regexpr-to-NFA construction discussed here, is one of the key figures behind the UNIX operating system and thus also the C language (both went hand in hand). Not surprisingly, considering the material of this section, he is also the author of the grep tool ("globally search a regular expression and print"). He got the Turing award (and many other honors) for his contributions.

Illustration for ε-transitions

[Figure: the automata for := (return ASSIGN), <= (return LE), and = (return EQ) combined into one NFA by ε-transitions from a new common initial state]

³ It does not matter much, though.


Thompson’s construction: basic expressions

basic (= non-composed) regular expressions: ε, ∅, a (for all a ∈ Σ)

[Figure: the basic automata — a two-state automaton with a single ε-transition for ε, and a two-state automaton with a single a-transition for a]

The ∅ is slightly odd: it's sometimes not part of regular expressions. We can see it as represented by the empty automaton (which has no states and which therefore is not drawn pictorially). If it's lacking, then one cannot express the empty language, obviously. That's not nice, because then the regular languages are not closed under complement. Also: obviously, there exists an automaton with an empty language. Therefore, ∅ should be part of the regular expressions, even if practically it does not play much of a role.

The representation of ∅ as the empty automaton is ok. If we do that, however, it's not the case that in Thompson's construction all automata have one start and one final state; the empty automaton would be an exception. It's not so obvious how to represent the empty language with an NFA with one initial and one accepting state (if one wants that invariant throughout). Note that the automaton with one state which is, at the same time, initial and accepting does not work: that accepts the language {ε}. An automaton without accepting states surely represents the empty language {}, but it violates the planned invariant that there is exactly one final state. Trying two states, one the initial one and one the accepting one, but not connected by an edge, does not work in the overall construction: if one constructed the automaton for a∅b compositionally, there would be no connection from the initial to the final state.

Therefore, the invariant that the construction maintains exactly one initial and one final state applies to all situations except the automaton for ∅. That also means: the construction of compound automata illustrated in the following is presented by referring to "the" unique initial state of one or of two involved automata, and "the" unique accepting state of each of them. To cover also the case of ∅, one would have to be more precise: connect the initial resp. final states, if they exist, in the following way . . . (as shown in the pictures).

If that may seem overly nitpicking: if one wants to implement the algo (perhaps as part of lex), and if one supports ∅ as syntax, one better takes care of all cases, including the corner cases, even if they seem not so relevant in practice — I mean, who actually has the need to use ∅ in a lexer? However, ignoring them, assuming the invariant that there is always one initial state and one final state, may derail the program, for instance leading to an uncaught nil-pointer exception (if one had stored the states using references). An alternative would be not to support ∅, which is perfectly fine in practice, but then the "theoretician" will complain: there are then automata for which one cannot write a regular expression. So the proof of equivalence between automata and regular expressions has a corner case which does not work. . .

Thompson’s construction: compound expressions

In the pictures, by convention, the state on the left is the unique initial one, and the state on the right is the unique accepting one (if they exist). By building the larger automaton, the "status" of the initial and final states may change, of course. For instance, in the case of |: a new initial state and a new accepting state are introduced for the compound automaton, and the initial and final states of the two component automata lose their special status, of course.

[Figure: compound constructions — for the sequence r s, the automata for r and s connected by an ε-transition from the accepting state of r to the initial state of s; for the alternative r | s, a new initial and a new accepting state connected to both component automata by ε-transitions]

Thompson’s construction: compound expressions: iteration

[Figure: iteration r∗ — a new initial and a new accepting state, with ε-transitions into and out of the automaton for r, an ε-transition looping back, and an ε-transition bypassing it]
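The construction can be written down quite compactly. The following Python sketch mirrors the pictures above: every case returns an automaton with exactly one initial and one accepting state, glued together with ε-transitions. The tuple encoding of regular expressions and of NFAs is an assumption made only for this illustration (and, as discussed above, the corner case ∅ is simply not supported here).

# Sketch of Thompson's construction.  Regular expressions are nested tuples:
# ("eps",), ("lit", a), ("seq", r, s), ("alt", r, s), ("star", r).
# An NFA is (start, accept, transitions); a transition is (state, label, state),
# where label None stands for an epsilon-transition.
import itertools
fresh = itertools.count()

def thompson(r):
    if r[0] == "eps":
        s, f = next(fresh), next(fresh)
        return s, f, [(s, None, f)]
    if r[0] == "lit":
        s, f = next(fresh), next(fresh)
        return s, f, [(s, r[1], f)]
    if r[0] == "seq":                      # accepting state of r glued to start of s
        s1, f1, t1 = thompson(r[1]); s2, f2, t2 = thompson(r[2])
        return s1, f2, t1 + t2 + [(f1, None, s2)]
    if r[0] == "alt":                      # new start/accept around both parts
        s1, f1, t1 = thompson(r[1]); s2, f2, t2 = thompson(r[2])
        s, f = next(fresh), next(fresh)
        return s, f, t1 + t2 + [(s, None, s1), (s, None, s2),
                                (f1, None, f), (f2, None, f)]
    if r[0] == "star":                     # loop back and bypass via epsilon
        s1, f1, t1 = thompson(r[1])
        s, f = next(fresh), next(fresh)
        return s, f, t1 + [(s, None, s1), (f1, None, f),
                           (f1, None, s1), (s, None, f)]
    raise ValueError("unsupported regular expression")

# The example that follows below:  ab | a
nfa = thompson(("alt", ("seq", ("lit", "a"), ("lit", "b")), ("lit", "a")))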


Example: ab | a

Intro

Here is a small example illustrating the construction. In the exercises, there will be more.

[Figure: Thompson construction for ab | a — the automata for a and b (states 1–8), connected by ε-transitions into the automaton for ab | a as prescribed by the construction]

2.7 Determinization

Determinization: the subset construction

Main idea

• Given a non-deterministic automaton A. To construct a DFA Ā: instead of backtracking, explore all successors "at the same time" ⇒
• each state q′ in Ā represents a subset of states from A
• given a word w: "feeding" it to Ā leads to the state representing all states of A reachable via w

• powerset construction
• origin of the construction: Rabin and Scott [10]

The construction, also known as the powerset construction, seems straightforward enough. Analogous constructions work for some other kinds of automata as well, but for still others, the approach does not work: for some forms of automata, the non-deterministic versions are strictly more expressive than the deterministic ones, for instance for some automata working on languages of infinite words, not finite words as here.


Some notation/definitions

Definition 2.7.1 (ε-closure, a-successors). Given a state q, the ε-closure of q, written closeε(q), is the set of states reachable via zero, one, or more ε-transitions. We write qa for the set of states reachable from q with one a-transition. Both definitions are used analogously for sets of states.

We often call single states q, and sets of states Q. So the notation for the ε-closure of a set Q of states is closeε(Q), and Qa represents the a-successors of Q.

Remark 9 (ε-closure). Louden [9] does not sketch an algorithm, but it should be clear that the ε-closure is easily implementable for a given state, resp. a given finite set of states. Some textbooks also write λ instead of ε, and consequently speak of λ-closure. And in still other contexts (mainly not in language theory and recognizers), silent transitions are marked with τ.

It may also be worth remarking: later, when it comes to parsing, we will encounter the phenomenon again: some steps in treating symbols from a context-free grammar will be done without "eating" symbols (for parsing, those symbols will be called "terminals" or "terminal symbols" and correspond to tokens). Consequently, in the context of parsing and "parsing automata" (which are supposed to be deterministic as well), we will likewise encounter the notion of an ε-closure, analogous to the concept here.
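Indeed, computing the ε-closure of a (finite) set of states amounts to a simple reachability computation along ε-edges. A Python sketch could look as follows; the representation of δ (a dictionary mapping (state, label) to sets of states, with label None standing for ε) is an assumption of this illustration.

# epsilon-closure of a set of states, as plain graph reachability.
def close_eps(states, delta):
    result, work = set(states), list(states)
    while work:
        q = work.pop()
        for q2 in delta.get((q, None), ()):   # follow epsilon-edges only
            if q2 not in result:
                result.add(q2)
                work.append(q2)
    return result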

Transformation process: sketch of the algo

Input: NFA A over a given Σ

Output: DFA Ā

1. the initial state: closeε(I), where I are the initial states of A
2. for a state Q in Ā: the a-successor of Q is given by closeε(Qa), i.e.,

   Q −a→ closeε(Qa)        (2.18)

3. repeat step 2 for all states in Ā and all a ∈ Σ, until no more states are being added
4. the accepting states in Ā: those containing at least one accepting state of A

Note: the book by Cooper and Torczon [6] uses a slightly more "concrete" formulation using a work-list. We will encounter work-list algos also elsewhere in the lecture (for instance, later in this chapter in the minimization of automata, and also for liveness analysis at the end of the lecture), though we don't go into details. Very abstractly, a work-list is a data structure (like a list, more generally a collection) where one keeps "work still to be done". One works through the list, removing one piece after the other. It's characteristic for work-list algorithms that one not only removes pieces of work; doing some work may add other work, or re-add a piece of work done already. Often that is connected to the traversal of graphs (which may contain cycles). A piece of work is "treating" a node of the graph (or an edge, depending on how it's organized). The node then is removed as "done". However, perhaps the neighboring nodes are added, to be done. And since there are cycles, the node we just removed may be re-added later. So, when connected to graphs like that, work-list algorithms are connected with traversals of the graph. And the "list" may actually be a stack or a queue, or use some other priorities, and that influences the traversal strategy (depth-first vs. breadth-first vs. other strategies). But as said, we don't dig deeper into that kind of algos, neither here nor in the examples later.
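Putting the ε-closure and the four steps above together, a worklist formulation of the subset construction might look like the following Python sketch. The NFA representation is the same assumed one as in the ε-closure sketch above; everything else follows steps 1–4.

# Subset construction (determinization), worklist style.
# delta: dict (state, label) -> set of states, label None meaning epsilon;
# I: initial states, F: accepting states of the NFA.
def determinize(delta, I, F, alphabet):
    def close_eps(states):
        result, work = set(states), list(states)
        while work:
            q = work.pop()
            for q2 in delta.get((q, None), ()):
                if q2 not in result:
                    result.add(q2); work.append(q2)
        return frozenset(result)

    start = close_eps(I)                       # step 1
    dfa_delta, accepting = {}, set()
    work, seen = [start], {start}
    while work:                                # steps 2 and 3
        Q = work.pop()
        if Q & set(F):                         # step 4
            accepting.add(Q)
        for a in alphabet:
            Qa = {q2 for q in Q for q2 in delta.get((q, a), ())}
            succ = close_eps(Qa)
            if not succ:                       # empty set = the omitted error state
                continue
            dfa_delta[(Q, a)] = succ
            if succ not in seen:
                seen.add(succ); work.append(succ)
    return start, dfa_delta, accepting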

Next we show a few examples. More are covered by the exercises. In the figures, we show the resulting deterministic automata. However, we don't show the complete or total version, i.e., the extra state sometimes needed to obtain a total successor function is not shown. That state can be seen as being "marked" with the empty set {}.

Example ab | a

[Figure: the NFA for ab | a from the Thompson construction (states 1–8), and the DFA obtained by the subset construction with states {1, 2, 6}, {3, 4, 7, 8} (reached via a), and {5, 8} (reached via b)]

Example: identifiers

Remember: the regular expression for identifiers from equation (2.13).

[Figure: NFA for identifiers (states 1–10), obtained via the Thompson construction from letter(letter | digit)∗]

[Figure 2.2: DFA for identifiers — states {1}, {2, 3, 4, 5, 7, 10}, {4, 5, 6, 7, 9, 10}, and {4, 5, 7, 8, 9, 10}, with letter and digit transitions between them]

Identifiers: DFA

2.8 Minimization

This is the last stage of the construction: minimizing a DFA. It should be clear why that is useful: fewer states means a more compact representation (but perhaps not necessarily a speed-up in lexing). Minimal means: with the least number of states. It's clear that there exists an automaton with the least number of states. But what is perhaps more surprising: there exists exactly one automaton with the minimal number of states. A priori, there might be two different automata with a minimal number of states, but that is not the case. Of course, being the same automaton means up to isomorphism. Isomorphic means "structurally identical"; basically it means the "names" of the states don't matter, but otherwise the automata are the same; there is a one-to-one correspondence between the states of both automata that is honored by the transitions (and also by the initial and final state sets).

We learn the algorithm that systematically calculates the minimal DFA from a given DFA. Previously, we downplayed the question whether a DFA is complete or not, because if not complete, one can easily think of it as complete, assuming an extra error state. In the minimization here, it's important to indeed have a complete deterministic automaton: all states participate in the construction, including an extra error state that may need to be added to complete the DFA.

The one presented here is only one of several ways to achieve the goal. It's known as Hopcroft's partition refinement algorithm.


Minimization

• automatic construction of a DFA (via e.g. Thompson): often many superfluous states
• goal: "combine" states of a DFA without changing the accepted language

Properties of the minimization algo

Canonicity: all DFA for the same language are transformed to the same DFA

Minimality: resulting DFA has minimal number of states

Remarks

• "side effects": answers two equivalence problems
  – given 2 DFAs: do they accept the same language?
  – given 2 regular expressions: do they describe the same language?
• modern version: Hopcroft [7].

Hopcroft’s partition refinement algo for minimization

• starting point: complete DFA (i.e., an error state may be needed)
• first idea: equivalent states in the given DFA may be identified
• equivalent: when used as starting point, accepting the same language
• partition refinement:
  – works "the other way around"
  – instead of collapsing equivalent states:
    ∗ start by "collapsing as much as possible" and then,
    ∗ iteratively, detect non-equivalent states, and then split a "collapsed" state
    ∗ stop when no violations of "equivalence" are detected
• the algorithm maintains a partitioning of the set (of states)
• worklist: data structure to keep the non-treated classes; termination when the worklist is empty

The algorithm will be explained to some extent in the following. The slides state that it works "the other way around". I would expect that, when tasked with the problem of minimizing a given DFA, most would try to approach the problem the following way. One would look at the automaton and find situations where one can save some state. That's a natural way of thinking, also when one tries concrete examples with pen and paper. For instance, one could look at the DFA for identifiers from Figure 2.2. It has three accepting states, but it does not take long to realize that one is enough. One can "merge" those states because they "do the same". By "doing the same" it's meant that both accept the same language when one starts in them for accepting words. That language can be described by the regular expression (letter | digit)∗. Collapsing the two (or three) states into one makes the automaton smaller without changing the accepted language. That could be the core step of an algorithm: hunt for a pair of equivalent states, collapse them, then hunt for another pair, and continue until only non-equivalent states remain, and then stop.


That's a valid idea. One would have to check some aspects before being sure it works. Termination is obvious. Another issue would be: is it important in which order to do the collapsing? It is not a priori clear whether the algo would be independent of the strategy of which pairs to collapse first. It could be that by choosing "wrongly", one gets a smaller automaton but somehow gets stuck before reaching the really minimal one. That would be an unpleasant property of the approach (and would lead to backtracking). One also would have to solve the problem of checking when two states are equivalent (and that might be computationally complex).

But, as said, it's a valid idea (and I am rather sure the approach would be independent of the order of collapsing). The algorithm we describe below works the other way around, in that it's not based on merging equivalent states, but starts out with a "collapsed automaton", where all states are collapsed, and then splits them repeatedly (based on a criterion described later). The algo is not very obvious. In the merge-based, naive one, one starts with a DFA, and in each step it gets smaller, but the algo maintains as an invariant that all the intermediately constructed DFAs accept the same language as the original. Thereby it's clear that the result is likewise equivalent. And since we stop when there are no more equivalent states left to merge, it's also plausible that the result is minimal.

How is the situation in Hopcroft's algo, which works the other way around? Well, we start with a collapsed automaton. Being basically fully collapsed (with 2 states only), it will generally not be equivalent to the DFA; it accepts a different (larger) language. It will also be smaller than the minimal one. The algorithm proceeds by splitting collapsed states as long as the splitting criterion is fulfilled. Note that all the intermediate DFAs are non-equivalent to the targeted DFA and all of them are smaller than the minimal one. Once the splitting criterion is no longer satisfied, one stops, and one has reached the first automaton in this process which, surprise, surprise, is equivalent to the targeted one, and at the same time is the minimal one.

That is the high-level idea of the algorithm. We have not yet intuitively explained the splitting criterion. We will do that after we have had a look at a "pseudo-code" description of the partition refinement algo.

Partition refinement: a bit more concrete

• Initial partitioning: 2 partitions: the set containing all accepting states F, and the set containing all non-accepting states Q \ F
• Loop: do the following: pick a current equivalence class Qi and a symbol a
  – if for all q ∈ Qi, δ(q, a) is a member of the same class Qj ⇒ consider Qi as done (for now)
  – else:
    ∗ split Qi into Q_i^1, . . . , Q_i^k s.t. the above situation is repaired for each Q_i^l (but don't split more than necessary)
    ∗ be aware: a split may have a "cascading effect": other classes that were fine before the split of Qi need to be reconsidered ⇒ worklist algo
• stop if the situation stabilizes, i.e., no more splits happen (= worklist empty, at the latest when back to the original DFA)


The initialization, as mentioned before, starts with an (almost completely) collapsed automaton. It's not totally collapsed to a one-state representation, but consists of 2 states, no matter how big the original automaton is.

The algo speaks about partitions and operates by refining them. A partition is a technical term about sets: it is a splitting-up of a set into different (non-empty) subsets in such a way that each element of the original set is in exactly one of the subsets, and the union of all subsets is the original set. Alternatively (and equivalently), a partition of a set can be seen as an equivalence relation on the set (an equivalence relation being a binary relation which is reflexive, transitive, and symmetric). We won't dig into mathematical depth here, so let's just illustrate it with a small example. Assume a five-element set

A = {1, 2, 3, 4, 5} .

We can partition it into two subsets

{{1, 2, 3}, {4, 5}}

Let’s call the two subsets A1 and A2.

Equivalently, one can see that partition as considering 1, 2, and 3 as "equivalent", and likewise 4 and 5. In other words, the partition corresponds to an equivalence relation. If one likes to spell out the equivalence relation ∼ ⊆ A × A in full detail as a set of pairs, it would be

∼ = {(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2), (3, 3), (4, 4), (4, 5), (5, 4), (5, 5)}

which corresponds to A_1^2 + A_2^2. Both views are interchangeable. Seen as an equivalence relation, one can also view the algorithm as refining an equivalence relation instead of a partitioning. Remember, when discussing the naive "merging approach", we merged equivalent states. So, also there, we were working with an equivalence relation; what was meant there was semantic language equivalence. Two states are equivalent if they accept the same language when starting acceptance runs from them.

Of course, during the run of the algorithm, the equivalence relation that corresponds to the current partition is not yet semantic language equivalence; it's a more coarse-grained equivalence relation, considering states currently as equivalent (grouped together in the same subset of the partition) when in fact, semantically, they are not equivalent. When the algo stops, though, the equivalence relation coincides with the intended language equivalence.

Here, we are working with partitions of the set of states of the given DFA, and we start with a partition consisting of two subsets: the set of states is split into two parts, the accepting states and the non-accepting states. The algorithm works in one direction only: namely by taking a subset, i.e., one element of the current partition of Q, and splitting it, if needed. The partition gets more fine-grained with each iteration, until no more splitting can be done.


If one looks at some partition during the run of the algo, one can conceptually interpret the partition as an automaton: each subset of the partition forms some "meta-state" consisting of a set of states, and there are transitions between those meta-states in the obvious way. In this way, the algo not just steps through a sequence of partitions that it refines, but, at least conceptually, through a sequence of automata. This is a way of "thinking" about the run of the algo; the algo itself does not explicitly construct sequences of automata, it works on a sequence of partitions that gets more and more fine-grained during the run.

However, thinking in terms of intermediate automata helps to interpret the splitting condition: when (and how) should the algo split a meta-state, and when can it stop. As mentioned earlier, starting from the initial 2-state automaton, the intermediate automata are generally smaller than the minimal one, and they accept a language different from the one of the target automaton (a larger language, actually). There is a third aspect, not mentioned so far: at an intermediate stage, the automaton with the meta-states is generally non-deterministic. It's clear that if one takes a DFA and collapses some states into one meta-state, the result will in general no longer be deterministic. That is also the splitting condition. The algo looks at meta-states (i.e., subsets in the current partition), and if a meta-state violates the requirement that it should be deterministic, then it splits it. Actually, the algo checks whether or not a meta-state is deterministic per symbol, i.e., the algo checks whether some meta-state Q and a symbol a behave deterministically or not.

If the meta-state behaves non-deterministically, we have to repair that, and that's done by splitting that meta-state, so that the resulting parts behave deterministically (with respect to that symbol a). However, we split only as much as we need to repair the non-determinism violation, but not more. For instance, one does not simply "atomize" the meta-state into its individual original states. Those would surely behave deterministically, as the starting point was a DFA, but this way we won't get the minimal automaton in general, as we would do more splits than actually necessary.

So far so good. Of course, one needs to treat more than just a; it may be necessary, after splitting a meta-state wrt. a, that one needs to split the result further wrt. b. That's clear, so let's not talk about that; let's focus on one symbol. More interesting is the question: after having split one meta-state in the way described, making the fragments deterministic, am I done with the fragments, or will I have to split them further? The answer is: doing it one time may not be enough. The reason is as follows. Splitting a meta-state in the way described may have a rippling effect on other meta-states. For instance, if one has a situation like

Q −a→ Q′

and the meta-state Q′ happens to be split into, say, two refined meta-states Q′1 and Q′2, then the predecessor state Q suddenly has 2 outgoing a-transitions, even if we assume that sometime earlier, Q was the result of some splitting step, making it deterministic at that earlier point. That means splitting a state may have the effect that other states have to be split again; that is the mentioned rippling effect.

A good way to organize the splitting task is to put all the current meta-states that have not been checked for whether they need a split into a work list. It may not technically be a list; it could be a queue or a stack, or in general a collection data type, but the algo would still be called a worklist algorithm. Anyway, with this data structure, one can take one piece of work, a current meta-state, out of the work-list, split it if necessary, remove that piece of work, but (re-)add predecessor states, as they need to be re-checked and re-treated.

Partition refinement vs. merging equivalent states We started earlier by claiming that a naive approach would probably try to merge equivalent states starting from the given DFA (which would be a "partition coarsening"), as that seems more obvious. Now, why is the partition refinement algo intuitively a better idea (without going into algorithmic complexity considerations)?

In a way, the two approaches (refinement vs. coarsening) look pretty similar. One merges states resp. splits states until no more merging resp. splitting is necessary, and then stops. It's also not easy to say which is the shorter route, i.e., which approach needs on average the least number of iterations (perhaps in the special case where the automaton comes via Thompson's construction and determinization).

There is a significant difference, though, and that's the condition to decide when to stop (resp. whether merging resp. splitting is still necessary). In Hopcroft's refinement approach, the check is local. The condition concerns only the single edges originating in a (meta-)state. If they violate the determinism requirement: then split, otherwise not.

The condition in the merging approach is not local. It requires checking whether two states accept the same language. That cannot be checked by looking one step ahead at the outgoing edges. It involves checking all reachable states and is a much more complicated condition. Perhaps some memoization (remembering and caching (partial) earlier checks) can help a bit, but Hopcroft's partition refinement seems not only more clever, it also looks superior.

Split in partition refinement: basic step

[Figure: states q1, . . . , q6 in one class, with their a-successors falling into different classes]

• before the split {q1, q2, . . . , q6}


• after the split on a: {q1, q2}, {q3, q4, q5}, {q6}

Note: The picture shows only one letter a; in general one has to do the same construction for all letters of the alphabet.
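To make the splitting step concrete, here is a small Python sketch of partition refinement. It is the simple variant without the worklist optimization: in each round, every class is split according to the classes its members' a-successors fall into, until nothing changes (cascading effects are handled simply by iterating again). The dictionary representation of the complete DFA and the example are assumptions of this illustration, not the formulation of Hopcroft [7]; also, unreachable states are not removed.

# Partition refinement for DFA minimization (simple, non-worklist variant).
# delta: dict (state, a) -> state for a *complete* DFA.
def minimize(states, alphabet, delta, accepting):
    partition = [frozenset(accepting), frozenset(set(states) - set(accepting))]
    partition = [block for block in partition if block]

    def block_of(q):                      # index of the class containing q
        return next(i for i, b in enumerate(partition) if q in b)

    changed = True
    while changed:
        changed = False
        refined = []
        for block in partition:
            groups = {}                   # group states by successor-class signature
            for q in block:
                sig = tuple(block_of(delta[(q, a)]) for a in alphabet)
                groups.setdefault(sig, set()).add(q)
            if len(groups) > 1:
                changed = True            # this class had to be split
            refined.extend(frozenset(g) for g in groups.values())
        partition = refined
    return partition

# A rendering of example (2.19) below, (a | eps) b*, with the error state added:
delta = {(1, "a"): 2, (1, "b"): 3, (2, "a"): "err", (2, "b"): 3,
         (3, "a"): "err", (3, "b"): 3, ("err", "a"): "err", ("err", "b"): "err"}
print(minimize([1, 2, 3, "err"], ["a", "b"], delta, {1, 2, 3}))
# yields the classes {1}, {2, 3}, {err} -- matching the end result shown below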

Examples

The following examples are shown as overlays in the slides. The unfolding of the overlays is not done for the script version here.

Completed automaton

[Figure: the identifier DFA from Figure 2.2, completed with an explicit error state to which all missing letter/digit transitions lead]

Minimized automaton (error state omitted)

[Figure: the minimized identifier DFA — a letter transition from start to in_id, and letter/digit self-loops on in_id]


Another example: partition refinement & error state

(a | ε)b∗ (2.19)

[Figure: DFA for (a | ε)b∗ with states 1, 2, and 3 — a from 1 to 2, b from 1 and from 2 to 3, and a b self-loop on 3]

Partition refinement

[Figure, in three overlays: the same DFA with an error state added, the initial partitioning into accepting and non-accepting states, and the split after considering a, separating state 1 from states 2 and 3]

End result (error state omitted again)

[Figure: the resulting minimal DFA with states {1} and {2, 3} — a and b from {1} to {2, 3}, and a b self-loop on {2, 3}]


2.9 Scanner implementations and scanner generation tools

This last section contains only rather superficial remarks concerning how to implement a scanner or lexer. A few more details can be found in [6, Section 2.5]. The oblig will include the implementation of a lexer/scanner.

Tools for generating scanners

• scanners: simple and well-understood part of a compiler
• hand-coding possible
• mostly better off with: a generated scanner
• standard tools lex / flex (also in combination with parser generators, like yacc / bison)
• variants exist for many implementation languages
• based on the results of this section

Main idea of (f)lex and similar

• output of lexer/scanner = input for parser
• the programmer specifies regular expressions for each token class and corresponding actions (and whitespace, comments etc.)
• the spec language offers some conveniences (extended regexprs with priorities, associativities etc.) to ease the task
• automatically translated to an NFA (e.g. Thompson)
• then made into a deterministic DFA ("subset construction")
• minimized (with a little care to keep the token classes separate)
• implement the DFA (usually with the help of a table representation)

Tokens and actions of a parser will be covered later. For example, identifiers and digits as described by the regular expressions would end up in two different token classes, with the actual string of characters (also known as the lexeme) being the value of the token attribute.

Sample flex file (excerpt)

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
    printf("An integer: %s (%d)\n", yytext,
           atoi(yytext));
}

{DIGIT}+"."{DIGIT}* {
    printf("A float: %s (%g)\n", yytext,
           atof(yytext));
}

if|then|begin|end|procedure|function {
    printf("A keyword: %s\n", yytext);
}
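For orientation (not something prescribed by the course material): from such a specification, flex generates a C file, by default called lex.yy.c, containing the tables and the driver loop discussed in this chapter. A typical invocation, with a hypothetical file name scanner.l, would be something like

flex scanner.l
cc lex.yy.c -lfl -o scanner

where linking against the flex library (-lfl) typically supplies a default main function, so the resulting program simply reads from standard input and runs the actions of the matched rules.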


3 Grammars

What is it about? Learning targets of this chapter:
1. (context-free) grammars + BNF
2. ambiguity and other properties
3. terminology: tokens, lexemes
4. different trees connected to grammars/parsing
5. derivations, sentential forms

The chapter corresponds to [6, Section 3.1–3.2] (or [9, Chapter 3]).

Contents: 3.1 Introduction – 3.2 Context-free grammars and BNF notation – 3.3 Ambiguity – 3.4 Syntax of a “Tiny” language – 3.5 Chomsky hierarchy

3.1 Introduction

The compiler phase after the lexer is the parser. In the lecture, that phase is treated in two chapters: one that covers the underlying concepts, namely context-free grammars, and one that deals with the parsing process itself. Context-free grammars, resp. notations for context-free grammars, play the same role for parsing that regular expressions played for lexing. There are grammars other than context-free grammars; later we will at least mention the so-called Chomsky hierarchy, the most well-known classification of language description formalisms. Context-free languages correspond to one level there, and regular languages to another one, actually the simplest level; regular languages can be seen as a restricted form of context-free languages.

Context-free grammars are probably the most well-known example of grammars, so when speaking simply about “a grammar”, one often just means a context-free grammar, though there are other types as well, as said.

Context-free grammars specify the syntax of a language, as opposed to regular expressions, which specify the lexical aspects of the language. That’s basically by convention: the syntax of the language refers to those aspects that can be captured by a context-free grammar.

When it comes to parsing, one typically does not make use of the full power of context-free grammars; one restricts oneself to special, limited forms of them, for practical reasons. We come to that in the parsing chapter. One restriction one wants to impose in parsing will already be discussed in this chapter, namely that one does not want the grammar to be ambiguous. Ambiguous grammars are not useful in parsing, as we will discuss.


Bird’s eye view of a parser

[Diagram: a sequence of tokens enters the Parser, which produces a tree representation.]

• check that the token sequence corresponds to a syntactically correct program
  – if yes: yield tree as intermediate representation for subsequent phases
  – if not: give understandable error message(s)
• we will encounter various kinds of trees
  – derivation trees (derivation in a (context-free) grammar)
  – parse tree, concrete syntax tree
  – abstract syntax trees
• the mentioned tree forms hang together, the dividing line is a bit fuzzy
• result of a parser: typically an AST

(Context-free) grammars

• specifies the syntactic structure of a language
• here: grammar means CFG
• G derives word w

Parsing

Given a stream of “symbols” w and a grammar G, find a derivation from G that produces w.

Parsing is concerned with context-free grammars. Often, one will not try to use the full power of context-free grammars, but make some restrictions. At the very least, one insists on the grammar being non-ambiguous. We come to the important notion of ambiguity of context-free grammars (and of context-free languages) later. Seen generally, there are different classes of grammars, some more restrictive than context-free grammars, some more expressive. Actually, regular languages correspond to a restricted form of context-free languages. They are too restricted, though, to be used for parsing (but good enough for lexing).

The slide talks about deriving “words”. In general, words are finite sequences of symbols from a given alphabet (as was the case for regular languages). In the concrete picture of a parser, the words are sequences of tokens, which are the elements that come out of the scanner. A successful derivation leads to tree-like representations. There are various, slightly different forms of trees connected with grammars and parsing, which we will later see in more detail; for a start, we will just illustrate such tree-like structures, without distinguishing between (abstract) syntax trees and parse trees.


Sample syntax tree

[Figure: an impressionistic syntax tree of a small program, with a program node at the root, children for declarations (decs, containing a vardec) and statements (stmts), and an assignment statement whose expression x + y is built from var nodes.]

Syntax tree

The displayed syntax tree is meant to be “impressionistic” rather than formal. Neither is it a sample syntax tree of a real programming language, nor do we want to illustrate, for instance, special features of an abstract syntax tree vs. a concrete syntax tree (or a parse tree). Those notions are closely related, and corresponding trees might all look similar to the tree shown. There might, however, be subtle conceptual and representational differences between the various classes of trees. Those are not relevant yet, at the beginning of the section.

Natural-language parse tree

[Parse tree of the English sentence “The dog bites the man”: S consists of an NP (DT “The”, N “dog”) and a VP (V “bites”, NP with DT “the” and N “man”).]

The concept of context-free grammars goes back to Chomsky (and Schützenberger). They were (also) used for describing natural languages, not computer languages (Chomsky is, among other things, a linguist). So the tree represents the syntactic structure of a (simple) English sentence. What exactly it is supposed to mean is not too important (VP and NP stand for verb phrase and noun phrase, etc.).


“Interface” between scanner and parser

• remember: task of scanner = “chopping up” the input char stream (throw away white space, etc.) and classify the pieces (1 piece = lexeme)
• classified lexeme = token
• sometimes we use 〈integer, "42"〉
  – integer: “class” or “type” of the token, also called token name
  – "42": value of the token attribute (or just value); here: directly the lexeme (a string or sequence of chars)
• a note on (sloppiness/ease of) terminology: often the token name is simply just called the token
• for (context-free) grammars: the token (symbol) corresponds there to terminal symbols (or terminals, for short)

Token names and terminals

Remark 10 (Token (names) and terminals). We said that sometimes one uses the name “token” just to mean the token symbol, ignoring its value (like "42" from above). Especially in the conceptual discussion and treatment of context-free grammars, which form the core of the specification of a parser, the token value is basically irrelevant. Therefore, one simply identifies “tokens = terminals of the grammar” and silently ignores the presence of the values. In an implementation, and in lexer/parser generators, the value "42" of an integer-representing token must obviously not be forgotten, though. The grammar may be the core of the specification of the syntactical analysis, but the result of the scanner, which produced the lexeme "42", must nevertheless not be thrown away; it is only not really part of the parser’s task.

Notations

Remark 11. Writing a compiler, especially a compiler front-end comprising a scanner and a parser, but to a lesser extent also the later phases, is about implementing representations of syntactic structures. The slides here don’t implement a lexer or a parser or similar, but describe in a hopefully unambiguous way the principles of how a compiler front end works and is implemented. To describe that, one needs “language” as well, such as English (mostly for intuitions) but also “mathematical” notations such as regular expressions or, in this section, context-free grammars. Those mathematical definitions have themselves a particular syntax. One can see them as formal domain-specific languages to describe (other) languages. One therefore faces the (unavoidable) fact that one deals with two levels of languages: the language that is described (or at least whose syntax is described) and the language used to describe that language. The situation is similar, of course, when writing a book teaching a human language: there is a language being taught, and a language used for teaching (both may be different). More closely, it’s analogous when implementing a general purpose programming language: there is the language used to implement the compiler on the one hand, and the language for which the compiler is written on the other. For instance, one may choose to implement a C++ compiler in C. It may increase the confusion if one chooses to write a C compiler in C . . . . Anyhow, the language for describing (or implementing) the language of interest is called the meta-language, and the other one, the described one, therefore just “the language”.

When writing texts or slides about such syntactic issues, one typically wants to make clear to the reader what is meant. One standard way is typographic conventions, i.e., using specific fonts. (In classic texts on compiler construction, the typographic choices were sometimes limited, maybe written as “typoscript”, i.e., as a “manuscript” on a typewriter.)

3.2 Context-free grammars and BNF notation

Grammars

• in this chapter(s): focus on context-free grammars
• thus here: grammar = CFG
• as in the context of regular expressions/languages: language = (typically infinite) set of words
• grammar = formalism to unambiguously specify a language
• intended language: all syntactically correct programs of a given programming language

Slogan

A CFG describes the syntax of a programming language. 1

Note: a compiler might reject some syntactically correct programs whose violations cannot be captured by CFGs. That is done by subsequent phases. For instance, the type checker may reject syntactically correct programs that are ill-typed. The type checker is an important part of the semantic phase (or static analysis phase). A typing discipline is not a syntactic property of a language (in that it most commonly cannot be captured by a context-free grammar); it is therefore a “semantic” property.

Remarks on grammars

Sometimes, the word “grammar” is used synonymously for context-free grammars, as CFGs are so central. However, the concept of grammars is more general; there exist context-sensitive and Turing-expressive grammars, both more expressive than CFGs. Also, a restricted class of CFGs corresponds to regular expressions/languages. Seen as grammars, regular expressions correspond to so-called left-linear grammars (or alternatively, right-linear grammars), which are a special form of context-free grammars.

¹ And some say, regular expressions describe its microsyntax.


Context-free grammar

Definition 3.2.1 (CFG). A context-free grammar G is a 4-tuple G = (ΣT, ΣN, S, P):

1. two disjoint finite alphabets of terminals ΣT and
2. non-terminals ΣN,
3. one start symbol S ∈ ΣN (a non-terminal),
4. productions P = finite subset of ΣN × (ΣN ∪ ΣT)∗.

• terminal symbols: correspond to tokens in the parser = basic building blocks of syntax
• non-terminals: (e.g. “expression”, “while-loop”, “method-definition” . . . )
• grammar: generating (via “derivations”) languages
• parsing: the inverse problem
⇒ CFG = specification

Further notions

• sentence and sentential form
• productions (or rules)
• derivation
• language of a grammar L(G)
• parse tree

Those notions will be explained with the help of examples.

BNF notation

• popular & common format to write CFGs, i.e., describe context-free languages
• named after pioneering (seriously) work on Algol 60
• notation to write productions/rules + some extra meta-symbols for convenience and grouping

Slogan: Backus-Naur form

What regular expressions are for regular languages is BNF for context-free languages.

“Expressions” in BNF

exp → exp op exp | ( exp ) | number
op  → + | − | ∗          (3.1)

• “→” indicating productions and “ | ” indicating alternatives
• convention: terminals written boldface, non-terminals italic
• also simple math symbols like “+” and “(” are meant above as terminals
• start symbol here: exp
• remember: terminals like number correspond to tokens, resp. token classes. The attributes/token values are not relevant here.


The grammar on the slide consists of 6 productions/rules, 3 for exp and 3 for op; the | is just for convenience. Side remark: often ::= is used instead of →.

Terminals

Conventions are not always 100% followed; often bold fonts for symbols such as + or ( are unavailable or not easily visible. The alternative of using, for instance, boldface “identifiers” like PLUS and LPAREN looks ugly. Some books would write '+' and '('.

In a concrete parser implementation, in an object-oriented setting, one might choose to implement terminals as classes (resp. concrete terminals as instances of classes). In that case, a class name + is typically not available and the class might be named Plus. Later we will have a look at how to systematically implement terminals and non-terminals, and having a class Plus for a terminal ‘+’ etc. is a systematic way of doing it (maybe not the most efficient one available, though).

Most texts don’t follow the conventions so slavishly and hope for an intuitive understanding by the educated reader: + is a terminal in the grammar, as it’s not a non-terminal, which are written here in italics.

Different notations

• BNF: notationally not 100% “standardized” across books/tools
• “classic” way (Algol 60):

  <exp> ::= <exp> <op> <exp>
          | ( <exp> )
          | NUMBER
  <op>  ::= + | − | ∗

• Extended BNF (EBNF) and yet another style:

  exp → exp ( "+" | "−" | "∗" ) exp
      | "(" exp ")" | "number"          (3.2)

• note: parentheses as terminals vs. as metasymbols

“Standard” BNF

Specific and unambiguous notation is important, in particular if you implement a concrete language on a computer. On the other hand, understanding the underlying concepts by humans is equally important. In that way, bureaucratically fixed notations may distract from the core, which is understanding the principles. XML, anyone? Most textbooks (and we) rely on simple typographic conventions (boldface, italics). For “implementations” of BNF specifications (as in tools like yacc), the notations, based mostly on ASCII, cannot rely on such typographic conventions.


Syntax of BNF

BNF and its variations is a notation to describe “languages”, more precisely the “syntax” of context-free languages. Of course, BNF notation, when exactly defined, is a language in itself, namely a domain-specific language for describing context-free languages. It may be instructive to write a grammar for BNF in BNF, i.e., using BNF as meta-language to describe BNF notation (or regular expressions). Is it possible to use regular expressions as meta-language to describe regular expressions?

Different ways of writing the same grammar

• directly written as 6 pairs (6 rules, 6 productions) from ΣN × (ΣN ∪ ΣT)∗, with “→” as nice looking “separator”:

  exp → exp op exp
  exp → ( exp )
  exp → number
  op  → +
  op  → −
  op  → ∗          (3.3)

• choice of non-terminals: irrelevant (except for human readability):

  E → E O E | ( E ) | number
  O → + | − | ∗          (3.4)

• still: we count 6 productions

Grammars as language generators

Deriving a word:

Start from the start symbol. Pick a “matching” rule to rewrite the current word to a new one; repeat until only terminal symbols remain.

• non-deterministic process
• rewrite relation for derivations:
  – one-step rewriting: w1 ⇒ w2
  – one step using rule n: w1 ⇒n w2
  – many steps: ⇒∗, etc.

Non-determinism means that the process of derivation allows choices to be made when applying a production. One can distinguish 2 forms of non-determinism here: 1) a sentential form contains (most often) more than one non-terminal. In that situation, one has the choice of expanding one non-terminal or the other. 2) Besides that, there may be more than one production or rule for a given non-terminal. Again, one has a choice.


As far as 1) is concerned: whether one expands one symbol or the other leads to different derivations, but won’t lead to different derivation trees or parse trees in the end. Below, we impose a fixed discipline on where to expand. That leads to left-most or right-most derivations.

Language of grammar G

L(G) = { s | start ⇒∗ s and s ∈ ΣT∗ }
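To illustrate the generative reading of L(G), here is a small, hypothetical C sketch (not from the script) that prints one word of the language of grammar (3.1) by repeatedly expanding non-terminals. The random choices correspond to the non-determinism discussed above; the depth bound is an added assumption that simply forces the derivation to stay finite.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void op(void) {                         /* op -> + | - | *            */
    switch (rand() % 3) {
    case 0:  printf("+"); break;
    case 1:  printf("-"); break;
    default: printf("*"); break;
    }
}

static void exp_(int depth) {                  /* 'exp_' avoids the libm name */
    if (depth <= 0) { printf("number"); return; }  /* force termination       */
    switch (rand() % 3) {
    case 0:  exp_(depth - 1); op(); exp_(depth - 1); break; /* exp op exp     */
    case 1:  printf("("); exp_(depth - 1); printf(")"); break; /* ( exp )     */
    default: printf("number"); break;                         /* number       */
    }
}

int main(void) {
    srand((unsigned) time(NULL));
    exp_(4);                                   /* one randomly derived word   */
    printf("\n");
    return 0;
}

Every word printed this way lies in L(G); parsing, in contrast, is the inverse problem of reconstructing such a derivation from a given word.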

Example derivation for (number−number)∗number

exp ⇒ exp op exp
    ⇒ ( exp ) op exp
    ⇒ ( exp op exp ) op exp
    ⇒ ( n op exp ) op exp
    ⇒ ( n − exp ) op exp
    ⇒ ( n − n ) op exp
    ⇒ ( n − n ) ∗ exp
    ⇒ ( n − n ) ∗ n

• underline the “place” where a rule is used, i.e., the occurrence of the non-terminal symbol being rewritten/expanded
• here: leftmost derivation²

Rightmost derivation

exp ⇒ exp op exp
    ⇒ exp op n
    ⇒ exp ∗ n
    ⇒ ( exp op exp ) ∗ n
    ⇒ ( exp op n ) ∗ n
    ⇒ ( exp − n ) ∗ n
    ⇒ ( n − n ) ∗ n

• other (“mixed”) derivations for the same word are possible

² We’ll come back to that later, it will be important.


Some easy requirements for reasonable grammars

• all symbols (terminals and non-terminals) should occur in some word derivable from the start symbol
• from every non-terminal, a word containing only terminals should be derivable
• an example of a silly grammar G (start symbol A):

  A → B x
  B → A y
  C → z

• L(G) = ∅
• those “sanitary conditions”: minimal “common sense” requirements

There can be further conditions one would like to impose on grammars besides the ones sketched. A CFG that ultimately derives only one word of terminals (or a finite set of those) does not make much sense either. There are further conditions on grammars characterizing their usefulness for parsing. So far, we mentioned just some obvious conditions for “useless” grammars or “defects” in a grammar (like superfluous symbols). “Usefulness conditions” may refer to the use of ε-productions and other situations. Those conditions will be discussed when the lecture covers parsing (not just grammars).

Remark 12 (“Easy” sanitary conditions for CFGs). We stated a few conditions to avoid grammars which technically qualify as CFGs but don’t make much sense, for instance to avoid that the grammar is obviously empty; there are easier ways to describe an empty set . . .

There’s a catch, though: it might not be immediately obvious that, for a given G, the question L(G) = ∅ is decidable!

Whether a regular expression describes the empty language is trivially decidable. Whether or not a finite state automaton describes the empty language is, if not trivial, then at least a very easily decidable question. For context-sensitive grammars (which are more expressive than CFGs but not yet Turing complete), the emptiness question turns out to be undecidable. Also other interesting questions concerning CFGs are, in fact, undecidable, like: given two CFGs, do they describe the same language? Or: given a CFG, does it actually describe a regular language? Most disturbingly perhaps: given a grammar, it’s undecidable whether the grammar is ambiguous or not. So there are interesting and relevant properties concerning CFGs which are undecidable. Why that is, is not part of the pensum of this lecture (but we will at least have to deal with the important concept of grammatical ambiguity later). Coming back to the initial question: fortunately, the emptiness problem for CFGs is decidable.
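As a minimal sketch of why emptiness is decidable (hypothetical code, not part of the script): one computes the set of “productive” non-terminals, i.e., those from which some word of terminals is derivable, by a simple fixpoint iteration; L(G) is empty iff the start symbol never becomes productive. The encoding below (upper-case letters as non-terminals, hard-coded rules) is an assumption for illustration; the grammar is the silly example A → Bx, B → Ay, C → z from above.

#include <stdio.h>
#include <ctype.h>

struct rule { char lhs; const char *rhs; };

static struct rule rules[] = { {'A', "Bx"}, {'B', "Ay"}, {'C', "z"} };
static const int nrules = 3;

int main(void) {
    int productive[26] = {0};
    int changed = 1;
    while (changed) {                        /* fixpoint iteration            */
        changed = 0;
        for (int i = 0; i < nrules; i++) {
            if (productive[rules[i].lhs - 'A']) continue;
            int ok = 1;                      /* is every symbol on the rhs
                                                a terminal or productive?     */
            for (const char *p = rules[i].rhs; *p; p++)
                if (isupper((unsigned char)*p) && !productive[*p - 'A'])
                    ok = 0;
            if (ok) { productive[rules[i].lhs - 'A'] = 1; changed = 1; }
        }
    }
    char start = 'A';
    printf("L(G) %s empty\n", productive[start - 'A'] ? "is not" : "is");
    return 0;
}

For this grammar, only C ever becomes productive, so the program reports that L(G) is empty, in line with the sanitary conditions discussed above.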

Questions concerning decidability may seem not too relevant at first sight. Even if some grammars can be constructed to demonstrate difficult questions, for instance related to decidability or worst-case complexity, the designer of a language will not intentionally try to achieve an obscure set of rules whose status is unclear, but will hopefully strive to capture in a clear manner the syntactic principles of an equally (hopefully) clearly structured language. Nonetheless: grammars for real languages may become large and complex and, even if conceptually clear, may contain unexpected bugs which make them behave unexpectedly (for instance caused by a simple typo in one of the many rules).


In general, the implementor of a parser will often rely on automatic tools (“parser generators”) which take a CFG as input and turn it into an implementation of a recognizer, which does the syntactic analysis. Such tools can obviously give reliable and accurate help automatically only for problems which are decidable. For undecidable problems, one could still achieve things automatically, provided one compromises by not insisting that the parser always terminates (but that is generally seen as unacceptable), or at the price of approximative answers. It should also be mentioned that parser generators typically won’t tackle CFGs in their full generality but are tailor-made for well-defined and well-understood subclasses thereof, where efficient recognizers are automatically generatable. In the part about parsing, we will cover some such classes.

Parse tree

• derivation: if viewed as a sequence of steps ⇒ linear “structure”
• order of individual steps: irrelevant
• ⇒ order not needed for subsequent phases
• parse tree: structure for the essence of the derivation
• also called concrete syntax tree

[Parse tree for n + n: root exp (1) with children exp (2, deriving n), op (3, deriving +), and exp (4, deriving n).]

• numbers in the tree
  – not part of the parse tree, they indicate the order of derivation, only
  – here: leftmost derivation

There will be abstract syntax trees as well, in contrast to the concrete syntax trees or parse trees covered here.

Another parse tree (numbers for rightmost derivation)

[Parse tree for ( n op n ) op n, with node numbers 1–8 indicating the order of a rightmost derivation: the root exp (1) has children exp (4), op (3), and exp (2, deriving n); exp (4) derives ( exp (5) ), which in turn consists of exp (8), op (7), and exp (6).]


Abstract syntax tree

• parse tree: still contains unnecessary details
• specifically: parentheses or similar, used for grouping
• tree-structure: can express the intended grouping already
• remember: tokens contain also attribute values (e.g.: the full token for token class n may contain a lexeme like "42" . . . )

[Figure: the parse tree for n + n (as above), together with the corresponding abstract syntax tree: a + node with children 3 and 4.]

AST vs. CST

• parse tree
  – important conceptual structure, to talk about grammars and derivations
  – most likely not explicitly implemented in a parser
• AST is a concrete data structure
  – important IR of the syntax (for the language being implemented)
  – written in the meta-language
  – therefore: nodes like + and 3 are no longer tokens or lexemes
  – concrete data structures in the meta-language (C-structs, instances of Java classes, or what suits best)
  – the figure is meant schematic, only
  – produced by the parser, used by later phases
  – note also: we use 3 in the AST, where the lexeme was "3"
  ⇒ at some point, the lexeme string (for numbers) is translated to a number in the meta-language (typically already by the lexer)

Plausible schematic AST (for the other parse tree)

[Schematic AST: a ∗ node whose children are a − node (with children 34 and 3) and 42, i.e., representing (34 − 3) ∗ 42.]

• this AST: a rather “simplified” version of the CST
• an AST closer to the CST (just dropping the parentheses): in principle nothing “wrong” with it either


We should repeat: the shown ASTs are “schematic” and for illustration. It’s best to keep in mind that, in a concrete compiler, the AST is a data structure. A specific source file is then represented as a specific tree, i.e., an instance of the AST data structure.
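As a minimal sketch of what “AST as a data structure in the meta-language” can mean (hypothetical C code, not the course's actual AST definition), here is one way to represent the schematic AST for 3 + 4:

#include <stdio.h>
#include <stdlib.h>

typedef enum { NUM, PLUS, MINUS, TIMES } Tag;

typedef struct Exp {
    Tag tag;
    int val;                      /* used when tag == NUM            */
    struct Exp *left, *right;     /* used for the binary operators   */
} Exp;

static Exp *num(int v) {
    Exp *e = malloc(sizeof *e);
    e->tag = NUM; e->val = v; e->left = e->right = NULL;
    return e;
}

static Exp *bin(Tag t, Exp *l, Exp *r) {
    Exp *e = malloc(sizeof *e);
    e->tag = t; e->left = l; e->right = r; e->val = 0;
    return e;
}

int main(void) {
    Exp *tree = bin(PLUS, num(3), num(4));    /* the AST for 3 + 4 */
    printf("root tag: %d, left: %d, right: %d\n",
           tree->tag, tree->left->val, tree->right->val);
    return 0;
}

Section 3.4 below shows a more realistic node type, the one used for the Tiny language.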

Conditionals

Conditionals G1

stmt → if-stmt | other
if-stmt → if ( exp ) stmt
        | if ( exp ) stmt else stmt
exp → 0 | 1          (3.5)

Conditionals in one syntactic form or another occur in basically all programming languages. As of now, we use the conditionals for not much more than pointing out something that should be rather obvious: there is (always) more than one way to describe an intended language by a context-free grammar. The same was the case for regular expressions (and generally for all notational systems): there is always more than one way to describe things. Also, we make use of ε (the empty word) in one of the formulations.

Of course, with more than one formulation, some may be “better” than others. That may refer to “clarity” or readability for humans. But there are also aspects concerning parsing. One formulation of a grammar may be in a form that is unhelpful for parsers. It may also depend on the chosen style of parser: some formulations pose problems for top-down parsers resp. for bottom-up parsers. Issues like that will be discussed in the chapter on parsing; here we are still covering grammars. In particular in connection with conditionals (which is a classic example): the chosen syntax here will lead to ambiguity, which we will discuss later. In this particular example, both formulations of the grammar are ambiguous (it will be a classical example of ambiguity), and it takes some effort to reformulate them into an equivalent but unambiguous grammar. We pick up on those issues (the “dangling else”) later.

Parse tree

if ( 0 ) other else other


[Parse tree: stmt derives if-stmt, which derives if ( exp ) stmt else stmt, with exp deriving 0 and both stmt occurrences deriving other.]

Another grammar for conditionals

Conditionals G2

stmt → if-stmt | other
if-stmt → if ( exp ) stmt else-part
else-part → else stmt | ε
exp → 0 | 1          (3.6)

Abbreviation

ε = empty word

A further parse tree + an AST

[Parse tree: stmt derives if-stmt, which derives if ( exp ) stmt else-part; exp derives 0, stmt derives other, and else-part derives else stmt with stmt deriving other. Next to it, the AST: a COND node with the three children 0, other, and other.]


A potentially missing else-part may be represented by null “pointers” in languages like Java.

In functional languages, one could use “option” types to represent in a safer way the fact that the else-part may be there or may be missing. With null pointers, there is always the danger that the programmer forgets that the value may not be there, forgets to check that case properly, and causes a null pointer exception.

3.3 Ambiguity

Before, we mentioned some “easy” conditions to avoid “silly” grammars, without going into detail. Ambiguity is more important and complex. Roughly speaking, a grammar is ambiguous if there exist sentences for which there are two different parse trees. That’s in general highly undesirable, as it means there are sentences with different syntactic interpretations (which therefore may ultimately be interpreted differently). That is mostly a no-no, but even if one would accept such a language definition, parsing would be problematic, as it would involve backtracking, trying out different possible interpretations during parsing (which would also be a no-no for reasons of efficiency). In fact, later, when dealing with actual concrete parsing procedures, they cover certain specific forms of CFGs (with names like LL(1), LR(1), etc.), which are in particular non-ambiguous. To say it differently: the fact that a grammar is parseable by some, say, LL(1) top-down parser (which does not do backtracking) directly implies that the grammar is unambiguous. Similarly for the other classes we’ll cover.

Note also: given an ambiguous grammar, it is often possible to find a different, “equivalent” grammar that is unambiguous. Even if such reformulations are often possible, it’s not guaranteed: there are context-free languages which do have an ambiguous grammar, but no unambiguous one. In that case, one speaks of an ambiguous context-free language. We concentrate on ambiguity of grammars.

Now that we have said that ambiguity in grammars must be avoided, we should also say that, in certain situations, one can in some way live with it. One way of living with it is imposing extra conditions on the way the grammar is used, which remove the ambiguity (in a way, prioritizing some rules over others). In practice, that often takes the form of specifying associativities and binding powers of operators, like making clear that 1 + 2 + 3 is “supposed” to be interpreted as (1 + 2) + 3 (addition is left-associative) and 1 + 2 × 3 is the same as 1 + (2 × 3) (multiplication binds stronger than addition). The grammar as such is ambiguous, but that’s fine, since one can make it non-ambiguous by imposing such additional constraints. And not only can one do that technically, that form of disambiguation is also transparent for the user.


Tempus fugit . . .

picture source: wikipedia

One famous sentence often used to illustrate ambiguity in natural languages is “Time flies like a banana”. That sentence is often attributed to Groucho Marx, but the attribution is a bit apocryphal.

Ambiguous grammar

Definition 3.3.1 (Ambiguous grammar). A grammar is ambiguous if there exists a word with two different parse trees.

Remember the grammar from equation (3.1):

exp → exp op exp | ( exp ) | number
op → + | − | ∗

Consider:

n−n∗n

2 CSTs

[Parse tree 1: exp derives exp op exp, where the left exp again derives exp op exp (both children deriving n) and the right exp derives n, i.e., the grouping ( n op n ) op n.]

[Parse tree 2: exp derives exp op exp, where the left exp derives n and the right exp again derives exp op exp (both children deriving n), i.e., the grouping n op ( n op n ).]

2 resulting ASTs

[AST 1: a ∗ node with children − (over 34 and 3) and 42, representing (34 − 3) ∗ 42. AST 2: a − node with children 34 and ∗ (over 3 and 42), representing 34 − (3 ∗ 42).]

different parse trees ⇒ different ASTs ⇒ different meaning

Side remark: different meaning

The issue of “different meaning” may in practice be subtle: is (x + y) − z the same as x + (y − z)? In principle yes, but what about MAXINT?

The slide stipulates that different parse trees lead to different ASTs and this in turn to different meanings. That is in principle correct, but there may be special circumstances where that’s not the case. Different CSTs may actually result in the same AST. Or it may also lead to different ASTs which turn out to have the same meaning. The slide gave an example where it’s debatable whether two different ASTs have the same meaning or not.

Precedence & associativity

• one way to make a grammar unambiguous (or less ambiguous)
• for instance:

  binary op’s   precedence   associativity
  +, −          low          left
  ×, /          higher       left
  ↑             highest      right

• a ↑ b is written in standard math as a^b:

  5 + 3/5 × 2 + 4 ↑ 2 ↑ 3  =  5 + 3/5 × 2 + 4^(2^3)  =  (5 + ((3/5) × 2)) + 4^(2^3)

• mostly fine for binary ops, but usually also for unary ones (postfix or prefix)


Unambiguity without imposing explicit associativity and precedence

• removing ambiguity by reformulating the grammar
• precedence for op’s: precedence cascade
  – some bind stronger than others (∗ more than +)
  – introduce a separate non-terminal for each precedence level (here: terms and factors)

The method sketched here (“precedence cascade”) is a recipe to massage a grammar in such a way that the result captures the intended precedences (and at the same time their associativities). It works in that way for syntax using binary operators. The recipe is commonly illustrated using numerical expressions. We will encounter analogous tasks also in the exercises.

Expressions, revisited

• associativity
  – left-assoc: write the corresponding rules in a left-recursive manner, e.g.:

      exp → exp addop term | term

  – right-assoc: analogous, but right-recursive
  – non-assoc:

      exp → term addop term | term

factors and terms

exp → exp addop term | term
addop → + | −
term → term mulop factor | factor
mulop → ∗
factor → ( exp ) | number          (3.7)

34 − 3 ∗ 42

[Parse tree for 34 − 3 ∗ 42 in grammar (3.7): exp derives exp addop term; the left exp derives term → factor → n, addop derives −, and the right term derives term mulop factor, with the inner term and the factor each deriving n.]


34 − 3 − 42

[Parse tree for 34 − 3 − 42 in grammar (3.7): the left-recursive rule exp → exp addop term is applied twice, so the tree groups to the left, corresponding to (34 − 3) − 42; each term/factor derives n.]

Ambiguity

As mentioned, the question whether a given CFG is ambiguous or not is undecidable. Note also: if one uses a parser generator, such as yacc or bison (which cover a practically useful subset of CFGs), the resulting recognizer is always deterministic. In case the construction encounters ambiguous situations, they are “resolved” by making a specific choice. Nonetheless, such ambiguities often indicate that the formulation of the grammar (or even the language it defines) has problematic aspects. Most programmers, as “users” of a programming language, may not read the full BNF definition; most will try to grasp the language by looking at sample code pieces mentioned in the manual, etc. And even if they bother studying the exact specification of the system, i.e., the full grammar, ambiguities are not obvious (after all, the problem is undecidable, at least in general). Hidden ambiguities, “resolved” by the generated parser, may lead to misconceptions as to what a program actually means. It’s similar to the situation when one tries to study a book on arithmetic while being unaware that multiplication binds stronger than addition: without being aware of that, some sections won’t make much sense. A parser implementing such grammars may make consistent choices, but the programmer using the compiler may not be aware of them. At least the compiler writer, responsible for designing the language, will be informed about “conflicts” in the grammar, and a careful designer will try to get rid of them. This may be done by adding associativities and precedences (when appropriate), reformulating the grammar, or even reconsidering the syntax of the language. While ambiguities and conflicts are generally a bad sign, arbitrarily adding a complicated “precedence order” and “associativities” on all kinds of symbols, or complicating the grammar by adding ever more separate classes of non-terminals just to make the conflicts go away, is not a real solution either. Chances are that those parser-internal “tricks” will be lost on the programmer as user of the language, as well. Sometimes, making the language simpler (as opposed to complicating the grammar for the same language) might be the better choice. That can typically be done by making the language more verbose and reducing “overloading” of syntax. Of course, going overboard by making groupings etc. of all constructs crystal clear to the parser may also lead to non-elegant designs. Lisp is a standard example, notoriously known for its extensive use of parentheses. Basically, the programmer directly writes down syntax trees, which certainly removes ambiguities, but still, mountains of parentheses are also not the easiest syntax for human consumption (for most humans, at least). So it’s a balance (and at least partly a matter of taste, as for most design choices and questions of language pragmatics).

But in general: if it’s enormously complex to come up with a reasonably unambiguous grammar for an intended language, chances are that reading programs in that language and intuitively grasping what is intended may be hard for humans, too.

Note also: since already the question whether a given CFG is ambiguous or not is undecidable, it should be clear that the following question is undecidable as well: given a grammar, can I reformulate it, still accepting the same language, so that it becomes unambiguous?

Real life example

The scan is taken from an edition of the book “Java in a Nutshell”. The next example, covering C++, is clipped from the net.


Another example

Non-essential ambiguity

left-assoc

stmt-seq → stmt-seq ; stmt | stmt
stmt → S

[Parse tree for S ; S ; S with the left-recursive grammar: stmt-seq derives stmt-seq ; stmt twice, grouping to the left.]


Non-essential ambiguity (2)

right-assoc representation instead

stmt-seq → stmt ; stmt-seq | stmt
stmt → S

[Parse tree for S ; S ; S with the right-recursive grammar: stmt-seq derives stmt ; stmt-seq twice, grouping to the right.]

Possible AST representations

[Possible AST: a Seq node with the three children S, S, S; the same flattened representation can serve both grammar variants.]

Dangling else

Nested if’s

if ( 0 ) if ( 1 ) other else other

Remember the grammar from equation (3.5):

stmt → if-stmt | other
if-stmt → if ( exp ) stmt
        | if ( exp ) stmt else stmt
exp → 0 | 1


Should it be like this . . .

[Parse tree: one of the two possible readings of the nested conditional, attaching the else to one of the two ifs.]

. . . or like this

[Parse tree: the other possible reading, attaching the else to the other if.]

• common convention: connect else to closest “free” (= dangling) occurrence

Unambiguous grammar

Grammar

stmt → matched_stmt | unmatch_stmt
matched_stmt → if ( exp ) matched_stmt else matched_stmt
             | other
unmatch_stmt → if ( exp ) stmt
             | if ( exp ) matched_stmt else unmatch_stmt
exp → 0 | 1

• never have an unmatched statement inside a matched one
• complex grammar, seldomly used
• instead: the ambiguous one, with the extra “rule”: connect each else to the closest free if
• alternative: different syntax, e.g.,
  – mandatory else, or
  – require endif

CST

[CST for the nested conditional in the unambiguous grammar: stmt derives unmatch_stmt, which derives if ( exp ) stmt with exp deriving 0; the inner stmt derives matched_stmt, i.e., if ( exp ) matched_stmt else matched_stmt, with exp deriving 1 and both matched statements deriving other.]

Adding sugar: extended BNF

• make CFG-notation more “convenient” (but without more theoretical expressiveness)
• syntactic sugar

EBNF

Main additional notational freedom: use regular expressions on the rhs of productions.They can contain terminals and non-terminals.

• EBNF: officially standardized, but often all “sugared” BNFs are called EBNF
• in the standard:
  – α∗ written as {α}
  – α? written as [α]
• supported (in the standardized form or other) by some parser tools, but not all
• remember equation (3.2)


EBNF examples

A → β{α}   for   A → Aα | β
A → {α}β   for   A → αA | β

stmt-seq → stmt {; stmt}
stmt-seq → {stmt ;} stmt
if-stmt → if ( exp ) stmt [else stmt]

Greek letters here stand for arbitrary sequences of terminals and non-terminals.

Some yacc style grammar

/* Infix notation calculator--calc */
%{
#define YYSTYPE double
#include <math.h>
%}

/* BISON Declarations */
%token NUM
%left '-' '+'
%left '*' '/'
%left NEG     /* negation--unary minus */
%right '^'    /* exponentiation        */

/* Grammar follows */
%%
input:    /* empty string */
        | input line
;

line:     '\n'
        | exp '\n'  { printf("\t%.10g\n", $1); }
;

exp:      NUM                 { $$ = $1;           }
        | exp '+' exp         { $$ = $1 + $3;      }
        | exp '-' exp         { $$ = $1 - $3;      }
        | exp '*' exp         { $$ = $1 * $3;      }
        | exp '/' exp         { $$ = $1 / $3;      }
        | '-' exp  %prec NEG  { $$ = -$2;          }
        | exp '^' exp         { $$ = pow($1, $3);  }
        | '(' exp ')'         { $$ = $2;           }
;
%%
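A side remark (not part of the quoted file): the %left and %right declarations are exactly the kind of externally imposed associativities and precedences discussed earlier in this chapter. The expression rules themselves are ambiguous, and bison resolves the resulting conflicts according to these declarations, with later declarations binding more strongly. To actually run the calculator, one would still have to add a yylex function and a main, and then process the file with bison and a C compiler.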


3.4 Syntax of a “Tiny” language

BNF-grammar for TINY

program → stmt-seq
stmt-seq → stmt-seq ; stmt | stmt
stmt → if-stmt | repeat-stmt | assign-stmt | read-stmt | write-stmt
if-stmt → if expr then stmt end
        | if expr then stmt else stmt end
repeat-stmt → repeat stmt-seq until expr
assign-stmt → identifier := expr
read-stmt → read identifier
write-stmt → write expr
expr → simple-expr comparison-op simple-expr | simple-expr
comparison-op → < | =
simple-expr → simple-expr addop term | term
addop → + | −
term → term mulop factor | factor
mulop → ∗ | /
factor → ( expr ) | number | identifier

Syntax tree nodes

typedef enum {StmtK, ExpK} NodeKind;
typedef enum {IfK, RepeatK, AssignK, ReadK, WriteK} StmtKind;
typedef enum {OpK, ConstK, IdK} ExpKind;

/* ExpType is used for type checking */
typedef enum {Void, Integer, Boolean} ExpType;

#define MAXCHILDREN 3

typedef struct treeNode
{ struct treeNode * child[MAXCHILDREN];
  struct treeNode * sibling;
  int lineno;
  NodeKind nodekind;
  union { StmtKind stmt; ExpKind exp; } kind;
  union { TokenType op;
          int val;
          char * name; } attr;
  ExpType type; /* for type checking of exps */
} TreeNode;

Comments on C-representation

• typical use of enum types for that (in C)
• enums in C can be very efficient
• the treeNode struct (record) is a bit “unstructured”
• newer/higher-level languages than C: better structuring advisable, especially for languages larger than Tiny
• in Java-like languages: inheritance/subtyping and abstract classes/interfaces are often used for better structuring
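As a hedged sketch (the helper name and its exact shape are assumptions, not the book's code), node creation for the struct above might look as follows:

#include <stdlib.h>

TreeNode *newStmtNode(StmtKind kind, int lineno)
{ TreeNode *t = calloc(1, sizeof(TreeNode));   /* children and sibling start as NULL */
  if (t != NULL)
  { t->nodekind = StmtK;
    t->kind.stmt = kind;
    t->lineno = lineno;
  }
  return t;
}

/* usage, e.g., for an assignment node on source line 4:
     TreeNode *t = newStmtNode(AssignK, 4);
     t->attr.name = "fact";                                 */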

Sample Tiny program

read x; { input as integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact { output factorial of x }
end

Same Tiny program again

read x; { input as integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact { output factorial of x }
end

• keywords / reserved words are highlighted by bold-face typesetting
• reserved syntax like 0, :=, . . . is not bold-faced
• comments are italicized


Abstract syntax tree for a tiny program

[Figure: the AST of the sample Tiny program above; not reproduced in this text version.]

Some questions about the Tiny grammar

• Is the grammar unambiguous?
• How can we change it so that Tiny allows empty statements?
• What if we want semicolons in between statements and not after?
• What is the precedence and associativity of the different operators?

3.5 Chomsky hierarchy

The Chomsky hierarchy

• linguist Noam Chomsky [5]
• important classification of (formal) languages (sometimes Chomsky–Schützenberger)
• 4 levels: type 0 languages – type 3 languages
• levels related to machine models that generate/recognize them
• so far: regular languages and CF languages

Overview

level   rule format              languages                machines                        closed under
3       A → aB,  A → a           regular                  NFA, DFA                        all
2       A → α1 β α2              context-free             pushdown automata               ∪, ∗, ◦
1       α1 A α2 → α1 β α2        context-sensitive        (linearly restricted automata)  all
0       α → β,  α ≠ ε            recursively enumerable   Turing machines                 all, except complement


Conventions

• terminals a, b, . . . ∈ ΣT
• non-terminals A, B, . . . ∈ ΣN
• general words α, β, . . . ∈ (ΣT ∪ ΣN)∗

Remark: Chomsky hierarchy

The rule format for type-3 languages (= regular languages) is also called right-linear. Alternatively, one can use left-linear rules. If one mixes right- and left-linear rules, one leaves the class of regular languages. The rule format above allows only one terminal symbol; in principle, sequences of terminal symbols in a right-linear (or else left-linear) rule would be ok too.

Phases of a compiler & hierarchy

“Simplified” design?

One big grammar for the whole compiler? Or at least a CSG for the front-end, or a CFG combining parsing and scanning?

theoretically possible, but bad idea:

• efficiency
• bad design
• especially combining scanner + parser in one BNF:
  – grammar would be needlessly large
  – separation of concerns: much clearer / more efficient design
• for scanners/parsers: regular expressions + (E)BNF are simply the formalisms of choice!
  – the front-end needs to do more than checking syntax, CFGs are not expressive enough
  – for level 2 and higher: the situation gets less clear-cut, plain CSGs are not too useful for compilers


4 Parsing

What is it about? Learning targets of this chapter:
1. top-down and bottom-up parsing
2. look-ahead
3. first and follow sets
4. different classes of parsers (LL, LALR)

Contents: 4.1 Introduction to parsing – 4.2 Top-down parsing – 4.3 First and follow sets – 4.4 Massaging grammars – 4.5 LL-parsing (mostly LL(1)) – 4.6 Error handling – 4.7 Bottom-up parsing

4.1 Introduction to parsing

What’s a parser generally doing

task of parser = syntax analysis

• input: stream of tokens from the lexer
• output:
  – abstract syntax tree
  – or meaningful diagnosis of the source of the syntax error
• the full “power” (i.e., expressiveness) of CFGs is not used
• thus:
  – consider restrictions of CFGs, i.e., a specific subclass, and/or
  – represented in specific ways (no left-recursion, left-factored . . . )

Syntax errors (and other errors)

Since, almost by definition, the syntax of a language comprises those aspects covered by a context-free grammar, a syntax error is thereby a violation of the grammar, something the parser has to detect. Given a CFG, typically written in BNF resp. implemented using a tool supporting a BNF variant, the parser (in combination with the lexer) must generate an AST exactly for those programs that adhere to the grammar and must reject all others. One says, the parser recognizes the given grammar. An important practical part of rejecting a program is to generate a meaningful error message, giving hints about potential locations of the error and potential reasons. In the most minimal way, the parser should inform the programmer where the parser tripped, i.e., telling how far, from left to right, it was able to proceed and where it stumbled: “parser error in line xxx / at character position yyy”. One typically has higher expectations for a real parser than just the line number, but that’s the basics.

It may be noted that also the subsequent phase, the semantic analysis, which takes the abstract syntax tree as input, may report errors, which are then no longer syntax errors but more complex kinds of errors. One typical kind of error in the semantic phase is a type error. Also there, the minimal requirement is to indicate the probable location(s) where the error occurs. To do so, in basically all compilers, the nodes in the abstract syntax tree will contain information concerning the position in the original file that the respective node corresponds to (like line numbers, character positions). If the parser did not add that information to the AST, the semantic analysis would have no way to relate potential errors it finds to the original, concrete code in the input. Remember: the compiler goes in phases, and once the parsing phase is over, there’s no going back to scan the file again.

Lexer, parser, and the rest

[Diagram: the source program enters the lexer; the parser repeatedly asks the lexer to “get next token” and receives tokens; the parser produces the AST, which the rest of the front end turns into the intermediate representation. All phases interact with the symbol table.]

Top-down vs. bottom-up

• all parsers (together with lexers): left-to-right
• remember: parsers operate with trees
  – parse tree (concrete syntax tree): representing the grammatical derivation
  – abstract syntax tree: data structure
• 2 fundamental classes
• while the parser eats through the token stream, it grows, i.e., builds up (at least conceptually), the parse tree:

Bottom-up

Parse tree is being grown from the leaves to the root.

Top-down

Parse tree is being grown from the root to the leaves.


Parsing restricted classes of CFGs

• parser: better be “efficient”
• full complexity of CFLs: not really needed in practice
• classification of CF languages vs. CF grammars, e.g.:
  – left-recursion-freedom: a condition on a grammar
  – ambiguous language vs. ambiguous grammar
• classification of grammars ⇒ classification of languages
  – a CF language is (inherently) ambiguous if there’s no unambiguous grammar for it
  – a CF language is top-down parseable if there exists a grammar that allows top-down parsing . . .
• in practice: classification of parser generating tools:
  – based on the accepted notation for grammars (BNF or some form of EBNF etc.)

Concerning the need (or the lack of need) for very expressive grammars, one should consider the following: if a parser has trouble figuring out whether a program has a syntax error or not (perhaps using backtracking), humans will probably have similar problems. So better keep it simple. And time in a compiler may be better spent elsewhere (optimization, semantic analysis).

Classes of CFG grammars/languages

• maaaany have been proposed & studied, including their relationships
• the lecture concentrates on
  – top-down parsing, in particular
    ∗ LL(1)
    ∗ recursive descent
  – bottom-up parsing
    ∗ LR(1)
    ∗ SLR
    ∗ LALR(1) (the class covered by yacc-style tools)
• grammars typically written in pure BNF

Relationship of some grammar (not language) classes

[Diagram: nested regions showing the grammar classes LL(0) ⊆ LL(1) ⊆ LL(k) and LR(0) ⊆ SLR ⊆ LALR(1) ⊆ LR(1) ⊆ LR(k), all lying within the unambiguous grammars; ambiguous grammars lie outside.]


taken from [4]

4.2 Top-down parsing

General task (once more)

• Given: a CFG (but appropriately restricted)
• Goal: a “systematic method” s.t.
  1. for every given word w: check syntactic correctness
  2. [build the AST / a representation of the parse tree as a side effect]
  3. [do reasonable error handling]

Schematic view on “parser machine”

[Schematic picture: a reading “head” moves left-to-right over the token sequence . . . if 1 + 2 ∗ ( 3 + 4 ) . . . ; a finite control (states q0, q1, q2, . . . , qn) is connected to an unbounded extra memory (stack).]

Note: the input is a sequence of tokens (not characters).

Derivation of an expression

Derivation

The slides contain a big series of overlays, showing the derivation. This derivation process is not reproduced here (resp. only a few pages later, as one big array of steps).

factors and terms

exp → term exp′exp′ → addop term exp′ | ε

addop → + | −term → factor term′

term′ → mulop factor term′ | εmulop → ∗factor → ( exp ) | n

(4.1)


Remarks concerning the derivation

Note:

• input = stream of tokens
• there: 1 . . . stands for the token class number (for readability/concreteness); in the grammar it is just number
• in full detail: a pair of token class and token value 〈number, 1〉

Notation:

• underline: the place (occurrence of a non-terminal) where a production is used
• crossed out:
  – terminal = token is considered treated
  – parser “moves on”
  – later implemented as a match or eat procedure

Not as a “film” but at a glance: reduction sequence

exp ⇒ term exp′
    ⇒ factor term′ exp′
    ⇒ number term′ exp′
    ⇒ number exp′
    ⇒ number addop term exp′
    ⇒ number + term exp′
    ⇒ number + factor term′ exp′
    ⇒ number + number term′ exp′
    ⇒ number + number mulop factor term′ exp′
    ⇒ number + number ∗ factor term′ exp′
    ⇒ number + number ∗ ( exp ) term′ exp′
    ⇒ number + number ∗ ( term exp′ ) term′ exp′
    ⇒ number + number ∗ ( factor term′ exp′ ) term′ exp′
    ⇒ number + number ∗ ( number term′ exp′ ) term′ exp′
    ⇒ number + number ∗ ( number exp′ ) term′ exp′
    ⇒ number + number ∗ ( number addop term exp′ ) term′ exp′
    ⇒ number + number ∗ ( number + term exp′ ) term′ exp′
    ⇒ number + number ∗ ( number + factor term′ exp′ ) term′ exp′
    ⇒ number + number ∗ ( number + number term′ exp′ ) term′ exp′
    ⇒ number + number ∗ ( number + number exp′ ) term′ exp′
    ⇒ number + number ∗ ( number + number ) term′ exp′
    ⇒ number + number ∗ ( number + number ) exp′
    ⇒ number + number ∗ ( number + number )


Besides this derivation sequence, the slide version also contains an “overlay” version, expanding the sequence step by step and additionally crossing out the terminals that have already been matched against the input. The derivation is a left-most derivation.
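The expansion steps above are exactly what a recursive-descent parser for grammar (4.1) would perform. As a rough, hypothetical sketch (not the lecture's implementation, and anticipating material from the later LL(1) sections), with tokens reduced to single characters and 'n' standing for number:

#include <stdio.h>
#include <stdlib.h>

static const char *input = "n+n*(n+n)";
static char lookahead;

static void next(void)    { lookahead = *input ? *input++ : '\0'; }
static void error(void)   { fprintf(stderr, "syntax error\n"); exit(1); }
static void match(char t) { if (lookahead == t) next(); else error(); }

static void expr(void);  static void exprP(void);
static void term(void);  static void termP(void);
static void factor(void);

static void expr(void)  { term(); exprP(); }                 /* exp  -> term exp'              */
static void exprP(void) {                                    /* exp' -> addop term exp' | eps  */
    if (lookahead == '+' || lookahead == '-') { next(); term(); exprP(); }
}
static void term(void)  { factor(); termP(); }               /* term -> factor term'           */
static void termP(void) {                                    /* term'-> mulop factor term'|eps */
    if (lookahead == '*') { next(); factor(); termP(); }
}
static void factor(void) {                                   /* factor -> ( exp ) | n          */
    if (lookahead == '(') { match('('); expr(); match(')'); }
    else match('n');
}

int main(void) {
    next();
    expr();
    if (lookahead == '\0') printf("accepted\n"); else error();
    return 0;
}

Each procedure corresponds to one non-terminal of (4.1), and the one-symbol look-ahead decides which alternative to take; match plays the role of the “eat” procedure mentioned above.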

Best viewed as a tree

[Parse tree for number + number ∗ ( number + number ) according to grammar (4.1): exp derives term exp′; the first term derives factor term′ with factor → number and term′ → ε; exp′ derives addop term exp′ with addop → +, and so on, down into the parenthesized sub-expression, with ε-leaves for the unused exp′ and term′ occurrences.]

The tree no longer contains the information which parts have been expanded first. In particular, the information that we have concretely done a left-most derivation when building up the tree in a top-down fashion is not part of the tree (as it is not important). The tree is an example of a parse tree, as it contains information about the derivation process using rules of the grammar.

Non-determinism?

• not a “free” expansion/reduction/generation of some word, but
  – reduction of the start symbol towards the target word of terminals

    exp ⇒∗ 1 + 2 ∗ (3 + 4)

  – i.e.: the input stream of tokens “guides” the derivation process (at least it fixes the target)
• but: how much “guidance” does the target word (in general) give?

Oracular derivation

exp → exp + term | exp − term | term
term → term ∗ factor | factor
factor → ( exp ) | number


exp                        ⇒1    ↓ 1 + 2 ∗ 3
exp + term                 ⇒3    ↓ 1 + 2 ∗ 3
term + term                ⇒5    ↓ 1 + 2 ∗ 3
factor + term              ⇒7    ↓ 1 + 2 ∗ 3
number + term                    ↓ 1 + 2 ∗ 3
number + term                    1 ↓ + 2 ∗ 3
number + term              ⇒4    1 + ↓ 2 ∗ 3
number + term ∗ factor     ⇒5    1 + ↓ 2 ∗ 3
number + factor ∗ factor   ⇒7    1 + ↓ 2 ∗ 3
number + number ∗ factor         1 + ↓ 2 ∗ 3
number + number ∗ factor         1 + 2 ↓ ∗ 3
number + number ∗ factor   ⇒7    1 + 2 ∗ ↓ 3
number + number ∗ number         1 + 2 ∗ ↓ 3
number + number ∗ number         1 + 2 ∗ 3 ↓

The derivation shown is a left-most derivation. Again, the “redex” is underlined. In addition, we show in the right-hand column the input and the progress being made on that input. The subscripts on the derivation arrows indicate which rule is chosen in that particular derivation step.

The point of the example is the following: consider lines 7 and 8, and the steps the parser does. In line 7, it is about to expand term, which is the left-most non-terminal. Looking into the “future”, the unparsed part is 2 * 3. In that situation, the parser chooses production 4 (indicated by ⇒4). In the next line, the left-most non-terminal is term again, and the non-processed input has not changed either. However, in that situation, the “oracular” parser chooses ⇒5.

What does that mean? It means that the look-ahead did not help the parser! It used all the look-ahead there is, namely until the very end of the word, and it still cannot make the right decision with all the knowledge available at that given point. Note also: choosing wrongly (like ⇒5 instead of ⇒4 or the other way around) would lead to a failed parse (which would require backtracking). That means the word is unparseable without backtracking (and no amount of look-ahead will help), at least if we do left-most derivations top-down.

Right-most derivations are not really an option, as typically we want to eat the input left-to-right. Secondly, right-most derivations would suffer from the same problem (perhaps not for this very grammar, but in general), so nothing would even be gained.

On the other hand, bottom-up parsing later works on different principles, so the particular problem illustrated by this example will not bother that style of parsing (but there are other challenges then).

So, what is the problem here? The reason why the parser could not make a uniform decision (for example comparing lines 7 and 8) comes from the fact that these two particular lines are connected by ⇒4, which corresponds to the production

term → term ∗ factor

There the derivation step replaces the left-most term by term again, without moving ahead in the input. This form of rule is said to be left-recursive (with recursion on term). This is something that recursive descent parsers cannot deal with (or at least not without doing backtracking, which is not an option).


Note also: the grammar is not ambiguous (without proof). If a grammar is ambiguous, then parsing also won’t work properly (in that case, neither will bottom-up parsing), but ambiguity is not the problem right here.

We will learn how to transform grammars automatically to remove left-recursion. It’s an easy construction. Note, however, that the construction does not necessarily result in a grammar that is afterwards top-down parseable. It simply removes a “feature” of the grammar which definitely cannot be treated by top-down parsing.

As a side remark, for being super-precise: if a grammar contains left-recursion on a non-terminal which is “irrelevant” (i.e., no word will ever lead to a parse involving that particular non-terminal), then obviously the left-recursion does not hurt. Of course, the grammar in that case would be “silly”. We generally do not consider grammars which contain such irrelevant symbols (or have other such obviously meaningless defects). But unless we exclude such silly grammars, it’s not 100% true that grammars with left-recursion cannot be treated via top-down parsing. Apart from that, however, it is the case:

left-recursion destroys top-down parseability

(when based on left-most derivations/left-to-right parsing as it is always done for top-down).

Two principal sources of non-determinism

Using production A→ β

S ⇒∗ α1 A α2 ⇒ α1 β α2 ⇒∗ w

Conventions

• α1, α2, β: word of terminals and nonterminals• w: word of terminals, only• A: one non-terminal

2 choices to make

1. where, i.e., on which occurrence of a non-terminal in α1Aα2 to apply a pro-duction

2. which production to apply (for the chosen non-terminal).

Note that α1 and α2 may contain non-terminals, including further occurrences of A. However, the words w1 and w2 contain terminals only. By convention, A, B, etc. are non-terminal symbols, w, . . . are words of terminals, and Greek-lettered symbols α, β, . . . represent words of terminals and non-terminals.


Left-most derivation

• that’s the easy part of the non-determinism
• taking care of the “where-to-reduce” non-determinism: left-most derivation
• notation ⇒l

• some of the example derivations earlier used that

Non-determinism vs. ambiguity

• Note: the “where-to-reduce” non-determinism ≠ ambiguity of a grammar
• in a way (“theoretically”): where to reduce next is irrelevant:

  – the order in the sequence of derivations does not matter
  – what does matter: the derivation tree (aka the parse tree)

Lemma 4.2.1 (Left or right, who cares). S ⇒∗l w iff S ⇒∗r w iff S ⇒∗ w.

• however (“practically”): a (deterministic) parser implementation must make a choice

Using production A→ β

S ⇒∗ α1 A α2 ⇒ α1 β α2 ⇒∗ w

S ⇒∗l w1 A α2 ⇒ w1 β α2 ⇒∗l w

Remember the notational conventions used here: w stands for words containing terminals only, whereas α represents arbitrary words.

What about the “which-right-hand side” non-determinism?

A→ β | γ

Is that the correct choice?

S ⇒∗l w1 A α2 ⇒ w1 β α2 ⇒∗l w

• reduction with “guidance”: don’t lose sight of the target w
  – the “past” is fixed: w = w1w2
  – the “future” is not:

Aα2 ⇒l βα2 ⇒∗l w2 or else Aα2 ⇒l γα2 ⇒∗l w2 ?

Needed (minimal requirement):

In such a situation, “future target” w2 must determine which of the rules to take!


Deterministic, yes, but still impractical

Aα2 ⇒l βα2 ⇒∗l w2 or else Aα2 ⇒l γα2 ⇒∗l w2 ?

• the “target” w2 is of unbounded length!
⇒ impractical, therefore:

Look-ahead of length k

resolve the “which-right-hand-side” non-determinism by inspecting only a fixed-length prefix of w2 (for all situations as above)

LL(k) grammars

CF-grammars which can be parsed doing that.

Of course, one can always write a parser that “just makes some decision” based on looking ahead k symbols. The question is: will that allow it to capture all words from the grammar, and only those?

4.3 First and follow sets

The considerations leading to a useful criterion for top-down parsing without backtracking will involve the definition of the so-called “first-sets”. In connection with that definition, there will also be the (related) definition of follow-sets.

We had a general look at what a look-ahead is, and how it helps in top-down parsing. We also saw that left-recursion is bad for top-down parsing (in particular, no amount of look-ahead can help the parser there). The definitions discussed so far, being based on arbitrary derivations, were impractical. What is needed is a criterion, not on derivations, but on grammars, that can be used to figure out whether the grammar is parseable in a top-down manner with a look-ahead of, say, k. Actually we will concentrate on a look-ahead of k = 1, which is practically a decent thing to do.

The definitions, as mentioned, will help to figure out if a grammar is top-down parseable. Such a grammar will then be called an LL(1) grammar. One could straightforwardly generalize the definition to LL(k) (which would include generalizations of the first and follow sets), but that’s not part of the pensum. Note also: the first and follow set definitions will also be used when discussing bottom-up parsing later.

Besides that, in this section we will also discuss what to do if the grammar is not LL(1). That will lead to a transformation removing left-recursion. That is not the only defect that one wants to transform away. A second problem that is a show-stopper for LL(1)-parsing is known as “common left factors”. If a grammar suffers from that, there is another transformation called left factorization which can remedy that.


First and Follow sets

• general concept for grammars
• certain types of analyses (e.g. parsing):

– info needed about possible “forms” of derivable words,

First-set of A

which terminal symbols can appear at the start of strings derived from a given non-terminal A

Follow-set of A

Which terminals can follow A in some sentential form.

Remarks

• sentential form: word derived from the grammar’s starting symbol
• later: different algos for first and follow sets, for all non-terminals of a given grammar
• mostly straightforward
• one complication: nullable symbols (non-terminals)
• Note: those sets depend on the grammar, not the language

First sets

Definition 4.3.1 (First set). Given a grammar G and a non-terminal A. The first-set of A, written FirstG(A), is defined as

FirstG(A) = {a | A⇒∗G aα, a ∈ ΣT }+ {ε | A⇒∗G ε} . (4.2)

Definition 4.3.2 (Nullable). Given a grammar G. A non-terminal A ∈ ΣN is nullable, if A ⇒∗ ε.

Nullable

The definition here of being nullable refers to a non-terminal symbol. When concentrating on context-free grammars, as we do for parsing, that’s basically the only interesting case. In principle, one can define the notion of being nullable analogously for arbitrary words from the whole alphabet Σ = ΣT + ΣN. The form of productions in CFGs makes it obvious that the only words which actually may be nullable are words containing only non-terminals. Once a terminal is derived, it can never be “erased”. It’s equally easy to see that a word α ∈ Σ∗N is nullable iff all its non-terminal symbols are nullable. The same remarks apply to context-sensitive (but not general) grammars.


For level-0 grammars in the Chomsky hierarchy, also words containing terminal symbols may be nullable, and nullability of a word, like most other properties in that setting, becomes undecidable.

First and follow sets

One point worth noting is that the first and the follow sets, while seemingly quite similar, differ in one important aspect (the follow set definition will come later). The first set is about words derivable from a given non-terminal A. The follow set is about words derivable from the starting symbol! As a consequence, non-terminals A which are not reachable from the grammar’s starting symbol have, by definition, an empty follow set. In contrast, non-terminals unreachable from a/the start symbol may well have a non-empty first-set. In practice, a grammar containing unreachable non-terminals is ill-designed, so that distinguishing feature in the definition of the first and the follow set for a non-terminal may not matter so much. Nonetheless, when implementing the algos for those sets, those subtle points do matter! In general, to avoid all those fine points, one works with grammars satisfying a number of common-sense restrictions. One example is so-called reduced grammars, where, informally, all symbols “play a role” (all are reachable, all can derive into a word of terminals).

Examples

• Cf. the Tiny grammar
• in Tiny, as in most languages

First(if -stmt) = {”if”}

• in many languages:

First(assign-stmt) = {identifier, ”(”}

• typical Follow (see later) for statements:

Follow(stmt) = {”; ”, ”end”, ”else”, ”until”}

Remarks

• note: special treatment of the empty word ε
• in the following: if grammar G is clear from the context

  – ⇒∗ for ⇒∗G
  – First for FirstG
  – . . .

• definition so far: “top-level”, for the start-symbol only
• next: a more general definition

  – definition of the First set of arbitrary symbols (and even words)


  – and also: definition of First for a symbol in terms of First for “other symbols” (connected by productions)

⇒ recursive definition

A more algorithmic/recursive definition

• grammar symbol X: terminal or non-terminal or ε

Definition 4.3.3 (First set of a symbol). Given a grammar G and grammar symbol X. The first-set of X, written First(X), is defined as follows:

1. If X ∈ ΣT + {ε}, then First(X) contains X.
2. If X ∈ ΣN: For each production

   X → X1X2 . . . Xn

   a) First(X) contains First(X1) \ {ε}
   b) If, for some i < n, all of First(X1), . . . , First(Xi) contain ε, then First(X) contains First(Xi+1) \ {ε}.
   c) If all of First(X1), . . . , First(Xn) contain ε, then First(X) contains {ε}.

Recursive definition of First?

The following discussion may be ignored, if wished. Even if the details and the theory behind it are beyond the scope of this lecture, it is worth considering the above definition more closely. One may even consider whether it is a definition at all (resp. in which way it is a definition).

One naive first impression may be: it’s a kind of “functional definition”, i.e., the above Definition 4.3.3 gives a recursive definition of the function First. As discussed later, everything gets rather simpler if we would not have to deal with nullable words and ε-productions. For the point being explained here, let’s assume that there are no such productions and get rid of the special cases cluttering up Definition 4.3.3. Removing the clutter gives the following simplified definition:

Definition 4.3.4 (First set of a symbol (no ε-productions)). Given a grammar G and grammar symbol X. The First-set of X ≠ ε, written First(X), is defined as follows:

1. If X ∈ ΣT, then First(X) ⊇ {X}.
2. If X ∈ ΣN: For each production

X → X1X2 . . . Xn ,

First(X) ⊇ First(X1).

Compared to the previous definition, I did the following minor adaptation (apart from cleaning up the ε’s): I replaced the English word “contains” with the superset relation symbol ⊇.

Now, with Definition 4.3.4 as a simplified version of the original definition, made slightly more explicit: in which way is that a definition at all?


For being a definition of First(X), it seems awfully lax. Already in (1), it “defines” that First(X) should “at least contain X”. A similar remark applies to case (2) for non-terminals. Those two requirements are as such well-defined, but they don’t define First(X) in a unique manner! Definition 4.3.4 defines what the set First(X) should at least contain!

So, in a nutshell, one should not consider Definition 4.3.4 a “recursive definition of First(X)” but rather

“a definition of recursive conditions on First(X), which, when satisfied, ensure that First(X) contains at least all the terminal symbols we are after”.

What we are really after is the smallest First(X) which satisfies those conditions of the definition.

Now one may think: the problem is that the definition is just “sloppy”. Why does it use the word “contain” resp. the ⊇-relation, instead of requiring equality, i.e., =? While plausible at first sight, unfortunately, whether we use ⊇ or set equality = in Definition 4.3.4 does not change anything.

Anyhow, the core of the matter is not = vs. ⊇. The core of the matter is that “Definition” 4.3.4 is circular!

Considering that definition of First(X) as a plain functional and recursive definition of a procedure misses the fact that grammars can, of course, contain “loops”. Actually, it’s almost a characterizing feature of reasonable context-free grammars (or even regular grammars) that they contain “loops” – that’s the way they can describe infinite languages.

In that case, obviously, considering Definition 4.3.3 with = instead of ⊇ as the recursive definition of a function leads immediately to an “infinite regress”: the recursive function won’t terminate. So again, that’s not helpful.

Technically, such a definition can be called a recursive constraint (or a constraint system, if one considers the whole definition to consist of more than one constraint, namely for different symbols and for different productions).

For words

Definition 4.3.5 (First set of a word). Given a grammar G and a word α. The first-set of

α = X1 . . . Xn ,

written First(α) is defined inductively as follows:

1. First(α) contains First(X1) \ {ε}
2. for each i = 2, . . . , n, if First(Xk) contains ε for all k = 1, . . . , i − 1, then First(α) contains First(Xi) \ {ε}
3. If all of First(X1), . . . , First(Xn) contain ε, then First(α) contains {ε}.


Concerning the definition of First

The definition here is of course very close to the definition of the inductive case of the previous definition, i.e., the first set of a non-terminal. Whereas the previous definition was recursive, this one is not.

Note that the word α may be empty, i.e., n = 0. In that case, the definition gives First(ε) = {ε} (due to the 3rd condition in the above definition). In the definitions, the empty word ε plays a specific, mostly technical role. The original, non-algorithmic version, Definition 4.3.1, already makes it clear that the first set does not precisely correspond to the set of terminal symbols that can appear at the beginning of a derivable word. The correct intuition is that it corresponds to that set of terminal symbols together with ε as a special case, namely when the initial symbol is nullable.

That may raise two questions. 1) Why does the definition make that a special case, as opposed to just using the more “straightforward” definition without taking care of the nullable situation? 2) What role does ε play here?

The second question has no “real” answer; it’s a choice which is being made, and which could be made differently. What the definition from Definition 4.3.1 in fact says is: “give the set of terminal symbols at the start of derivable words and indicate whether or not the start symbol is nullable.” The information might as well be interpreted as a pair consisting of a set of terminals and a boolean (indicating nullability). The fact that the definition of First as presented here uses ε to indicate that additional information is a particular choice of representation (probably due to historical reasons: “they always did it like that . . . ”). For instance, the influential “Dragon book” [1, Section 4.4.2] uses the ε-based definition. The textbooks [3] (and its variants) don’t use ε as an indication for nullability.

In order that this definition works, it is important, obviously, that ε is not a terminal symbol, i.e., ε /∈ ΣT (which is generally assumed).

Having clarified 2), namely that using ε is a matter of conventional choice, there remains question 1): why bother to include nullability information in the definition of the first-set at all, why bother with the “extra information” of nullability? For that, there is a real technical reason: For the recursive definitions to work, we need the information whether or not a symbol or word is nullable, therefore it’s given back as information.

As a further point concerning the first sets: The slides give 2 definitions, Definition 4.3.1 and Definition 4.3.3. Of course they are intended to mean the same. The second version is a more recursive or algorithmic version, i.e., closer to a recursive algorithm. If one takes the first one as the “real” definition of that set, in principle we would be obliged to prove that both versions actually describe the same set (resp. that the recursive definition implements the original definition). The same remark applies also to the non-recursive/iterative code that is shown next.

Pseudo code


for all X ∈ ΣT ∪ {ε} do
    First[X] := {X}
end;

for all non-terminals A do
    First[A] := {}
end
while there are changes to any First[A] do
    for each production A → X1 . . . Xn do
        k := 1;
        continue := true
        while continue = true and k ≤ n do
            First[A] := First[A] ∪ First[Xk] \ {ε}
            if ε /∈ First[Xk] then continue := false
            k := k + 1
        end;
        if continue = true
        then First[A] := First[A] ∪ {ε}
    end;
end

If only we could do away with special cases for the empty words . . .

for a grammar without ε-productions.¹

for all non-terminals A do
    First[A] := {}          // counts as a change
end
while there are changes to any First[A] do
    for each production A → X1 . . . Xn do
        First[A] := First[A] ∪ First[X1]
    end;
end

This simplification is added for illustration. What makes the algorithm slightly more than just immediate is the fact that symbols can be nullable (non-terminals can be nullable). If we don’t have ε-productions, then no symbol is nullable. Under this simplifying assumption, the algorithm looks quite a bit simpler. We don’t need to check for nullability (i.e., we don’t need to check if ε is part of the first sets), and moreover, we can do without the inner while-loop walking down the right-hand side of the production as long as the symbols turn out to be nullable (since we know they are not).
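To make the iterative computation concrete, here is a small sketch in Python (Python is not one of the languages used elsewhere in this script; the representation of a grammar as a dict from non-terminals to lists of right-hand sides is an assumption made only for this illustration). It mirrors the first pseudo-code above, including the treatment of ε and nullable symbols.

# A minimal sketch of the iterative First-set computation, including nullable
# symbols.  Assumed representation: grammar maps each non-terminal to a list
# of right-hand sides (each a list of symbols); everything else is a terminal.
EPS = "ε"

def first_sets(grammar):
    nonterminals = set(grammar)
    first = {A: set() for A in nonterminals}

    def first_of(sym):
        # terminals (and ε) simply contain themselves
        return first[sym] if sym in nonterminals else {sym}

    changed = True
    while changed:                       # repeat until nothing changes any more
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                new = set(first[A])
                all_nullable = True
                for X in rhs:            # walk X1 ... Xn while the prefix is nullable
                    new |= first_of(X) - {EPS}
                    if EPS not in first_of(X):
                        all_nullable = False
                        break
                if all_nullable:         # also covers an empty right-hand side
                    new.add(EPS)
                if new != first[A]:
                    first[A], changed = new, True
    return first

# The expression grammar (4.4); "n" stands for the number token.
expr_grammar = {
    "exp":    [["exp", "addop", "term"], ["term"]],
    "addop":  [["+"], ["-"]],
    "term":   [["term", "mulop", "factor"], ["factor"]],
    "mulop":  [["*"]],
    "factor": [["(", "exp", ")"], ["n"]],
}
# first_sets(expr_grammar) yields {(, n} for exp, term, factor; {+, -}; {*}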

Example expression grammar (from before)

exp    → exp addop term | term
addop  → + | −
term   → term mulop factor | factor
mulop  → ∗
factor → ( exp ) | number
                                                (4.3)

¹ A production of the form A → ε.


Example expression grammar (expanded)

exp    → exp addop term
exp    → term
addop  → +
addop  → −
term   → term mulop factor
term   → factor
mulop  → ∗
factor → ( exp )
factor → n
                                                (4.4)

“Run” of the algo

nr  production                    pass 1   pass 2   pass 3
1   exp → exp addop term
2   exp → term
3   addop → +
4   addop → −
5   term → term mulop factor
6   term → factor
7   mulop → ∗
8   factor → ( exp )
9   factor → n

How the algo works

The first thing to observe: the grammar does not contain ε-productions. That, very fortunately, simplifies matters considerably! It should also be noted that the table from above is a schematic illustration of a particular execution strategy of the pseudo-code. The pseudo-code itself leaves out details of the evaluation, notably the order in which non-deterministic choices are made by the code. The main body of the pseudo-code is given by two nested loops. Even if details (of data structures) are not given, one possible way of interpreting the code is as follows: the outer while-loop figures out which of the entries in the First-array have “recently” been changed, remembers that in a “collection” of non-terminals A, and that collection is then worked off (i.e. iterated over) in the inner loop. Doing it like that leads to the “passes” shown in the table. In other words, the two dimensions of the table represent the fact that there are 2 nested loops.


Having said that: it’s not the only way to “traverse the productions of the grammar”. One could arrange a version with only one loop and a collection data structure, which contains all productions A → X1 . . . Xn such that First[A] has “recently been changed”. That data structure therefore contains all the productions that “still need to be treated”. Such a collection data structure containing “all the work still to be done” is known as a work-list, even if it need not technically be a list. It can be a queue, i.e., following a FIFO strategy, it can be a stack (realizing LIFO), or some other strategy or heuristic. Possible is also a randomized, i.e., non-deterministic strategy (which is sometimes known as chaotic iteration).

“Run” of the algo

Collapsing the rows & final result

• results per pass:


exp     {(, n}
addop   {+, −}
term    {(, n}
mulop   {∗}
factor  {(, n}

• final results (at the end of pass 3):

        First[_]
exp     {(, n}
addop   {+, −}
term    {(, n}
mulop   {∗}
factor  {(, n}

The tables show 3 passes, and the results correspond to the state at the end of the 3rd pass. Technically, the algorithm cannot “know” that at the end of the 3rd pass the result has been achieved. It has to run a 4th time, at which point it’s clear that there is no change from the 3rd round to the 4th round, which also means that any further rounds would not give more information. The information has stabilized (at round 3), and that becomes clear at round 4 (at which point the algo terminates).

Work-list formulation

for all non-terminals A do
    First[A] := {}
end
WL := P                      // all productions
while WL ≠ ∅ do
    remove one (A → X1 . . . Xn) from WL
    if First[A] ≠ First[A] ∪ First[X1]
    then First[A] := First[A] ∪ First[X1]
         add all productions (B → A X′2 . . . X′m), i.e. those whose right-hand side starts with A, to WL
    else skip
end

• no ε-productions
• worklist here: “collection” of productions
• alternatively, with a slight reformulation: a “collection” of non-terminals instead is also possible
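For illustration, here is a corresponding worklist sketch in Python (same assumed grammar representation as in the earlier sketch; ε-free case only). The helper map starts_with, recording for each non-terminal the productions whose right-hand side begins with it, is an assumption of this sketch, not part of the pseudo-code above.

from collections import deque

def first_sets_worklist(grammar):
    # ε-free case: First[A] is seeded from First[X1] of each production A -> X1 ...
    nonterminals = set(grammar)
    first = {A: set() for A in nonterminals}
    prods = [(A, rhs) for A, rhss in grammar.items() for rhs in rhss]
    # productions whose right-hand side starts with a given non-terminal
    starts_with = {A: [p for p in prods if p[1][:1] == [A]] for A in nonterminals}

    wl = deque(prods)                    # the worklist: all productions initially
    while wl:
        A, rhs = wl.popleft()
        x1 = rhs[0]
        contrib = first[x1] if x1 in nonterminals else {x1}
        if not contrib <= first[A]:
            first[A] |= contrib
            wl.extend(starts_with[A])    # re-examine productions that start with A
    return first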

Page 131: CourseScript - uio.no

128 4 Parsing4.3 First and follow sets

Follow sets

Definition 4.3.6 (Follow set). Given a grammar G with start symbol S, and a non-terminal A.

The follow-set of A, written FollowG(A), is

FollowG(A) = {a | S $ ⇒∗G α1 A a α2, a ∈ ΣT + {$}} .          (4.5)

• $ as special end-marker

• typically: start symbol not on the right-hand side of a production

Special symbol $

The symbol $ can be interpreted as an “end-of-file” (EOF) token. It’s standard to assume that the start symbol S does not occur on the right-hand side of any production. In that case, the follow set of S contains $ as its only element. Note that the follow set of other non-terminals may well contain $.

As said, it’s common to assume that S does not appear on the right-hand side of any production. For a start, S won’t occur “naturally” there anyhow in practical programming-language grammars. Furthermore, with S occurring only on the left-hand side, the grammar has a slightly nicer shape insofar as it simplifies its algorithmic treatment. It’s basically the same reason why one sometimes assumes that, for instance, control-flow graphs have one “isolated” entry node (and/or an isolated exit node), where being isolated means that no edge in the graph goes (back) into the entry node; for exit nodes, the condition means no edge goes out. In other words, while the graph can of course contain loops or cycles, the entry node is not part of any such loop. That is done likewise to (slightly) simplify the treatment of such graphs. Slightly more generally, and also connected to control-flow graphs: similar conditions about the shape of loops (not just for the entry and exit nodes) have been worked out, which play a role in loop optimization and intermediate representations of a compiler, such as static single assignment forms.

Coming back to the condition here concerning $: even if a grammar would not immediately adhere to that condition, it’s trivial to transform it into that form by adding another symbol and making that the new start symbol, replacing the old one (for instance, adding a production S′ → S and using S′ as the new start symbol). We will do that sometimes in exercises and examples later.

Follow sets, recursively

Definition 4.3.7 (Follow set of a non-terminal). Given a grammar G and non-terminal A. The Follow-set of A, written Follow(A), is defined as follows:

1. If A is the start symbol, then Follow(A) contains $.
2. If there is a production B → αAβ, then Follow(A) contains First(β) \ {ε}.
3. If there is a production B → αAβ such that ε ∈ First(β), then Follow(A) contains Follow(B).

• $: special “end marker” symbol, only ever contained in follow sets


More imperative representation in pseudo code

Follow[S] := {$}
for all non-terminals A ≠ S do
    Follow[A] := {}
end
while there are changes to any Follow-set do
    for each production A → X1 . . . Xn do
        for each Xi which is a non-terminal do
            Follow[Xi] := Follow[Xi] ∪ (First(Xi+1 . . . Xn) \ {ε})
            if ε ∈ First(Xi+1Xi+2 . . . Xn)
            then Follow[Xi] := Follow[Xi] ∪ Follow[A]
        end
    end
end

Note! First(ε) = {ε}

“Run” of the algo

nr  production                    pass 1   pass 2
1   exp → exp addop term
2   exp → term
5   term → term mulop factor
6   term → factor
8   factor → ( exp )

Explanations

The table omits productions which have terminals only on their right-hand side. The algo does not do anything in those cases anyway. The grammar does not contain nullable symbols, which means the algo is a bit simpler. Remember that the First procedure uses ε for nullable symbols. However, the First procedure here is used not on non-terminals, but on words. That word Xi+1 . . . Xn may itself be ε, and that is where the last clause of the algo kicks in.
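A corresponding Python sketch of the Follow computation (same assumed grammar representation as before; it reuses first_sets from the earlier sketch and a helper first_of_word for the first set of a word, as in Definition 4.3.5).

EPS, END = "ε", "$"

def first_of_word(word, first, nonterminals):
    # First of a word X1 ... Xn, as in Definition 4.3.5; First of the empty word is {ε}
    result = set()
    for X in word:
        fx = first[X] if X in nonterminals else {X}
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    result.add(EPS)
    return result

def follow_sets(grammar, start, first):
    nonterminals = set(grammar)
    follow = {A: set() for A in nonterminals}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X not in nonterminals:
                        continue
                    rest = first_of_word(rhs[i + 1:], first, nonterminals)
                    new = follow[X] | (rest - {EPS})
                    if EPS in rest:            # rest is nullable: Follow(A) flows in
                        new |= follow[A]
                    if new != follow[X]:
                        follow[X], changed = new, True
    return follow

# e.g. follow_sets(expr_grammar, "exp", first_sets(expr_grammar))
# gives Follow(exp) = {$, ), +, -}, Follow(term) = {$, ), +, -, *}, ...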


Recursion vs. iteration

“Run” of the algo

Illustration of first/follow sets

• red arrows: illustration of the information flow in the algos
• run of Follow:

  – relies on First
  – in particular a ∈ First(E) (right tree)

• $ ∈ Follow(B)


The two trees are just meant as illustrations (but still correct). The grammar itself is not given, but the trees show relevant productions.

In case of the tree on the left (for the first sets): A is the root and must therefore be the start symbol. Since the root A has three children C, D, and E, there must be a production A → C D E, etc.

The first-set definition would “immediately” detect that F has a in its first-set, i.e., all words derivable starting from F start with an a (and actually with no other terminal, as F is mentioned only once in that sketch of a tree). At any rate, only after determining that a is in the first-set of F can it enter the first-set of C, etc., and in this way the information percolates upwards the tree.

Note that the tree is specific insofar as all the internal nodes are different non-terminals. In more realistic settings, different nodes would represent the same non-terminal. Also in that case, one can think of the information as percolating up.

More complex situation (nullability)

In the tree on the left, B, M, N, C, and F are nullable. That is marked in that the resulting first sets contain ε. There will also be exercises about that.

4.4 Massaging grammars

We have learned about the first- and follow-sets as “tools” to diagnose the shape of a grammar. In particular, the follow-set is connected with the notion of look-ahead, which we touched upon earlier when sketching how a parser generally works: it has to make decisions concerning which “derivation step” is relevant to build up the parse tree while eating through the token stream. The general picture applies to both bottom-up and top-down parsing, which implies that the first- and follow-sets play a role as “diagnosis instrument” for both kinds of parsing.


By diagnosis, I mean in particular: the concepts can be used to check whether or not it’s possible to parse a given grammar with a look-ahead of one symbol. The whole picture could more or less straightforwardly be generalized for a longer look-ahead: top-down parsing or bottom-up parsing with a look-ahead of k would require appropriate generalizations of the first-sets and follow-sets to speak not about k = 1 symbol but about longer words. In practice, one is mostly content with k = 1, which is also why we don’t bother about generalizing the setting. And actually, if one understands the concept of one look-ahead, nothing conceptually changes when going to k > 1.

As said, the first- and follow-sets are relevant for both top-down and bottom-up parsers. Here, however, we are in the part covering top-down parsing, which has slightly different challenges than bottom-up parsing. Before we actually come to top-down parsing, we discuss what the problematic patterns in grammars are, i.e., patterns that top-down parsers have trouble with, and we use the notions of first and follow sets to shed light on that. The two troublesome patterns we will discuss that way are left-recursive grammars and grammars with common left factors. We will also discuss how to massage troublesome grammars so as to get rid of those patterns.

Some forms of grammars are less desirable than others

• left-recursive production:

A→ Aα

more precisely: example of immediate left-recursion

• 2 productions with common “left factor”:

A → αβ1 | αβ2    where α ≠ ε

Left-recursive and unfactored grammars

At the current point in the presentation, the importance of those conditions might not yet be clear (but remember the discussion around “oracular” derivations). In general, it’s that certain kinds of parsing techniques require the absence of left-recursion and of common left-factors. Note also that a left-linear production is a special case of a production with immediate left recursion. In particular, recursive descent parsers would not work with left-recursion. For that kind of parser, left-recursion needs to be avoided.

Why common left-factors are undesirable should at least intuitively be clear: we see this also on the next slide (the two forms of conditionals). It’s intuitively clear that a parser, when encountering an if (and the following boolean condition and perhaps the then clause), cannot decide immediately which rule applies. It should also be intuitively clear that that’s what a parser does: it inputs a stream of tokens and tries to figure out which sequence of rules is responsible for that stream (or else rejects the input). The amount of additional information, at each point of the parsing process, needed to determine which rule is responsible next is called the look-ahead. Of course, if the grammar is ambiguous, no


unique decision may be possible (no matter the look-ahead). Ambiguous grammars are generally unwelcome as specifications for parsers.

On a very high level, the situation can be compared with the situation for regular languages/automata. Non-deterministic automata may be ok for specifying a language (they can more easily be connected to regular expressions), but they are not so useful for specifying a scanner program. There, deterministic automata are necessary. Here, grammars with left-recursion, grammars with common left factors, or even ambiguous grammars may be ok for specifying a context-free language. For instance, ambiguity may be caused by unspecified precedences or non-associativity. Nonetheless, how to obtain a grammar representation suitable to be more or less directly translated to a parser is an issue less clear-cut compared to regular languages. Already the question whether or not a given grammar is ambiguous is undecidable. If ambiguous, there’d be no point in turning it into a practical parser. Also the question of what an acceptable form of grammar is depends on what class of parsers one is after (like a top-down parser or a bottom-up parser).

Some simple examples for both

• left-recursion

exp → exp + term

• classical example for common left factor: rules for conditionals

if-stmt → if ( exp ) stmt end
        | if ( exp ) stmt else stmt end

We had a version of conditionals earlier, there

Transforming the expression grammar

exp    → exp addop term | term
addop  → + | −
term   → term mulop factor | factor
mulop  → ∗
factor → ( exp ) | number

• obviously left-recursive
• remember: this variant is used for proper associativity!


After removing left recursion

exp    → term exp′
exp′   → addop term exp′ | ε
addop  → + | −
term   → factor term′
term′  → mulop factor term′ | ε
mulop  → ∗
factor → ( exp ) | n

• still unambiguous
• unfortunate: associativity now different!
• note also: ε-productions & nullability

Left-recursion removal

Left-recursion removal

A transformation process to turn a CFG into one without left recursion

Explanation

• price: ε-productions
• 3 cases to consider

  – immediate (or direct) recursion
    ∗ simple
    ∗ general

– indirect (or mutual) recursion

Left-recursion removal: simplest case

Before

A → Aα | β

After

A → βA′

A′ → αA′ | ε


Schematic representation

A → Aα | β

[left-leaning derivation tree: A expands repeatedly to A α, ending in β at the bottom left, so the derived word reads β α α α]

A → βA′

A′ → αA′ | ε

[right-leaning derivation tree: A → β A′, with A′ expanding repeatedly to α A′ and finally to ε, giving the same word β α α α]

Remarks

• both grammars generate the same (context-free) language (= set of words over terminals)

• in EBNF:

A→ β{α}

• two negative aspects of the transformation
  1. generated language unchanged, but: change in the resulting structure (parse-tree), in other words a change in associativity, which may result in a change of meaning
  2. introduction of ε-productions

• more concrete example for such a production: grammar for expressions

Left-recursion removal: immediate recursion (multiple)

Before

A → Aα1 | . . . | Aαn
  | β1 | . . . | βm

After

A  → β1A′ | . . . | βmA′
A′ → α1A′ | . . . | αnA′ | ε


EBNF

Note: can be written in EBNF as:

A→ (β1 | . . . | βm)(α1 | . . . | αn)∗
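A small Python sketch of this transformation for a single non-terminal (the primed name A′ and the grammar representation as a dict of right-hand-side lists are choices made only for this illustration).

EPS = "ε"

def remove_immediate_left_recursion(grammar, A):
    """Replace A -> A α1 | ... | A αn | β1 | ... | βm by
       A -> β1 A' | ... | βm A'   and   A' -> α1 A' | ... | αn A' | ε."""
    recursive = [rhs[1:] for rhs in grammar[A] if rhs[:1] == [A]]   # the αi
    rest      = [rhs for rhs in grammar[A] if rhs[:1] != [A]]       # the βj
    if not recursive:
        return grammar                        # nothing to do
    Aprime = A + "'"
    new = dict(grammar)
    new[A] = [beta + [Aprime] for beta in rest]
    new[Aprime] = [alpha + [Aprime] for alpha in recursive] + [[EPS]]
    return new

# Example: exp -> exp addop term | term   becomes
#          exp -> term exp'   and   exp' -> addop term exp' | ε
g = {"exp": [["exp", "addop", "term"], ["term"]]}
print(remove_immediate_left_recursion(g, "exp"))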

Removal of: general left recursion

Assume non-terminals A1, . . . , Am

for i := 1 to m do
    for j := 1 to i−1 do
        replace each grammar rule of the form Ai → Aj β by        // j < i
            rule Ai → α1 β | α2 β | . . . | αk β
            where Aj → α1 | α2 | . . . | αk
            is the current rule(s) for Aj                          // current
    end
    { corresponds to i = j }
    remove, if necessary, immediate left recursion for Ai
end

“current” = rule in the current stage of the algo

Example (for the general case)

Let A = A1, B = A2.

A  → Ba | Aa | c
B  → Bb | Ab | d

A  → BaA′ | cA′
A′ → aA′ | ε
B  → Bb | Ab | d

A  → BaA′ | cA′
A′ → aA′ | ε
B  → Bb | BaA′b | cA′b | d

A  → BaA′ | cA′
A′ → aA′ | ε
B  → cA′bB′ | dB′
B′ → bB′ | aA′bB′ | ε

Left factor removal

• a CFG does not just describe a context-free language
• it is also an intended (indirect) description of a parser for that language
⇒ common left factors are undesirable
• cf.: determinization of automata for the lexer


Simple situation

before
A → αβ | αγ | . . .

after
A  → αA′ | . . .
A′ → β | γ

Example: sequence of statements

sequences of statements

Before
stmt-seq → stmt ; stmt-seq
         | stmt

After
stmt-seq  → stmt stmt-seq′
stmt-seq′ → ; stmt-seq | ε

Example: conditionals

Before
if-stmt → if ( exp ) stmt-seq end
        | if ( exp ) stmt-seq else stmt-seq end

After
if-stmt     → if ( exp ) stmt-seq else-or-end
else-or-end → else stmt-seq end | end

Example: conditionals (without else)

Before
if-stmt → if ( exp ) stmt-seq
        | if ( exp ) stmt-seq else stmt-seq

After
if-stmt       → if ( exp ) stmt-seq else-or-empty
else-or-empty → else stmt-seq | ε


Not all factorization doable in “one step”

Starting point
A → abcB | abC | aE

After 1 step
A  → abA′ | aE
A′ → cB | C

After 2 steps
A   → aA′′
A′′ → bA′ | E
A′  → cB | C

longest left factor

• note: we choose the longest common prefix (= longest left factor) in the first step

Left factorization

while there are changes to the grammar do
    for each nonterminal A do
        let α be a prefix of maximal length that is shared
            by two or more productions for A
        if α ≠ ε
        then
            let A → α1 | . . . | αn be all productions for A
            and suppose that α1, . . . , αk share α,
            so that A → αβ1 | . . . | αβk | αk+1 | . . . | αn,
            that the βj’s share no common prefix, and
            that the αk+1, . . . , αn do not share α.
            replace rule A → α1 | . . . | αn by the rules
                A  → αA′ | αk+1 | . . . | αn
                A′ → β1 | . . . | βk
        end
    end
end

The algorithm is pretty straightforward. The only thing to keep in mind is that what is called α in the pseudo-code needs to be the longest common prefix, and the β’s must include all right-hand sides that start with that (longest common) prefix α.
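A Python sketch of one round of left factorization for a single non-terminal, following the pseudo-code above (the fresh name A′ is again just a choice made for this illustration).

def lcp(xs, ys):
    # longest common prefix of two symbol lists
    i = 0
    while i < len(xs) and i < len(ys) and xs[i] == ys[i]:
        i += 1
    return xs[:i]

def left_factor_once(grammar, A):
    """One step: pick the longest prefix α shared by at least two A-productions
       and factor it out into a fresh non-terminal A'."""
    rhss = grammar[A]
    alpha = max((lcp(r1, r2) for i, r1 in enumerate(rhss) for r2 in rhss[i + 1:]),
                key=len, default=[])
    if not alpha:
        return grammar                                  # nothing to factor
    Aprime = A + "'"
    sharing = [r for r in rhss if r[:len(alpha)] == alpha]
    others  = [r for r in rhss if r[:len(alpha)] != alpha]
    new = dict(grammar)
    new[A] = [alpha + [Aprime]] + others
    new[Aprime] = [r[len(alpha):] or ["ε"] for r in sharing]
    return new

# A -> abcB | abC | aE  (each symbol written as a single character)
g = {"A": [list("abcB"), list("abC"), list("aE")]}
print(left_factor_once(g, "A"))
# {'A': [['a', 'b', "A'"], ['a', 'E']], "A'": [['c', 'B'], ['C']]}  -- as in "After 1 step"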


4.5 LL-parsing (mostly LL(1))

After having covered the more technical definitions of the first and follow sets and the transformations to remove left-recursion resp. common left factors, we go back to top-down parsing, in particular to the specific form of LL(1) parsing.

Additionally, we discuss issues about abstract syntax trees vs. parse trees.

Parsing LL(1) grammars

• this lecture: we don’t do LL(k) with k > 1
• LL(1): particularly easy to understand and to implement (efficiently)
• not as expressive as LR(1) (see later), but still kind of decent

LL(1) parsing principle

Parse from 1) left-to-right (as always anyway), do a 2) left-most derivation and resolve the “which-right-hand-side” non-determinism by 3) looking 1 symbol ahead.

• two flavors of LL(1) parsing here (both are top-down parsers)
  – recursive descent
  – table-based LL(1) parser

• predictive parsers

If one wants to be very precise: it’s recursive descent with one look-ahead and without backtracking. It’s the single most common case for recursive descent parsers. Longer look-aheads are possible, but less common. Technically, even allowing backtracking can be done using recursive descent as the principle (even if not done in practice).

Sample expression grammar again

factors and terms

exp    → term exp′
exp′   → addop term exp′ | ε
addop  → + | −
term   → factor term′
term′  → mulop factor term′ | ε
mulop  → ∗
factor → ( exp ) | n

(4.6)


Look-ahead of 1: straightforward, but not trivial

• look-ahead of 1:
  – not much of a look-ahead, anyhow
  – just the “current token”
⇒ read the next token, and, based on that, decide
• but: what if there are no more symbols?
⇒ read the next token if there is one, and decide based on the token or else on the fact that there is none left²

Example: 2 productions for non-terminal factor

factor → ( exp ) | number

The situation here is more or less trivial, but that’s not all there is to LL(1) . . .

Recursive descent: general set-up

1. global variable, say tok, representing the “current token” (or a pointer to the current token)

2. parser has a way to advance that to the next token (if there’s one)

Idea

For each non-terminal nonterm, write one procedure which:

• succeeds, if, starting at the current token position, the “rest” of the token stream starts with a syntactically correct word of terminals representing nonterm

• fails otherwise

• ignored (for now): when doing the above successfully, build the AST for the accepted non-terminal.

Recursive descent (in C-like)

// method factor for non-terminal factor
final int LPAREN=1, RPAREN=2, NUMBER=3, PLUS=4, MINUS=5, TIMES=6;

void factor() {
    switch (tok) {
    case LPAREN: eat(LPAREN); expr(); eat(RPAREN); break;
    case NUMBER: eat(NUMBER); break;
    }
}

² Sometimes a “special terminal” $ is used to mark the end (as mentioned).


Recursive descent (in ocaml)

type token = LPAREN | RPAREN | NUMBER
           | PLUS | MINUS | TIMES

let factor () =                       (* function for factors *)
  match !tok with
    LPAREN -> eat (LPAREN); expr (); eat (RPAREN)
  | NUMBER -> eat (NUMBER)

Slightly more complex

• previous 2 rules for factor : situation not always as immediate as that

LL(1) principle (again)

given a non-terminal, the next token must determine the choice of right-hand side.

When talking about the next token, it must be the next token/terminal in the sense of First, but it need not be a token directly mentioned on the right-hand sides of the corresponding rules.

⇒ definition of the First set

Lemma 4.5.1 (LL(1) (without nullable symbols)). A reduced context-free grammar without nullable non-terminals is an LL(1)-grammar iff for all non-terminals A and for all pairs of productions A → α1 and A → α2 with α1 ≠ α2:

First1(α1) ∩ First1(α2) = ∅ .
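For the ε-free case, the lemma can be turned directly into a check. A Python sketch (reusing first_sets and first_of_word from the earlier sketches):

from itertools import combinations

def is_ll1_no_epsilon(grammar, first):
    # Lemma 4.5.1: for every non-terminal, the First-sets of any two of its
    # right-hand sides must be disjoint (ε-free grammars only).
    nts = set(grammar)
    for A, rhss in grammar.items():
        for r1, r2 in combinations(rhss, 2):
            f1 = first_of_word(r1, first, nts)
            f2 = first_of_word(r2, first, nts)
            if f1 & f2:
                return False, (A, r1, r2)        # witness of the conflict
    return True, None

# The left-recursive expression grammar fails: exp -> exp addop term and
# exp -> term both can start with "(" or "n".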

Common problematic situation

• often: common left factors problematic

if-stmt → if ( exp ) stmt
        | if ( exp ) stmt else stmt

• requires a look-ahead of (at least) 2
• ⇒ try to rearrange the grammar

1. Extended BNF ([9] suggests that)

   if-stmt → if ( exp ) stmt [ else stmt ]

2. left-factoring:

if-stmt   → if ( exp ) stmt else-part
else-part → ε | else stmt


Recursive descent for left-factored if -stmt

procedure ifstmt()
begin
    match("if");
    match("(");
    exp();
    match(")");
    stmt();
    if token = "else"
    then match("else");
         stmt()
    end
end;

Left recursion is a no-go

factors and terms

exp    → exp addop term | term
addop  → + | −
term   → term mulop factor | factor
mulop  → ∗
factor → ( exp ) | number
                                                (4.7)

• consider treatment of exp: First(exp)?

• whatever is in First(term) is also in First(exp)³
• ⇒ recursion

Left-recursion

Left-recursive grammar never works for recursive descent.

Removing left recursion may help

exp    → term exp′
exp′   → addop term exp′ | ε
addop  → + | −
term   → factor term′
term′  → mulop factor term′ | ε
mulop  → ∗
factor → ( exp ) | n

³ And it would not help to look ahead more than 1 token either.


procedure exp()
begin
    term();
    exp′()
end

procedure exp′()
begin
    case token of
        "+": match("+");
             term();
             exp′()
        "−": match("−");
             term();
             exp′()
    end
end

Recursive descent works, alright, but . . .

[parse tree for 1 + 2 ∗ (3 + 4) under the transformed grammar: a deep, right-leaning chain of exp, term, factor, exp′, term′ nodes with many ε leaves]

. . . who wants this form of trees?

Left-recursive grammar with nicer parse trees

1 + 2 ∗ (3 + 4)


[parse tree for 1 + 2 ∗ (3 + 4) under the original left-recursive grammar: exp splits into exp addop term, with the multiplication grouped inside a term mulop factor subtree]

The simple “original” expression grammar (even nicer)

Flat expression grammar

exp → exp op exp | ( exp ) | number
op  → + | − | ∗

1 + 2 ∗ (3 + 4)

[parse tree for 1 + 2 ∗ (3 + 4) under the flat grammar: nested exp op exp nodes, with the parenthesized sum as one subtree]

Associativity problematic

The issues here, including associativity, have been touched upon already when discussing ambiguity.

Precedence & assoc.

exp    → exp addop term | term
addop  → + | −
term   → term mulop factor | factor
mulop  → ∗
factor → ( exp ) | number


Formula

3 + 4 + 5

parsed “as”

(3 + 4) + 5

3− 4− 5

parsed “as”

(3− 4)− 5

Tree

[two parse trees under the left-recursive grammar, for 3 + 4 + 5 and 3 − 4 − 5: in both, the left operand of the top-most addop is itself an exp subtree, i.e., the trees lean to the left, matching the readings (3 + 4) + 5 and (3 − 4) − 5]

Now use the grammar without left-rec (but right-rec instead)

No left-rec.

exp    → term exp′
exp′   → addop term exp′ | ε
addop  → + | −
term   → factor term′
term′  → mulop factor term′ | ε
mulop  → ∗
factor → ( exp ) | n


Formula

3− 4− 5

parsed “as”

3− (4− 5)

Tree

[parse tree for 3 − 4 − 5 under the grammar without left recursion: the tree leans to the right, with exp′ hanging off exp′, so the structure corresponds to 3 − (4 − 5)]

But if we need a “left-associative” AST?

• we want (3− 4)− 5, not 3− (4− 5)

[the same right-leaning parse tree for 3 − 4 − 5, annotated with intermediate values 3, −1 (= 3 − 4) and −6 (= (3 − 4) − 5): passing the value computed “so far” down the tree yields the correct left-associative result]


Code to “evaluate” ill-associated such trees correctly

function exp′ (valsofar : int) : int;
begin
    if token = '+' or token = '−'
    then
        case token of
            '+': match('+');
                 valsofar := valsofar + term();
            '−': match('−');
                 valsofar := valsofar − term();
        end case;
        return exp′(valsofar);
    else return valsofar
end;

• extra “accumulator” argument valsofar
• instead of evaluating the expression, one could build the AST with the appropriate associativity instead:
• instead of valueSoFar, one would have rootOfTreeSoFar

The example parses expressions and evaluates them while doing that. In most cases in a full-fledged parser, one does not need a value as the output of a successful parse-run, but an AST. The issue remains, though, that sometimes the associativity comes out “the wrong way”. The “accumulator” pattern illustrated here in the evaluation setting can help out with building the AST as well.
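A Python sketch of that accumulator idea for building a left-associative AST during recursive descent (the Parser class, its token list input, and the use of tuples as AST nodes are assumptions made only for this illustration).

# Recursive-descent parsing of "term (('+'|'-') term)*" that builds a
# left-associative AST, passing the tree built "so far" through the loop.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def token(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "$"

    def match(self, t):
        assert self.token() == t, f"expected {t}, got {self.token()}"
        self.pos += 1

    def exp(self):
        tree_so_far = self.term()              # the accumulator ("rootOfTreeSoFar")
        while self.token() in ("+", "-"):
            op = self.token()
            self.match(op)
            tree_so_far = (op, tree_so_far, self.term())   # extend to the left
        return tree_so_far

    def term(self):                            # only numbers, to keep the sketch small
        n = self.token()
        self.match(n)
        return int(n)

print(Parser(["3", "-", "4", "-", "5"]).exp())   # ('-', ('-', 3, 4), 5)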

“Designing” the syntax, its parsing, & its AST

trade offs:

1. starting from: design of the language, how much of the syntax is left “implicit”⁴
2. which language class? Is LL(1) good enough, or is something stronger wanted?
3. how to parse? (top-down, bottom-up, etc.)
4. parse-tree/concrete syntax trees vs. ASTs

AST vs. CST

• once steps 1.–3. are fixed: parse-trees fixed!
• parse-trees = essence of the grammatical derivation process
• often: parse trees only “conceptually” present in a parser
• AST:

  – abstraction of the parse trees
  – essence of the parse tree

⁴ Lisp is famous/notorious in that its surface syntax is more or less an explicit notation for the ASTs. Not that it was originally planned like this . . .


  – actual tree data structure, as output of the parser
  – typically on-the-fly: the AST is built while the parser parses, i.e. while it executes a derivation in the grammar

AST vs. CST/parse tree

Parser "builds" the AST data structure while "doing" the parse tree

AST: How “far away” from the CST?

• AST: the only thing relevant for later phases ⇒ better be clean . . .
• AST “=” CST?

  – building the AST becomes straightforward
  – a possible choice, if the grammar is not designed “weirdly”

[again the right-leaning parse tree for 3 − 4 − 5, with its many ε leaves and the value annotations 3, −1, −6]

parse-trees like that better be cleaned up as AST

[parse tree for 3 − 4 − 5 under the left-recursive grammar: a left-leaning structure of exp, addop, term, factor, number nodes]

slightly more reasonable looking as an AST (but the underlying grammar is not directly useful for recursive descent)


[parse tree for 3 − 4 − 5 under the flat grammar exp → exp op exp | ( exp ) | number]

That parse tree looks reasonably clear and intuitive.

[two AST sketches for 3 − 4 − 5: one with plain “−” nodes and number leaves, and one where each node additionally records that it is an exp (“exp : −”, “exp : number”)]

The first one has certainly a minimal number of nodes, which is nice as such. However, what is missing (and might be interesting) is the fact that the 2 nodes labelled “−” are expressions!

This is how it’s done (a recipe)

Assume, one has a “non-weird” grammar

exp → exp op exp | ( exp ) | number
op  → + | − | ∗

• typically that means: associativity and precedences etc. are fixed outside the non-weird grammar
  – by massaging it to an equivalent one (no left recursion etc.)
  – or (better): use a parser generator that allows one to specify associativity etc. . . . like “ "∗" binds stronger than "+", it associates to the left . . . ”, without cluttering the grammar.

• if grammar for parsing is not as clear: do a second one describing the ASTs

Remember (independent from parsing)

BNF describes trees


This is how it’s done (recipe for OO data structures)

Recipe

• turn each non-terminal into an abstract class
• turn each right-hand side of a given non-terminal into a (non-abstract) subclass of the class for the considered non-terminal
• choose fields & constructors of the concrete classes appropriately
• terminal: concrete class as well, with a field/constructor for the token’s value

Example in Java

exp → exp op exp | ( exp ) | number
op  → + | − | ∗

abstract public class Exp {
}

public class BinExp extends Exp {            // exp -> exp op exp
    public Exp left, right;
    public Op op;
    public BinExp(Exp l, Op o, Exp r) {
        left = l; op = o; right = r;
    }
}

public class ParentheticExp extends Exp {    // exp -> ( exp )
    public Exp exp;
    public ParentheticExp(Exp e) { exp = e; }
}

public class NumberExp extends Exp {         // exp -> NUMBER
    public int number;                       // token value
    public NumberExp(int i) { number = i; }
}

abstract public class Op {                   // non-terminal = abstract
}

public class Plus extends Op {               // op -> "+"
}

public class Minus extends Op {              // op -> "-"
}

public class Times extends Op {              // op -> "*"
}

The latter classes are perhaps pushing it too far. It’s done to show that one can mechanically use the recipe once the grammar is given, so it’s a clean solution (perhaps one gets better efficiency if one did not make classes/objects out of everything, though).


3− (4− 5)

Exp e = new BinExp(new NumberExp(3),
                   new Minus(),
                   new ParentheticExp(
                       new BinExp(new NumberExp(4),
                                  new Minus(),
                                  new NumberExp(5))));

Pragmatic deviations from the recipe

• it’s nice to have a guiding principle, but no need to carry it too far . . .
• At the very least: the ParentheticExp class is completely without purpose: grouping is captured by the tree structure
⇒ that class is not needed
• some might prefer an implementation of

op → + | − | ∗

as simply integers, for instance arranged like

public class BinExp extends Exp {            // exp -> exp op exp
    public Exp left, right;
    public int op;
    public BinExp(Exp l, int o, Exp r) {
        left = l; op = o; right = r;
    }
    public final static int PLUS = 0, MINUS = 1, TIMES = 2;
}

and used as BinExp.PLUS etc.

Recipe for ASTs, final words:

• space considerations for AST representations are irrelevant in most cases
• clarity and cleanness trump “quick hacks” and “squeezing bits”
• some deviation from the recipe or not, the advice still holds:

Do it systematically

A clean grammar is the specification of the syntax of the language and thus of the parser. It is also a means of communicating to humans what the syntax of the language is, at least communicating with pros, like participants of a compiler course, who of course can read BNF . . . A clean grammar is a very systematic and structured thing which consequently can and should be systematically and cleanly represented in an AST, including a judicious and systematic choice of names and conventions (non-terminal exp represented by class Exp, non-terminal stmt by class Stmt, etc.)


Extended BNF may help alleviate the pain

BNF

exp  → exp addop term | term
term → term mulop factor | factor

EBNF

exp  → term { addop term }
term → factor { mulop factor }

but remember:

• EBNF is just a notation: just because we do not see (left or right) recursion in { . . . } does not mean there is no recursion.
• not all parser generators support EBNF
• however: often easy to translate into loops⁵

• does not offer a general solution if associativity etc. is problematic

Pseudo-code representing the EBNF productions

procedure exp;
begin
    term;                        { recursive call }
    while token = "+" or token = "−"
    do
        match(token);
        term                     // recursive call
    end
end

procedure term;
begin
    factor;                      { recursive call }
    while token = "∗"
    do
        match(token);
        factor                   // recursive call
    end
end

⁵ That results in a parser which is somehow not “pure recursive descent”. It’s “recursive descent, but sometimes, let’s use a while-loop, if more convenient concerning, for instance, associativity”.


How to produce “something” during RD parsing?

Recursive descent

So far (mostly): RD = top-down (parse-)tree traversal via recursive procedures.⁶ Possible outcome: termination or failure.

• Now: instead of returning “nothing” (return type void or similar), return something meaningful, and build that up during the traversal
• for illustration: procedure for expressions:
  – return type int,
  – while traversing: evaluate the expression

Evaluating an exp during RD parsing

function exp() : int;
var temp : int
begin
    temp := term();              { recursive call }
    while token = "+" or token = "−"
        case token of
            "+": match("+");
                 temp := temp + term();
            "−": match("−");
                 temp := temp − term();
        end
    end
    return temp;
end

Building an AST: expression

function exp() : syntaxTree;
var temp, newtemp : syntaxTree
begin
    temp := term();              { recursive call }
    while token = "+" or token = "−"
        case token of
            "+": match("+");
                 newtemp := makeOpNode("+");
                 leftChild(newtemp) := temp;
                 rightChild(newtemp) := term();
                 temp := newtemp;
            "−": match("−");
                 newtemp := makeOpNode("−");
                 leftChild(newtemp) := temp;
                 rightChild(newtemp) := term();
                 temp := newtemp;
        end
    end
    return temp;
end

⁶ Modulo the fact that the tree being traversed is “conceptual” and not the input of the traversal procedure; instead, the traversal is “steered” by the stream of tokens.


• note: the use of temp and the while loop

Building an AST: factor

factor → ( exp ) | number

function factor() : syntaxTree;
var fact : syntaxTree
begin
    case token of
        "(":    match("(");
                fact := exp();
                match(")");
        number: match(number);
                fact := makeNumberNode(number);
        else:   error . . .      // fall through
    end
    return fact;
end

Building an AST: conditionals

if -stmt → if ( exp ) stmt [else stmt]

function ifStmt() : syntaxTree;
var temp : syntaxTree
begin
    match("if");
    match("(");
    temp := makeStmtNode("if");
    testChild(temp) := exp();
    match(")");
    thenChild(temp) := stmt();
    if token = "else"
    then match("else");
         elseChild(temp) := stmt();
    else elseChild(temp) := nil;
    end
    return temp;
end

Building an AST: remarks and “invariant”

• LL(1) requirement: each procedure/function/method (covering one specific non-terminal) decides on alternatives, looking only at the current token
• call of function A for non-terminal A:
  – upon entry: the first terminal symbol for A is in token
  – upon exit: the first terminal symbol after the unit derived from A is in token

• match("a") : checks for "a" in token and eats the token (if matched).


LL(1) parsing

For the rest of the top-down parsing section, we look at a “variation”, not as far as the principle is concerned, but as far as the implementation is concerned. Instead of making a recursive solution, one condenses the relevant information into tabular form. This data structure is called an LL(1) table. That table is easily constructed making use of the First- and Follow-sets, and instead of mutually recursive calls, the algo is iterative, manipulating an explicit stack. As a look forward: also the bottom-up parsers will make use of a table (which then will be an LR-table or one of its variants, not an LL-table).

• remember LL(1) grammars & LL(1) parsing principle:

LL(1) parsing principle

1 look-ahead enough to resolve “which-right-hand-side” non-determinism.

• instead of recursion (as in RD): explicit stack
• decision making: collated into the LL(1) parsing table
• LL(1) parsing table:
  – a finite data structure M (for instance, a 2-dimensional array)

    M : ΣN × ΣT → ((ΣN × Σ∗) + error)

  – M[A, a] = w
• we assume: pure BNF

Often, depending on the book, the entry in the parse table does not contain a full rule as here; only the right-hand side is needed. In that case the table is of type ΣN × ΣT → (Σ∗ + error).

Construction of the parsing table

Table recipe

1. If A → α ∈ P and α ⇒∗ aβ, then add A → α to table entry M[A, a]
2. If A → α ∈ P and α ⇒∗ ε and S $ ⇒∗ βAaγ (where a is a token (= terminal) or $), then add A → α to table entry M[A, a]

Table recipe (again, now using our old friends First and Follow)

Assume A→ α ∈ P .

1. If a ∈ First(α), then add A → α to M[A, a].
2. If α is nullable and a ∈ Follow(A), then add A → α to M[A, a].

The two recipes are equivalent. One can use the recipes to fill out the LL(1) table; we will do that in the following. In case a slot in such a table ends up containing more than one production, that means that the grammar is not LL(1)-parseable, i.e., the LL(1) parsing principle is violated. One may compare that also to Lemma 4.5.1.
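A Python sketch of the second recipe (it assumes the first_sets, follow_sets and first_of_word functions from the earlier sketches; table entries here are whole productions, and a slot with more than one entry signals that the grammar is not LL(1)).

EPS, END = "ε", "$"

def ll1_table(grammar, start):
    nts = set(grammar)
    first = first_sets(grammar)
    follow = follow_sets(grammar, start, first)
    table = {}                                     # maps (A, a) to a set of productions
    for A, rhss in grammar.items():
        for rhs in rhss:
            f = first_of_word(rhs, first, nts)
            for a in f - {EPS}:
                table.setdefault((A, a), set()).add((A, tuple(rhs)))
            if EPS in f:                           # rhs nullable: use Follow(A)
                for a in follow[A]:
                    table.setdefault((A, a), set()).add((A, tuple(rhs)))
    conflicts = {k: v for k, v in table.items() if len(v) > 1}
    return table, conflicts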


Example: if-statements

• the grammar is left-factored and not left-recursive

stmt      → if-stmt | other
if-stmt   → if ( exp ) stmt else-part
else-part → else stmt | ε
exp       → 0 | 1

            First        Follow
stmt        other, if    $, else
if-stmt     if           $, else
else-part   else, ε      $, else
exp         0, 1         )

The slide lists the first and follow sets for all non-terminals (as was the basic definition for those concepts). In the recipe, though, we actually need the first-set of words, namely of the right-hand sides of the productions (for the Follow-set, the definition for non-terminals is good enough). Therefore, one might, before filling out the LL(1)-table, also list the first set of all right-hand sides of the grammar. On the other hand, it’s not a big step, especially in this grammar.

Example: if statement: “LL(1) parse table”

• 2 productions in the “red table entry”
• thus: it’s technically not an LL(1) table (and it’s not an LL(1) grammar)
• note: removing left-recursion and left-factoring did not help!

Saying that it’s “not an LL(1) table” is perhaps a bit nit-picking. The shape is according to the required format. It’s only that in the slot marked red, there are two rules. That’s a conflict and makes it at least not a legal LL(1) table. So, if in an exam question the task is “build the LL(1)-table for the following grammar. Is the grammar LL(1)?”, then one is supposed to fill out a table like that, and then point out if there is a double entry, which is the symptom that the grammar is not LL(1). Similar remarks apply later for LR-parsers. Actually, for LR-parsers, tools

Page 160: CourseScript - uio.no

4 Parsing4.5 LL-parsing (mostly LL(1)) 157

like yacc build up a table (not an LL, but an LR-table) and, in case of double entries, making achoice which one to include. The user, in those cases, will reveive a warning about the grammarcontaining a corresponding conflict. So the user should be aware that the grammar is actually notparseable (because a parse would require backtracking, which is not done). Conflicts are typicallyto be avoided, though upon analyzing it carefully, there may be cases, were one can “live with it”,that the parser makes a particular choice and ignore another. What kind of situations might thatbe? Actually, the one here in the example might be one. The given grammar “suffers” from theambiguity called dangling-else problem. The left-factoring massage did not help there. Anyway,the conflict in the table puts the finger onto that problem: when trying to parse an else-part andseeing the else-keyword next, the top-down parser would not know, if the else belongs to thelast “dangling” conditional or to some older one (if that existed). Typically, the parser wouldchoose the first alternative, i.e., the first production for the else-part. If one is sure of the parser’sbehavior (namely always choosing the first alternative, in case of a conflict) and if one convincesoneself that this is the intended behavior of a dangling-else (in that it should belong to the lastopen conditional), then one may “live with it”. But it’s a bit brittle.

LL(1) table-based algo

while the top of the parsing stack ≠ $
  if the top of the parsing stack is terminal a
     and the next input token = a
  then
     pop the parsing stack;
     advance the input;                 // "match"
  else if the top of the parsing stack is non-terminal A
     and the next input token is a terminal or $
     and parsing table M[A,a] contains production A → X1 X2 . . . Xn
  then (* generate *)
     pop the parsing stack;
     for i := n downto 1 do
        push Xi onto the stack;
  else error
if the top of the stack = $
then accept
end
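For illustration, the same algorithm as a small executable Python sketch (an assumption-laden sketch, not the script's official notation: the table is assumed to map pairs (A, a) to the right-hand side to generate, and the names ll1_parse, nonterminals, etc. are made up here):

# Sketch of the iterative, table-driven LL(1) parser.
def ll1_parse(tokens, table, start, nonterminals):
    tokens = list(tokens) + ["$"]
    stack = ["$", start]                    # top of the stack = end of the list
    pos = 0
    while stack[-1] != "$":
        top, tok = stack[-1], tokens[pos]
        if top not in nonterminals:         # terminal on top: "match"
            if top != tok:
                raise SyntaxError(f"expected {top}, found {tok}")
            stack.pop()
            pos += 1
        elif (top, tok) in table:           # non-terminal on top: "generate"
            stack.pop()
            for sym in reversed(table[(top, tok)]):
                stack.append(sym)
        else:
            raise SyntaxError(f"no table entry for ({top}, {tok})")
    return tokens[pos] == "$"               # accept iff the input is consumed as well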


LL(1): illustration of a run of the algo

The most interesting steps are of course those dealing with the dangling else, namely those with the non-terminal else-part at the top of the stack. That's where the LL(1) table is ambiguous. In principle, with else-part on top of the stack (in the picture it's just L), the parser table always allows the decision that the “current statement” resp. “current conditional” is done.

Expressions

exp → exp addop term | term
addop → + | −
term → term mulop factor | factor
mulop → ∗
factor → ( exp ) | number

left-recursive ⇒ not LL(k)

exp → term exp′
exp′ → addop term exp′ | ε
addop → + | −
term → factor term′
term′ → mulop factor term′ | ε
mulop → ∗
factor → ( exp ) | n


          First         Follow
exp       (, number     $, )
exp′      +, −, ε       $, )
addop     +, −          (, number
term      (, number     $, ), +, −
term′     ∗, ε          $, ), +, −
mulop     ∗             (, number
factor    (, number     $, ), +, −, ∗

Expressions: LL(1) parse table

4.6 Error handling

The error handling section is not part of the pensum (it never was), insofar it will not be asked in the written exam. That does not mean that we don't want some adequate error handling for the compiler in the oblig. The slides are not presented in detail in class. Parsers (and lexers) are built on some robust, established and well-understood theoretical foundations. That's less the case for how to deal with errors, where it's more of an art, and more pragmatics enter the picture. It does not mean it's unimportant; it's just that the topic is less conceptually clarified. So, while certainly there is research, in compilers it's mostly done “by common sense”. Parsers (and compilers) can certainly be tested systematically, finding out if the parser detects all syntactically erroneous situations. Whether the corresponding feedback is useful for debugging is a question of whether humans can make sense of the feedback. Different parser technologies (bottom-up vs. top-down for instance) may have different challenges in providing decent feedback. One core challenge may be the disconnect between the technicalities of the internal workings of the parser (which the programmer may not be aware of) and the source-level representation. A parser runs into trouble, like encountering an unexpected symbol, when currently looking at a field in the LL- or LR-table. That constitutes some “syntactic error” and should be reported, but it's not even clear what the “real cause” of an error is. Error localization as such cannot be formally solved, since one cannot properly define what the source of an error is in general. So, we focus here more on general “advice”.


Error handling

• at the least: give an understandable error message
• give an indication of the line / character or region responsible for the error in the source file
• potentially stop the parsing
• some compilers do error recovery
  – give an understandable error message (as minimum)
  – continue reading, until it's plausible to resume parsing ⇒ find more errors
  – however: when finding at least 1 error: no code generation
  – observation: resuming after a syntax error is not easy

Error messages

• important:
  – try to avoid error messages that only occur because of an already reported error!
  – report an error as early as possible, if possible at the first point where the program cannot be extended to a correct program.
  – make sure that, after an error, one doesn't end up in an infinite loop without reading any input symbols.
• What's a good error message?
  – assume: that the method factor() chooses the alternative ( exp ) but that it, when control returns from method exp(), does not find a )
  – one could report: right parenthesis missing
  – But this may often be confusing, e.g. if the program text is: ( a + b c )
  – here the exp() method will terminate after ( a + b, as c cannot extend the expression. You should therefore rather give the message error in expression or right parenthesis missing.


Handling of syntax errors using recursive descent

Syntax errors with sync stack


Procedures for expression with "error recovery"

4.7 Bottom-up parsing

Bottom-up parsing: intro

"R" stands for right-most derivation.

LR(0)
  • only for very simple grammars
  • approx. 300 states for standard programming languages
  • only as warm-up for SLR(1) and LALR(1)
SLR(1)
  • expressive enough for most grammars for standard PLs
  • same number of states as LR(0)
  • main focus here
LALR(1)
  • slightly more expressive than SLR(1)
  • same number of states as LR(0)
  • we look at the ideas behind that method as well
LR(1)
  • covers all grammars which can in principle be parsed by looking at the next token

There might seem to be a contradiction in the explanation of LR(0): if LR(0) is so weak that it works only for unreasonably simple languages, why do the slides speak about standard languages, and that LR(0) automata for those have 300 states, if one does not use LR(0)? The answer is: the other, more expressive parsers (SLR(1) and LALR(1)) use the same construction of states, so that's why one can estimate the number of states, even if standard languages don't have an LR(0) parser; they may have an LALR(1)-parser, which has, at its core, LR(0)-states.


Grammar classes overview (again)

(Diagram: overview of grammar classes, split into unambiguous and ambiguous; within the unambiguous grammars, the nested classes LR(k), LR(1), LALR(1), SLR, LR(0) and LL(k), LL(1), LL(0).)

LR-parsing and its subclasses

• right-most derivation (but left-to-right parsing)
• in general: bottom-up: more powerful than top-down
• typically: tool-supported (unlike recursive descent, which may well be hand-coded)
• based on parsing tables + explicit stack
• thankfully: left-recursion no longer problematic
• typical tools: yacc and friends (like bison, CUP, etc.)
• another name: shift-reduce parser

(Schematic LR parsing table: rows indexed by states, columns by tokens + non-terminals.)

Example grammar

S′ → S
S → A B t7 | . . .
A → t4 t5 | t1 B | . . .
B → t2 t3 | A t6 | . . .

• assume: grammar unambiguous
• assume word of terminals t1 t2 . . . t7 and its (unique) parse-tree

• general agreement for bottom-up parsing:
  – start symbol never on the right-hand side of a production
  – routinely add another “extra” start-symbol (here S′)


The fact that the start symbol never occurs on the right-hand side of a production will later be relied upon when constructing a DFA for “scanning” the stack, to control the reactions of the stack machine. This restriction leads to a unique, well-defined initial state. Everything goes a bit smoother (and the construction of the LR-automaton is slightly more straightforward) if one obeys that convention.

Parse tree for t1 . . . t7

(Parse tree for t1 . . . t7: S′ → S; S has children A, B, t7; the first A derives t1 B with that B → t2 t3; the second B derives A t6 with that A → t4 t5.)

Remember: parse tree independent from left- or right-most-derivation

LR: left-to right scan, right-most derivation?

Potentially puzzling question at first sight:

what?: right-most derivation, when parsing left-to-right?

• short answer: parser builds the parse tree bottom-up
• derivation:
  – replacement of non-terminals by right-hand sides
  – derivation: builds (implicitly) a parse-tree top-down
• sentential form: word from Σ∗ derivable from the start-symbol

Right-sentential form: right-most derivation

S ⇒∗r α

Slightly longer answer

The LR parser parses from left to right and builds the parse tree bottom-up. When doing the parse, the parser (implicitly) builds a right-most derivation in reverse (because of bottom-up).


Example expression grammar (from before)

exp → exp addop term | term
addop → + | −
term → term mulop factor | factor
mulop → ∗
factor → ( exp ) | number

(4.8)

(Parse tree for number ∗ number: exp → term; term → term mulop factor with mulop = ∗; the left term → factor → number and the right factor → number.)

Bottom-up parse: Growing the parse tree

(The same parse tree, grown bottom-up while the input is reduced:)

number ∗ number ↪→ factor ∗ number
                ↪→ term ∗ number
                ↪→ term ∗ factor
                ↪→ term
                ↪→ exp

The slides show in a series of overlays how the parse-tree is growing and, at the same time, how the word number ∗ number is reduced step by step to the start symbol. That's the reverse direction compared to how one can use grammars to derive words, which corresponds to the direction of how top-down parsers work.

Reduction in reverse = right derivation

Reduction

n ∗ n ↪→ factor ∗ n
      ↪→ term ∗ n
      ↪→ term ∗ factor
      ↪→ term
      ↪→ exp


Right derivation

n ∗ n ⇐r factor ∗ n ⇐r term ∗ n ⇐r term ∗ factor ⇐r term ⇐r exp

• underlined part:
  – different in reduction vs. derivation
  – represents the “part being replaced”
    ∗ for derivation: right-most non-terminal
    ∗ for reduction: indicates the so-called handle (or part of it)
• consequently: all intermediate words are right-sentential forms

Handle

Definition 4.7.1 (Handle). Assume S ⇒∗r αAw ⇒r αβw. A production A → β at position k following α is a handle of αβw. We write 〈A → β, k〉 for such a handle.

Note:

• w (right of a handle) contains only terminals
• w: corresponds to the future input still to be parsed!
• αβ will correspond to the stack content (β the part touched by the reduction step).
• the ⇒r -derivation-step in reverse:
  – one reduce-step in the LR-parser-machine
  – adding (implicitly in the LR-machine) a new parent to children β (= bottom-up!)
• “handle”-part β can be empty (= ε)

Schematic picture of parser machine (again)

(Schematic parser machine: a finite control with states q0, q1, . . . , qn; a reading “head” moving left-to-right over the token stream . . . if 1 + 2 ∗ ( 3 + 4 ) . . . ; and unbounded extra memory in the form of a stack.)


General LR “parser machine” configuration

• stack:
  – contains: terminals + non-terminals (+ $)
  – containing: what has been read already but not yet “processed”
• position on the “tape” (= token stream)
  – represented here as the word of terminals not yet read
  – end of “rest of token stream”: $, as usual
• state of the machine
  – in the following schematic illustrations: not yet part of the discussion
  – later: part of the parser table; currently we explain without referring to the state of the parser-engine
  – currently we assume: tree and rest of the input given
  – the trick ultimately will be: how to achieve the same without that tree already given (just parsing left-to-right)

Schematic run (reduction: from top to bottom)

$                  t1 t2 t3 t4 t5 t6 t7 $
$ t1               t2 t3 t4 t5 t6 t7 $
$ t1 t2            t3 t4 t5 t6 t7 $
$ t1 t2 t3         t4 t5 t6 t7 $
$ t1 B             t4 t5 t6 t7 $
$ A                t4 t5 t6 t7 $
$ A t4             t5 t6 t7 $
$ A t4 t5          t6 t7 $
$ A A              t6 t7 $
$ A A t6           t7 $
$ A B              t7 $
$ A B t7           $
$ S                $
$ S′               $

2 basic steps: shift and reduce

• parser reads input and uses the stack as intermediate storage
• so far: no mention of look-ahead (i.e., action depending on the value of the next token(s)), but that may play a role, as well

Shift

Move the next input symbol (terminal) over to the top of the stack (“push”)

Reduce

Remove the symbols of the right-most subtree from the stack and replace it by the non-terminal at the root of the subtree (replace = “pop + push”).

• decision easy to do if one has the parse tree already!
• reduce step: popped resp. pushed part = right- resp. left-hand side of the handle


The remark that it's “easy to do” refers to something that is illustrated next, namely the decision-making process of the parser: should the parser do a shift or a reduce, and if so, reduce with which rule? If one assumes the “target” parse-tree as already given (as we currently do in our presentation, for instance also in the following slides), then the tree embodies those decisions. Ultimately, of course, the tree is not given a priori; it's the parser's task to build the tree (at least implicitly) by making those decisions about what the next step is (shift or reduce).

Example: LR parse for “+” (given the tree)

E′ → E
E → E + n | n

CST

(Concrete syntax tree for n + n: E′ → E, E → E + n, with the inner E → n.)

Run

    parse stack    input      action
1   $              n + n $    shift
2   $ n            + n $      reduce: E → n
3   $ E            + n $      shift
4   $ E +          n $        shift
5   $ E + n        $          reduce: E → E + n
6   $ E            $          reduce: E′ → E
7   $ E′           $          accept

note: line 3 vs line 6!; both contain E on top of stack

(right) derivation: reduce-steps “in reverse”

E′ ⇒ E ⇒ E + n ⇒ n + n

The example is supposed to shed light on how the machine can make decisions assuming that the tree is already given. For that, one should compare the situations in stage 3 and stage 6. In both situations, the machine has the same stack content (containing only the end-marker and E on top of the stack). However, at stage 3, the machine does a shift, whereas in stage 6, it does a reduce.


Since the stack content (representing the “past” of the parse, i.e., the already processed input) is identical in both cases, the parser machine is necessarily in the same state in both stages, which means it cannot be the state that makes the difference. What then? In the example, the form of the parse tree shows what the parser should do. But of course the tree is not available. Instead (and not surprisingly), if the past input cannot be used to make the distinction, one takes the “future” input. Maybe not all of it, but part of it. That's a form of look-ahead (which will not yet be done for LR(0), as that form is without look-ahead).

Example with ε-transitions: parentheses

S′ → S
S → ( S ) S | ε

side remark: unlike the previous grammar, here:

• production with two non-terminals on the right
  ⇒ difference between left-most and right-most derivations (and mixed ones)

Parentheses: run and right-most derivation

CST

(Concrete syntax tree for ( ): S′ → S, S → ( S ) S, with both inner S's derived to ε.)

Run

    parse stack    input    action
1   $              ( ) $    shift
2   $ (            ) $      reduce: S → ε
3   $ ( S          ) $      shift
4   $ ( S )        $        reduce: S → ε
5   $ ( S ) S      $        reduce: S → ( S ) S
6   $ S            $        reduce: S′ → S
7   $ S′           $        accept

Note: the 2 reduction steps for the ε productions


Right-most derivation and right-sentential forms

S′ ⇒r S ⇒r ( S ) S ⇒r ( S ) ⇒r ( )

Right-sentential forms & the stack

- sentential form: word from Σ∗ derivable from start-symbol

Right-sentential form: right-most derivation

S ⇒∗r α

• right-sentential forms:
  – part of the “run”
  – but: split between stack and input

    parse stack    input      action
1   $              n + n $    shift
2   $ n            + n $      reduce: E → n
3   $ E            + n $      shift
4   $ E +          n $        shift
5   $ E + n        $          reduce: E → E + n
6   $ E            $          reduce: E′ → E
7   $ E′           $          accept

E′ ⇒r E ⇒r E + n ⇒r n + n

n + n ↪→ E + n ↪→ E ↪→ E′

E′ ⇒r E ⇒r E + n |   ∼   E + | n   ∼   E | + n   ⇒r   n | + n   ∼   | n + n

The | here is introduced as “ad-hoc” notation to illustrate the separation between the parse stack on the left and the future input on the right.

Viable prefixes of right-sentential forms and handles

• right-sentential form: E + n
• viable prefixes of the RSF
  – prefixes of that RSF on the stack
  – here: 3 viable prefixes of that RSF: E, E +, E + n
• handle: remember the definition earlier
• here: for instance in the sentential form n + n
  – the handle is the production E → n on the left occurrence of n in n + n (let's write n1 + n2 for now)
  – note: in the stack machine:
    ∗ the left n1 on the stack
    ∗ rest + n2 on the input (unread, because of LR(0))


• if the parser engine detects handle n1 on the stack, it does a reduce-step
• However (later): the reaction depends on the current state of the parser engine

A typical situation during LR-parsing

General design for an LR-engine

• some ingredients clarified up to now:
  – bottom-up tree building as reverse right-most derivation,
  – stack vs. input,
  – shift and reduce steps
• however: 1 ingredient missing: the next step of the engine may depend on
  – top of the stack (“handle”)
  – look-ahead on the input (but not for LR(0))
  – and: current state of the machine (same stack-content, but different reactions at different stages of the parse)

But what are the states of an LR-parser?

General idea:

Construct an NFA (and ultimately DFA) which works on the stack (not the input). The alphabet consists of terminals and non-terminals ΣT ∪ ΣN . The language

Stacks(G) = {α | α may occur on the stack during LR-parsing of a sentence in L(G) }

is regular!

Note that this is a restriction of what a stack-machine (or push-down automaton) can do. As mentioned, exploiting the full power of context-free grammars is impractical, already for the fact that one does not want ambiguity (and non-determinism and backtracking). One further general restriction is that one wants a bounded look-ahead, maybe a look-ahead of one. The restriction here is a kind of strange one, insofar as it does not allow the stack content to be of arbitrary shape: all allowed stack contents (for one grammar) must form a regular language.

On the other hand, the restriction is also kind of natural. Any push-down automaton consists of a stack and a finite-state automaton. It's a natural general restriction that the automaton is deterministic: a given input determines the state the machine is in. Realizing that the stack-content is an “abstract representation” of the past, it's natural that the finite-state automaton is also deterministic wrt. that abstract past. Or, to say it differently: the parser machine has in some way an unbounded memory, the stack. The memory is insofar restricted in that it can be used not via random access, but only via a stack discipline with push and pop (that is inherent to the notion of context-free grammars). Having an infinite memory is fine; one can in principle remember everything (using only push and without ever forgetting anything by using pop). But the machine also has to make decisions based on the past. So for that decision-making part, it cannot make infinitely many different decisions based on infinitely many pasts; relevant are only finitely many different pasts. This is the abstraction built into the stack-memory: doing a push followed by a pop does not change the stack. So both situations have the same stack content, and a past with a history of push and pop is treated the same as if nothing had happened at all. So, it's natural to base the state of the machine, on which the decision is made, on the stack content.

LR(0) parsing as easy pre-stage

• LR(0): in practice too simple, but easy conceptual step towards LR(1), SLR(1) etc.
• LR(1): in practice good enough, LR(k) not used for k > 1
• to build the automaton: LR(0)-items

LR(0) parsing is introduced as an easy pre-stage for the more expressive forms of bottom-up parsing later. In itself, it's not expressive enough to be practically useful. But the construction underlies directly or at least conceptually the more complex parser constructions to come. In particular: for LR(0) parsing, the core of the construction is the so-called LR(0)-DFA, based on LR(0)-items. This construction is directly also used for SLR-parsing. For LR(1) and LALR(1), the construction of the corresponding DFA is not identical, but analogous to the construction of the LR(0)-DFA.

LR(0) items

LR(0) item

production with specific “parser position” . in its right-hand side


• . : “meta-symbol” (not part of the production)

LR(0) item for a production A→ βγ

A→ β.γ

• item with dot at the beginning: initial item
• item with dot at the end: complete item

Example: items of LR-grammar

The next two examples should make the concept of items clear enough. The only point to keep in mind is the treatment of the ε symbol.

Grammar for parentheses: 3 productions

S′ → S
S → ( S ) S | ε

8 items

S′ → .S
S′ → S.
S → . ( S ) S
S → ( . S ) S
S → ( S . ) S
S → ( S ) . S
S → ( S ) S .
S → .

• S → ε gives S → . as item (not S → ε. and S → .ε)

As a side remark for later: it will turn out: grammar is not LR(0).

Another example: items for addition grammar

Grammar for addition: 3 productions

E′ → E
E → E + n | n

(coincidentally also:) 8 items

E′ → .E
E′ → E.
E → .E + n
E → E. + n
E → E + .n
E → E + n.
E → .n
E → n.

Also here, it will turn out: not an LR(0) grammar

Finite automata of items

• general set-up: items as states in an automaton
• automaton: “operates” not on the input, but on the stack
• automaton either
  – first NFA, afterwards made deterministic (subset construction), or
  – directly DFA

States formed of sets of items

In a state marked by/containing item

A→ β.γ

• β on the stack
• γ: to be treated next (terminals on the input, but can also contain non-terminals(!))

The explanation of what an item means as a state of the automaton is conceptual. One piece may be (at the current point) a bit mysterious, resp. does not quite fit: the fact that γ can contain non-terminals. We come to that soon, and will see in later examples what happens.

State transitions of the NFA

• X ∈ Σ
• two kinds of transitions

Terminal or non-terminal

A → α.Xη   --X-->   A → αX.η

ε (for a production X → β)

A → α.Xη   --ε-->   X → .β

• In case X = terminal (i.e. token):
  – the left step corresponds to a shift step
• for non-terminals (see next slide):
  – interpretation more complex: non-terminals are officially never on the input
  – note: in that case, item A → α.Xη has two (kinds of) outgoing transitions

Explanations

We have explained shift steps so far as: the parser eats one terminal (= input token) and pushes it on the stack.

Transitions for non-terminals and ε

• so far: we never pushed a non-terminal from the input to the stack; in a reduce-step we replace the right-hand side by the left-hand side
• but: the replacement in a reduce step can be seen as
  1. pop the right-hand side off the stack,
  2. instead, “assume” the corresponding non-terminal on the input,
  3. eat the non-terminal and push it on the stack.
• two kinds of transitions
• assume production X → β and initial item X → .β

Transitions (repeated)

Terminal or non-terminal

A → α.Xη   --X-->   A → αX.η

Epsilon (X: non-terminal here)

Given production X → β:

A → α.Xη   --ε-->   X → .β


NFA: parentheses

(NFA over the items of the parentheses grammar: the eight items listed above are the states. S-, (- and )-transitions move the dot over the corresponding symbol; ε-transitions lead from every item with the dot in front of S to the initial items S → . ( S ) S and S → . .)

In the figure, we use colors for illustration only, i.e., they are not officially part of the construction. The colors are intended to represent the following:

• “reddish”: complete items
• “blueish”: init-item (less important)
• “violet'ish”: both.

Furthermore, you may notice for the initial items and complete items:

• one initial-item state per production of the grammar
• initial items are where the ε-transitions go into, with the exception of the initial state (with the S′-production)
• no outgoing edges from the complete items.

Note the uniformity of the ε-transitions in the following sense. For each production with a given non-terminal (for instance S in the given example), there is one ingoing ε-transition from each state/item where the . is in front of said non-terminal.

To look forward, and concerning the role of the ε-transitions: those are allowed for non-deterministic automata, but not for DFAs. The underlying construction (discussed later) is building the ε-closure, in this case the closure of the initial item of the start production. If one does that directly, one obtains directly a DFA (as opposed to first building an NFA and making it deterministic in a second phase).

Initial and final states

initial states:

• we made our lives easier: assume one extra start symbol, say S′ (augmented grammar)
  ⇒ initial item S′ → .S as (only) initial state

final states:

acceptance condition of the overall machine: a bit more complex

• input must be empty
• stack must be empty except for the (new) start symbol
• the NFA has a word to say about acceptance
  – but not in the form of being in an accepting state
  – so: no accepting states
  – but: an accepting action (see later)

The NFA (or later DFA) has a specific task: it is used to “scan” the stack (at least conceptually), not the input. The automaton is not so much for accepting a stack and then stopping; it's more about determining the state that corresponds to the current stack content. Therefore there are no accepting states in the sense of an FSA!

NFA: addition

(NFA over the items of the addition grammar: the eight items listed above are the states; E-, +- and n-transitions move the dot over the corresponding symbol, and ε-transitions lead from items with the dot in front of E to the initial items E → .E + n and E → .n.)

Determinizing: from NFA to DFA

• standard subset-construction7
• states then contain sets of items
• important: ε-closure
• also: direct construction of the DFA possible

In the following two slides, we show the DFAs corresponding to the NFAs shown before. For the construction on how to determinize NFAs (and minimize them), we refer to the corresponding sections in the chapter about lexing. Anyway, we will afterwards also look at a direct construction of the DFA (without the detour over NFAs). That will result in the same automata anyway.

7Technically, we don’t require here a total transition function, we leave out any error state.


DFA: parentheses

State 0:  S′ → .S ;  S → . ( S ) S ;  S → .
State 1:  S′ → S.
State 2:  S → ( . S ) S ;  S → . ( S ) S ;  S → .
State 3:  S → ( S . ) S
State 4:  S → ( S ) . S ;  S → . ( S ) S ;  S → .
State 5:  S → ( S ) S .

Transitions: 0 --S--> 1, 0 --(--> 2, 2 --(--> 2, 2 --S--> 3, 3 --)--> 4, 4 --(--> 2, 4 --S--> 5

DFA: addition

State 0:  E′ → .E ;  E → .E + n ;  E → .n
State 1:  E′ → E. ;  E → E. + n
State 2:  E → n.
State 3:  E → E + .n
State 4:  E → E + n.

Transitions: 0 --E--> 1, 0 --n--> 2, 1 --+--> 3, 3 --n--> 4

Direct construction of an LR(0)-DFA

• quite easy: just build in the closure directly. . .

ε-closure

• if A → α.Bγ is an item in a state, and
• there are productions B → β1 | β2 . . ., then
• add the items B → .β1 , B → .β2 . . . to the state
• continue that process until saturation

initial state

S′ → .S

plus closure


Direct DFA construction: transitions

pre-state (contains, among others):   A1 → α1.Xβ1 ,  A2 → α2.Xβ2 , . . .

   --X-->

post-state:   A1 → α1X.β1 ,  A2 → α2X.β2 , plus closure

• X: terminal or non-terminal, both treated uniformly
• all items of the form A → α.Xβ must be included in the post-state
• and all others (indicated by “. . . ”) in the pre-state: not included

One can re-check the previous examples (first doing the NFA, then the DFA): the outcome is the same.
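As a small illustration of the direct construction, here is a Python sketch (only a sketch, with made-up names; the grammar encoding and the item representation (A, rhs, dot) are assumptions made for this illustration, not the script's notation) of the ε-closure and the transition function:

# Sketch of the direct LR(0)-DFA construction (closure + transitions).
# grammar: maps a non-terminal to a list of right-hand sides (tuples of symbols).
def closure(items, grammar):
    result = set(items)
    changed = True
    while changed:                                   # continue until saturation
        changed = False
        for (A, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in grammar:   # dot in front of a non-terminal B
                B = rhs[dot]
                for beta in grammar[B]:
                    item = (B, tuple(beta), 0)           # add initial item B -> .beta
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(state, X, grammar):
    # move the dot over X in every item A -> alpha.X beta, then build the closure
    moved = {(A, rhs, dot + 1)
             for (A, rhs, dot) in state
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved, grammar) if moved else None

# Example: the parentheses grammar S' -> S, S -> ( S ) S | epsilon
grammar = {"S'": [("S",)], "S": [("(", "S", ")", "S"), ()]}
state0 = closure({("S'", ("S",), 0)}, grammar)
state2 = goto(state0, "(", grammar)                  # corresponds to state 2 above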

How does the DFA do the shift/reduce and the rest?

• we have seen: bottom-up parse tree generation
• we have seen: shift-reduce and the stack vs. input
• we have seen: the construction of the DFA

But: how does it hang together?

We need to interpret the “set-of-item-states” in the light of the stack content and figure out the reaction in terms of

• transitions in the automaton
• stack manipulations (shift/reduce)
• acceptance
• input (apart from shifting) not relevant when doing LR(0)

and the reaction better be uniquely determined . . .

Stack contents and state of the automaton

• remember: at any configuration of stack/input in a run
  1. the stack contains words from Σ∗
  2. the DFA operates deterministically on such words
• the stack contains an “abstraction of the past”:
• when feeding that “past” on the stack into the automaton
  – starting with the oldest symbol (not in a LIFO manner)
  – starting with the DFA's initial state
  ⇒ stack content determines the state of the DFA
• actually: each prefix also determines uniquely a state
• top state:
  – state after the complete stack content
  – corresponds to the current state of the stack-machine
  ⇒ crucial when determining the reaction

State transition allowing a shift

• assume: top-state (= current state) contains item

X → α.aβ

• construction thus has transition as follows

state s (containing X → α.aβ, among others)   --a-->   state t (containing X → αa.β, among others)

• shift is possible
• if shift is the correct operation and a is the terminal symbol corresponding to the current token: state afterwards = t

State transition: analogous for non-term’s

Production

X → α.Bβ

Transition

state s (containing X → α.Bβ)   --B-->   state t (containing X → αB.β)

Rest

• “goto = shift for non-terms”
• intuition: “second half of a reduce step”
• same as before, now with non-terminal B


• note: we never read a non-terminal from the input
• not officially called a shift
• corresponds to the reaction following a reduce step; it's not the reduce step itself
• think of the reduce
  – not as: replace on top of the stack the handle (right-hand side) by the non-terminal B,
  – but instead as:
    1. pop off the handle from the top of the stack
    2. put the non-terminal B “back onto the input” (corresponding to the above states)
    3. eat the B and “shift” it to the stack
• later: a goto reaction in the parse table

State (not transition) where a reduce is possible

• remember: complete items
• assume top state s containing the complete item A → γ.

state s contains (among others):   A → γ.

• a complete right-hand side (“handle”) γ on the stack, and thus done
• may be replaced by the left-hand side A ⇒ reduce step
• builds up (implicitly) the new parent node A in the bottom-up procedure
• Note: A on top of the stack instead of γ:
  – new top state!
  – remember the “goto-transition” (shift of a non-terminal)

A conceptual picture for the reduce step is as follows. As said, we remove the handle from the stack and “pretend” as if the A were next on the input, and thus we “shift” it on top of the stack, doing the corresponding A-transition.

Remarks: states, transitions, and reduce steps

• ignoring the ε-transitions (for the NFA)
• there are 2 “kinds” of transitions in the DFA
  1. terminals: real shifts
  2. non-terminals: “following a reduce step”

No edges to represent (all of) a reduce step!

• if a reduce happens, the parser engine changes state!
• however: this state change is not represented by a transition in the DFA (or NFA for that matter)
• especially not by outgoing edges of complete items
• if the (rhs of the) handle is removed from the top of the stack ⇒
  – “go back to the (top) state before that handle had been added”: no edge for that
• later: the stack notation simply remembers the state as part of its configuration

Example: LR parsing for addition (given the tree)

E′ → E
E → E + n | n

CST

(Concrete syntax tree for n + n as before.)

Run

    parse stack    input      action
1   $              n + n $    shift
2   $ n            + n $      reduce: E → n
3   $ E            + n $      shift
4   $ E +          n $        shift
5   $ E + n        $          reduce: E → E + n
6   $ E            $          reduce: E′ → E
7   $ E′           $          accept

note: line 3 vs line 6!; both contain E on top of stack

This is a revisit of an example resp. slide from earlier, when we discussed how a parser can make decisions, resp. that it would be easy to make those decisions for the parser machine if it had the tree already. Unfortunately, the tree is not available; the only thing it has is “the past”, which is represented (partially) by the stack content. As discussed earlier, interesting in the run are stage 3 and stage 6, which have the same stack content, which also means the parser is in the same state of its LR(0)-DFA. With the automaton constructed as before, that's state 1. State 1 is important, as it illustrates a shift/reduce conflict. Remember: reduce-steps are not represented in the LR(0)-automaton via transitions. They are only implicitly represented by complete items. Thus, a shift/reduce conflict is not characterized by 2 outgoing edges. It's one outgoing edge from a state containing a complete item.

Earlier we hinted that an automaton could make decisions based on a look-ahead. That is not yet done: the LR(0) automaton, in state 1 especially, can do a reduce step or a shift step, which constitutes the conflict. Later we will see under which circumstances looking at the “next symbol” can help to make the decision. That leads to SLR parsing (or, even later, to LR(1)/LALR(1)). In the particular situation of state 1 in the example, the next possible symbol would be + or else $.

DFA of addition example

(The LR(0)-DFA of the addition grammar as constructed before: states 0 to 4, with 0 --E--> 1, 0 --n--> 2, 1 --+--> 3, 3 --n--> 4. State 1 contains E′ → E. and E → E. + n.)

• note line 3 vs. line 6
• both stacks = E ⇒ same (top) state in the DFA (state 1)

The point being made when looking at state 1 is the following: the state is a complete state (a state containing a complete item). Besides that, there is an outgoing edge. That means that in that state two reactions are possible: a shift (following the edge) and a reduce, as indicated by the complete item. That indicates a conflict situation, especially if we don't make use of look-aheads, as we do currently when discussing LR(0). The conflict situation is called, not surprisingly, a “shift-reduce conflict”, more precisely an LR(0) shift/reduce conflict. The qualification LR(0) is necessary, as sometimes a closer look at the situation, taking a look-ahead into account, may defuse the conflict. Those more fine-grained considerations will lead to extensions of plain LR(0)-parsing (like SLR(1), or LR(1) and LALR(1)).

LR(0) grammars

LR(0) grammar

The top-state alone determines the next step.

• especially: no shift/reduce conflicts in the form shown
• thus: the previous addition-grammar is not LR(0)

Simple parentheses

A → (A ) | a


DFA

State 0:  A′ → .A ;  A → . ( A ) ;  A → .a
State 1:  A′ → A.
State 2:  A → a.
State 3:  A → ( . A ) ;  A → . ( A ) ;  A → .a
State 4:  A → ( A . )
State 5:  A → ( A ) .

Transitions: 0 --A--> 1, 0 --a--> 2, 0 --(--> 3, 3 --(--> 3, 3 --a--> 2, 3 --A--> 4, 4 --)--> 5

Simple parentheses is LR(0)

DFA

(Same DFA as above.)

Remarks

state   possible action
0       only shift
1       only reduce (A′ → A)
2       only reduce (A → a)
3       only shift
4       only shift
5       only reduce (A → ( A ))


NFA for simple parentheses (bonus slide)

(NFA for the simple parentheses: the items A′ → .A, A′ → A., A → . ( A ), A → .a, A → ( . A ), A → ( A . ), A → a., A → ( A ) . as states, with ε-transitions from items with the dot in front of A to the initial items A → . ( A ) and A → .a.)

For completeness' sake: that's the NFA for the “simple parentheses”.

Parsing table for an LR(0) grammar

• table structure: slightly different for SLR(1), LALR(1), and LR(1) (see later)
• note: the “goto” part: “shift” on non-terminals (only 1 non-terminal A here)
• corresponding to the A-labelled transitions

state   action   rule           input               goto
                                (     a      )      A
0       shift                   3     2             1
1       reduce   A′ → A
2       reduce   A → a
3       shift                   3     2             4
4       shift                                5
5       reduce   A → ( A )

Parsing of ( ( a ) )

stage   parsing stack       input          action
1       $0                  ( ( a ) ) $    shift
2       $0 (3               ( a ) ) $      shift
3       $0 (3 (3            a ) ) $        shift
4       $0 (3 (3 a2         ) ) $          reduce A → a
5       $0 (3 (3 A4         ) ) $          shift
6       $0 (3 (3 A4 )5      ) $            reduce A → ( A )
7       $0 (3 A4            ) $            shift
8       $0 (3 A4 )5         $              reduce A → ( A )
9       $0 A1               $              accept

• note: stack on the left
  – contains top state information
  – in particular: overall top state at the right-most end
• note also: accept action
  – reduce wrt. A′ → A and
  – empty stack (apart from $, A, and the state annotation)
  ⇒ accept

The left-most column is just line numbers (“stage” of the computation), it’s not the state.


Parse tree of the parse

(Parse tree of the parse: A′ → A, A → ( A ), inner A → ( A ), innermost A → a.)

• As said:
  – the reduction “contains” the parse-tree
  – reduction: builds it bottom-up
  – reduction in reverse: contains a right-most derivation (which is “top-down”)
• accept action: corresponds to the parent-child edge A′ → A of the tree

Parsing of erroneous input

• empty slots in the table: “errors”

stage   parsing stack      input        action
1       $0                 ( ( a ) $    shift
2       $0 (3              ( a ) $      shift
3       $0 (3 (3           a ) $        shift
4       $0 (3 (3 a2        ) $          reduce A → a
5       $0 (3 (3 A4        ) $          shift
6       $0 (3 (3 A4 )5     $            reduce A → ( A )
7       $0 (3 A4           $            ????

stage   parsing stack      input        action
1       $0                 ( ) $        shift
2       $0 (3              ) $          ?????

Invariant

important general invariant for LR-parsing: never shift something “illegal” onto the stack

LR(0) parsing algo, given DFA

let s be the current state, on top of the parse stack

1. s contains A → α.Xβ, where X is a terminal
   • shift X from the input to the top of the stack. The new state pushed on the stack: state t where s --X--> t
   • else: if s does not have such a transition: error
2. s contains a complete item (say A → γ.): reduce by rule A → γ:
   • a reduction by S′ → S: accept, if the input is empty; else: error
   • else:
     pop: remove γ (including “its” states) from the stack
     back up: assume to be in state u, which is now the head state
     push: push A to the stack, new head state t where u --A--> t (in the DFA)
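For illustration, here is a minimal Python sketch of that algorithm, hard-wired to the table for A → ( A ) | a shown above (the encoding of the table into ACTION/GOTO dictionaries and the function name lr0_parse are assumptions made for this sketch):

# Sketch of the LR(0) driver for the simple parentheses grammar.
# "shift" states consult the combined input/goto columns; "reduce" states carry
# the production (lhs, rhs) to reduce with.
ACTION = {0: "shift", 1: ("A'", ["A"]), 2: ("A", ["a"]),
          3: "shift", 4: "shift", 5: ("A", ["(", "A", ")"])}
GOTO = {(0, "("): 3, (0, "a"): 2, (0, "A"): 1,
        (3, "("): 3, (3, "a"): 2, (3, "A"): 4,
        (4, ")"): 5}

def lr0_parse(tokens):
    tokens = list(tokens) + ["$"]
    stack = [0]                                  # stack of states (symbols left implicit)
    pos = 0
    while True:
        s = stack[-1]
        if ACTION[s] == "shift":                 # shift the next token, push its state
            t = GOTO.get((s, tokens[pos]))
            if t is None:
                raise SyntaxError(f"unexpected token {tokens[pos]}")
            stack.append(t)
            pos += 1
        else:                                    # reduce by lhs -> rhs
            lhs, rhs = ACTION[s]
            del stack[len(stack) - len(rhs):]    # pop |rhs| states
            if lhs == "A'":                      # reduction by the start production
                if tokens[pos] == "$":
                    return True                  # accept
                raise SyntaxError("input left over")
            stack.append(GOTO[(stack[-1], lhs)]) # "goto": shift the non-terminal

lr0_parse(["(", "(", "a", ")", ")"])             # reproduces the run shown above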


DFA parentheses again: LR(0)?

S′ → S
S → ( S ) S | ε

(The LR(0)-DFA for the parentheses grammar as constructed before: states 0 to 5. States 0, 2, and 4 each contain the complete item S → . together with S → . ( S ) S, and each has an outgoing (-transition.)

Look at states 0, 2, and 4

DFA addition again: LR(0)?

E′ → E
E → E + n | n

(The LR(0)-DFA for the addition grammar as before; state 1 contains the complete item E′ → E. together with E → E. + n and an outgoing +-transition.)

How to make a decision in state 1?


Decision? If only we knew the ultimate tree already (especially the parts still to come) . . .

CST

(Concrete syntax tree for n + n as before.)

Run

    parse stack    input      action
1   $              n + n $    shift
2   $ n            + n $      reduce: E → n
3   $ E            + n $      shift
4   $ E +          n $        shift
5   $ E + n        $          reduce: E → E + n
6   $ E            $          reduce: E′ → E
7   $ E′           $          accept

• current stack: represents the already known part of the parse tree
• since we don't have the future parts of the tree yet:
  ⇒ look-ahead on the input (without building the tree yet)
• LR(1) and its variants: look-ahead of 1 (= look at the current type of the token)

Addition grammar (again)

(The LR(0)-DFA for the addition grammar once more; the question concerns state 1.)

• How to make a decision in state 1? (here: shift vs. reduce)
  ⇒ look at the next input symbol (in the token)


One look-ahead

• LR(0): not useful, too weak
• add look-ahead, here of 1 input symbol (= token)
• different variations of that idea (with slight differences in expressiveness)
• tables slightly changed (compared to LR(0))
• but: still can use the LR(0)-DFAs

Resolving LR(0) reduce/reduce conflicts

LR(0) reduce/reduce conflict:

a state containing (among others) the two complete items   A → α.   and   B → β.

SLR(1) solution: use follow sets of non-terms

• If Follow(A) ∩ Follow(B) = ∅
  ⇒ the next symbol (in the token) decides!
  – if token ∈ Follow(A) then reduce using A → α
  – if token ∈ Follow(B) then reduce using B → β
  – . . .

Resolving LR(0) shift/reduce conflicts

LR(0) shift/reduce conflict:

a state containing the complete item A → α. together with items B1 → β1.b1 γ1 and B2 → β2.b2 γ2 (i.e., with outgoing transitions labelled b1 and b2)

SLR(1) solution: again: use follow sets of non-terms

• If Follow(A) ∩ {b1, b2, . . .} = ∅
  ⇒ the next symbol (in the token) decides!
  – if token ∈ Follow(A) then reduce using A → α; the non-terminal A determines the new top state
  – if token ∈ {b1, b2, . . .} then shift; the input symbol bi determines the new top state
  – . . .


SLR(1) requirement on states (as in the book)

• formulated as conditions on the states (of LR(0)-items)
• given the LR(0)-item DFA as defined

SLR(1) condition, on all states s

1. For any item A → α.Xβ in s with X a terminal, there is no complete item B → γ. in s with X ∈ Follow(B).
2. For any two complete items A → α. and B → β. in s, Follow(A) ∩ Follow(B) = ∅.
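A small Python sketch of this check (illustrative only; an LR(0)-item is assumed to be encoded as (A, rhs, dot), follow maps a non-terminal to its Follow-set, and terminals is the set of terminal symbols):

# Sketch: SLR(1) condition check for one state of LR(0)-items.
def slr1_conflicts(state, follow, terminals):
    conflicts = []
    complete = [(A, rhs) for (A, rhs, dot) in state if dot == len(rhs)]
    shiftable = {rhs[dot] for (A, rhs, dot) in state
                 if dot < len(rhs) and rhs[dot] in terminals}
    # condition 1: no terminal both shiftable and in Follow of a completed non-terminal
    for (B, _) in complete:
        overlap = shiftable & follow[B]
        if overlap:
            conflicts.append(("shift/reduce", B, overlap))
    # condition 2: Follow-sets of two completed non-terminals must be disjoint
    for i, (A, _) in enumerate(complete):
        for (B, _) in complete[i + 1:]:
            overlap = follow[A] & follow[B]
            if overlap:
                conflicts.append(("reduce/reduce", (A, B), overlap))
    return conflicts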

Revisit addition one more time

(The LR(0)-DFA for the addition grammar once more; the conflict state is state 1 with E′ → E. and E → E. + n.)

• Follow(E′) = {$}
  ⇒ – shift for +
    – reduce with E′ → E for $ (which corresponds to accept, in case the input is empty)

SLR(1) algo

let s be the current state, on top of the parse stack

1. s contains A → α.Xβ, where X is a terminal and X is the next token on the input, then
   • shift X from the input to the top of the stack. The new state pushed on the stack: state t where s --X--> t (footnote 8)
2. s contains a complete item (say A → γ.) and the next token in the input is in Follow(A): reduce by rule A → γ:
   • a reduction by S′ → S: accept, if the input is empty (footnote 9)
   • else:
     pop: remove γ (including “its” states) from the stack
     back up: assume to be in state u, which is now the head state
     push: push A to the stack, new head state t where u --A--> t
3. if the next token is such that neither 1. nor 2. applies: error

Footnote 8: Cf. the LR(0) algo: since we checked the existence of the transition before, the else-part is missing now.
Footnote 9: Cf. the LR(0) algo: this happens now only if the next token is $. Note that the follow set of S′ in the augmented grammar is always only $.


Parsing table for SLR(1)

(The LR(0)-DFA for the addition grammar, states 0 to 4, as before.)

state   input                                         goto
        n       +                 $                   E
0       s:2                                           1
1               s:3               accept
2               r: (E → n)        r: (E → n)
3       s:4
4               r: (E → E + n)    r: (E → E + n)

for states 2 and 4: n ∉ Follow(E)

Parsing table: remarks

• SLR(1) parsing table: rather similar-looking to the LR(0) one
• differences: reflect the differences in the LR(0)-algo vs. the SLR(1)-algo
• same number of rows in the table (= same number of states in the DFA)
• only: columns “arranged” differently
  – LR(0): each state uniformly: either shift or else reduce (with a given rule)
  – now: non-uniform, dependent on the input. But that does not apply to the previous example. We'll see that in the next one, then.
• it should be obvious:
  – SLR(1) may resolve LR(0) conflicts
  – but: if the follow-set conditions are not met: SLR(1) reduce/reduce and/or SLR(1) shift/reduce conflicts
  – these would result in non-unique entries in the SLR(1)-table10

SLR(1) parser run (= “reduction”)

state   input                                         goto
        n       +                 $                   E
0       s:2                                           1
1               s:3               accept
2               r: (E → n)        r: (E → n)
3       s:4
4               r: (E → E + n)    r: (E → E + n)

10 by which it, strictly speaking, would no longer be an SLR(1)-table :-)


stage   parsing stack       input          action
1       $0                  n + n + n $    shift: 2
2       $0 n2               + n + n $      reduce: E → n
3       $0 E1               + n + n $      shift: 3
4       $0 E1 +3            n + n $        shift: 4
5       $0 E1 +3 n4         + n $          reduce: E → E + n
6       $0 E1               + n $          shift: 3
7       $0 E1 +3            n $            shift: 4
8       $0 E1 +3 n4         $              reduce: E → E + n
9       $0 E1               $              accept

Corresponding parse tree

(Parse tree: E′ → E; E → E + n applied twice, innermost E → n; frontier n + n + n.)

Revisit the parentheses again: SLR(1)?

Grammar: parentheses

S′ → S
S → ( S ) S | ε

Follow set

Follow(S) = {),$}


DFA for parentheses

(The LR(0)-DFA for the parentheses grammar, states 0 to 5, as shown before.)

SLR(1) parse table

state   input                                                    goto
        (       )                  $                             S
0       s:2     r: S → ε           r: S → ε                      1
1                                  accept
2       s:2     r: S → ε           r: S → ε                      3
3               s:4
4       s:2     r: S → ε           r: S → ε                      5
5               r: S → ( S ) S     r: S → ( S ) S

Parentheses: SLR(1) parser run (= “reduction”)

(SLR(1) parse table as above.)


stage   parsing stack                input        action
1       $0                           ( ) ( ) $    shift: 2
2       $0 (2                        ) ( ) $      reduce: S → ε
3       $0 (2 S3                     ) ( ) $      shift: 4
4       $0 (2 S3 )4                  ( ) $        shift: 2
5       $0 (2 S3 )4 (2               ) $          reduce: S → ε
6       $0 (2 S3 )4 (2 S3            ) $          shift: 4
7       $0 (2 S3 )4 (2 S3 )4         $            reduce: S → ε
8       $0 (2 S3 )4 (2 S3 )4 S5      $            reduce: S → ( S ) S
9       $0 (2 S3 )4 S5               $            reduce: S → ( S ) S
10      $0 S1                        $            accept

Remarks

Note how the stack grows, and would continue to grow if the sequence of ( ) continued. That's characteristic for a right-recursive formulation of rules, and may constitute a problem for LR-parsing (stack overflow).

Ambiguity & LR-parsing

• LR(k) (and LL(k)) grammars: unambiguous
• definition/construction: free of shift/reduce and reduce/reduce conflicts (given the chosen level of look-ahead)
• However: an ambiguous grammar is tolerable, if the (remaining) conflicts can be solved “meaningfully”

otherwise:

Additional means of disambiguation:

1. by specifying associativity / precedence “externally”
2. by “living with the fact” that the LR parser (commonly) prioritizes shifts over reduces

• for the second point (“let the parser decide according to its preferences”):
  – use sparingly and cautiously
  – typical example: dangling-else
  – even if the parser makes a decision, the programmer may or may not “understand intuitively” the resulting parse tree (and thus AST)
  – grammar with many S/R-conflicts: go back to the drawing board

Example of an ambiguous grammar

stmt → if-stmt | other
if-stmt → if ( exp ) stmt
        | if ( exp ) stmt else stmt
exp → 0 | 1

In the following, E for exp, etc.


Simplified conditionals

Simplified “schematic” if-then-else

S → I | other
I → if S | if S else S

Follow-sets

        Follow
S′      {$}
S       {$, else}
I       {$, else}

• since ambiguous: at least one conflict must be somewhere

DFA of LR(0) items

Checking the previously shown conditions for SLR(1) parsing, one sees that there is an SLR(1) conflict in state 5: the follow-set of I contains else. In the following tables, only the shift-reaction is added in the corresponding slot (not both shift and reduce actions), since that is the default reaction of a parser tool when facing a shift-reduce conflict.


Simple conditionals: parse table

Grammar

S → I              (1)
  | other           (2)
I → if S            (3)
  | if S else S     (4)

SLR(1)-parse-table, conflict “resolved”

state   input                                      goto
        if      else    other    $                 S    I
0       s:4             s:3                        1    2
1                                accept
2               r:1              r:1
3               r:2              r:2
4       s:4             s:3                        5    2
5               s:6              r:3
6       s:4             s:3                        7    2
7               r:4              r:4

• shift-reduce conflict in state 5: reduce with rule 3 vs. shift (to state 6)
• the conflict there: resolved in favor of the shift to 6
• note: the extra start state is left out of the table

Parser run (= reduction)

Parser run, different choice

(Parse table as above.)


stage   parsing stack                  input                       action
1       $0                             if if other else other $    shift: 4
2       $0 if4                         if other else other $       shift: 4
3       $0 if4 if4                     other else other $          shift: 3
4       $0 if4 if4 other3              else other $                reduce: 2
5       $0 if4 if4 S5                  else other $                reduce: 3
6       $0 if4 I2                      else other $                reduce: 1
7       $0 if4 S5                      else other $                shift: 6
8       $0 if4 S5 else6                other $                     shift: 3
9       $0 if4 S5 else6 other3         $                           reduce: 2
10      $0 if4 S5 else6 S7             $                           reduce: 4
11      $0 S1                          $                           accept

Parse trees for the “simple conditions”

shift-precedence: conventional

(Parse tree for if if other else other in which the else belongs to the inner if.)

“wrong” tree

(Parse tree for the same input in which the else belongs to the outer if.)

standard “dangling else” convention

“an else belongs to the last previous, still open (= dangling) if-clause”

The example serves two purposes: for one, it sheds light on how the dangling-else problem can be “solved” by preferring a shift over a reduce reaction. More generally, it should give (using that standard situation) a feeling for how a shift-vs-reduce decision changes the structure of the parse-tree (and indirectly most probably thereby also the AST). It's an issue of associativity and precedence (at least when dealing with binary operators), and we will see that in the following standard setting of expressions.

Use of ambiguous grammars

• advantage of ambiguous grammars: often simpler
• if ambiguous: grammar guaranteed to have conflicts
• can (often) be resolved by specifying precedence and associativity
• supported by tools like yacc and CUP . . .

E′ → E
E → E + E | E ∗ E | n

DFA for + and ×

State 0:  E′ → .E ;  E → .E + E ;  E → .E ∗ E ;  E → .n
State 1:  E′ → E. ;  E → E. + E ;  E → E. ∗ E
State 2:  E → n.
State 3:  E → E + .E ;  E → .E + E ;  E → .E ∗ E ;  E → .n
State 4:  E → E ∗ .E ;  E → .E + E ;  E → .E ∗ E ;  E → .n
State 5:  E → E + E. ;  E → E. + E ;  E → E. ∗ E
State 6:  E → E ∗ E. ;  E → E. + E ;  E → E. ∗ E

Transitions: 0 --E--> 1, 0 --n--> 2, 1 --+--> 3, 1 --∗--> 4, 3 --n--> 2, 3 --E--> 5, 4 --n--> 2, 4 --E--> 6, 5 --+--> 3, 5 --∗--> 4, 6 --+--> 3, 6 --∗--> 4

States with conflicts

• state 5
  – stack contains . . . E + E
  – for input $: reduce, since no shift is possible for $
  – for input +: reduce, as + is left-associative
  – for input ∗: shift, as ∗ has precedence over +
• state 6:
  – stack contains . . . E ∗ E
  – for input $: reduce, since no shift is possible for $
  – for input +: reduce, as ∗ has precedence over +
  – for input ∗: reduce, as ∗ is left-associative
• see also the table on the next slide

Parse table + and ×

state   input                                                          goto
        n       +                ∗                $                    E
0       s:2                                                            1
1               s:3              s:4              accept
2               r: E → n         r: E → n         r: E → n
3       s:2                                                            5
4       s:2                                                            6
5               r: E → E + E     s:4              r: E → E + E
6               r: E → E ∗ E     r: E → E ∗ E     r: E → E ∗ E

How about exponentiation (written ↑ or ∗∗)?

Defined as right-associative. See exercise

An interesting line is the one for state 5, and the difference in reaction when encountering an addition vs. a multiplication sign. Basically, the shift for multiplication realizes the fact that multiplication has a higher precedence than addition.

Compare: unambiguous grammar for + and ∗

Unambiguous grammar: precedence and left-assoc built in

E′ → E
E → E + T | T
T → T ∗ n | n

        Follow
E′      {$}          (as always for the start symbol)
E       {$, +}
T       {$, +, ∗}


DFA for unambiguous + and ×

State 0:  E′ → .E ;  E → .E + T ;  E → .T ;  T → .T ∗ n ;  T → .n
State 1:  E′ → E. ;  E → E. + T
State 2:  E → E + .T ;  T → .T ∗ n ;  T → .n
State 3:  T → n.
State 4:  E → T. ;  T → T. ∗ n
State 5:  T → T ∗ .n
State 6:  E → E + T. ;  T → T. ∗ n
State 7:  T → T ∗ n.

Transitions: 0 --E--> 1, 0 --T--> 4, 0 --n--> 3, 1 --+--> 2, 2 --T--> 6, 2 --n--> 3, 4 --∗--> 5, 5 --n--> 7, 6 --∗--> 5

DFA remarks

• the DFA now is SLR(1)
  – check the states with complete items:
    state 1:     Follow(E′) = {$}
    state 4:     Follow(E) = {$, +}
    state 6:     Follow(E) = {$, +}
    states 3/7:  Follow(T) = {$, +, ∗}
  – in no case is there a shift/reduce conflict (check the outgoing edges vs. the follow set)
  – there's no reduce/reduce conflict either

LR(1) parsing

• most general form of LR(1) parsing
• aka: canonical LR(1) parsing
• usually: considered unnecessarily “complex” (i.e. LALR(1) or similar is good enough)
• “stepping stone” towards LALR(1)

Basic restriction of SLR(1)

Uses look-ahead, yes, but only after it has built a non-look-ahead DFA (based on LR(0)-items)


A help to remember

SLR(1) is “improved” LR(0) parsing; LALR(1) is “crippled” LR(1) parsing.

Limits of SLR(1) grammars

Assignment grammar fragment11

stmt → call-stmt | assign-stmt
call-stmt → identifier
assign-stmt → var := exp
var → [ exp ] | identifier
exp → var | n

Assignment grammar fragment, simplified

S → id | V := E
V → id
E → V | n

The problematic situation, as we will see on the next slide, concerns identifiers (resp. variables as left-hand side of an assignment or as a call expression).

11Inspired by Pascal, analogous problems in C . . .


non-SLR(1): Reduce/reduce conflict

Initial state:  S′ → .S ;  S → .id ;  S → .V := E ;  V → .id
  --id-->  state containing:  S → id.  and  V → id.

The same situation, with the follow symbols annotated (shown in red on the slide):
  S → id.   $
  V → id.   $, :=

        First     Follow
S       id        $
V       id        $, :=
E       id, n     $

Checking the previously shown conditions for SLR(1)-parsing shows (amongst others) a reduce/reduce conflict situation in the state on the right-hand side. The R/R conflict is on the symbol $: the parser does not know which production to use in the reduce step. The red terminals are not part of the state; they are just shown for illustration (representing the follow symbols of S resp. of V ). The LR(1) construction (sketched on the next slides) builds one additional look-ahead symbol officially into the items and thus the states.


Situation can be saved: more look-ahead

Initial state (LR(1) items):  [S′ → .S, $] ;  [S → .id, $] ;  [S → .V := E, $] ;  [V → .id, :=]
  --id-->  state containing:  [S → id., $]  and  [V → id., :=]

The (sketch of the) automaton here looks pretty similar to the previous one. However, we should now think of the look-ahead symbols as officially part of the items. The interesting piece in this example is the transition from the initial state, following the id-transition, to the state containing the items S → id. and V → id.. That was the state on the previous slide with the reduce/reduce conflict (on the following symbol $). Now, without showing the construction in detail (later we give at least the rules for the construction of the NFA, not the DFA with the closure): the interesting situation is, in the first state, the item [S → .V := E, $]. With the . in front of the V , that's when we have to take the ε-closure into account, basically also adding the initial items (here one initial item) for the productions for V . Now, by adding that item V → .id, we can use the additional “look-ahead piece of information” in that item to mark that V was added to the closure when being in front of a :=. That leads (in this situation) to the item of the form [V → .id, :=]. This information is more specific than the knowledge about the general follow-set of V , which contains := and $. By recording that extra piece of information in the closure, the state remembers that the only thing allowed to follow the V at this point is the :=. That will defuse the discussed conflict, namely as follows: if we follow the id-arrow, we end up in the state on the right-hand side. Such a transition does not touch the additional new look-ahead information (here the $ resp. the := symbol). Thus, in the state at the right-hand side, the reduce/reduce conflict has disappeared!

LALR(1) (and LR(1)): Being more precise with the follow-sets

• LR(0)-items: too “indiscriminate” wrt. the follow sets
• remember the definition of SLR(1) conflicts
• LR(0)/SLR(1)-states:
  – sets of items12 due to the subset construction
  – the items are LR(0)-items
  – follow-sets as an after-thought

12That won’t change in principle (but the items get more complex)


Add precision in the states of the automaton already

Instead of using LR(0)-items and, when the LR(0) DFA is done, trying to add a little disambiguation with the help of the follow sets for states containing complete items, it is better to make more fine-grained items from the very start:

• LR(1) items
• each item with “specific follow information”: look-ahead

LR(1) items

• main idea: simply make the look-ahead part of the item
• obviously: proliferation of states13

LR(1) items

[A → α.β, a]   (4.9)

• a: terminal/token, including $

LALR(1)-DFA (or LR(1)-DFA)

13Not to mention if we wanted look-ahead of k > 1, which in practice is not done, though.


Remarks on the DFA

• Cf. state 2 (seen before)
  – in SLR(1): problematic (reduce/reduce), as Follow(V ) = {:=, $}
  – now: disambiguation, by the added information

• LR(1) would give the same DFA

Full LR(1) parsing

• AKA: canonical LR(1) parsing
• the best you can do with 1 look-ahead
• unfortunately: big tables
• pre-stage to LALR(1)-parsing

SLR(1)

LR(0)-item-based parsing, with afterwards adding some extra “pre-compiled” info (about follow-sets) to increase expressivity

LALR(1)

LR(1)-item-based parsing, but afterwards throwing away precision by collapsing states, to save space

LR(1) transitions: arbitrary symbol

• transitions of the NFA (not DFA)

X-transition

[A → α.Xβ, a]   --X-->   [A → αX.β, a]

LR(1) transitions: ε

ε-transition

for all B → β1 | β2 . . . and all b ∈ First(γa):

[A → α.Bγ, a]   --ε-->   [B → .β, b]

including the special case (γ = ε): for all B → β1 | β2 . . .

[A → α.B, a]   --ε-->   [B → .β, a]
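The ε-rule can also be read as a closure operation on LR(1) items. A minimal Python sketch (illustrative names only; an LR(1) item is assumed to be encoded as (A, rhs, dot, lookahead), and first_of_word is assumed to return the First-set of a word, with "" standing for ε):

# Sketch: LR(1) closure, propagating the look-aheads b in First(gamma a).
def lr1_closure(items, grammar, first_of_word):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (A, rhs, dot, a) in list(result):
            if dot < len(rhs) and rhs[dot] in grammar:        # item [A -> alpha.B gamma, a]
                B, gamma = rhs[dot], rhs[dot + 1:]
                lookaheads = set(first_of_word(gamma + (a,))) # b in First(gamma a)
                lookaheads.discard("")                        # a is a token, so gamma a is not nullable
                for beta in grammar[B]:
                    for b in lookaheads:
                        item = (B, tuple(beta), 0, b)         # add [B -> .beta, b]
                        if item not in result:
                            result.add(item)
                            changed = True
    return frozenset(result)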

LALR(1) vs LR(1)

LALR(1)


LR(1)

Core of LR(1)-states

• actually: not done that way in practice
• main idea: collapse states with the same core

Core of an LR(1) state

= set of LR(0)-items (i.e., ignoring the look-ahead)

• observation: core of the LR(1) item = LR(0) item
• 2 LR(1) states with the same core have the same outgoing edges, and those lead to states with the same core


LALR(1)-DFA by collapse

• collapse all states with the same core
• based on the above observations: edges are also consistent
• result: almost like an LR(0)-DFA, but additionally
  – each individual item still has its look-ahead attached: the union of the look-aheads of the "collapsed" items
  – especially for states with complete items [A → α. , a, b, . . .]: the look-ahead set is smaller than the follow set of A
  – ⇒ fewer unresolved conflicts compared to SLR(1) (see also the sketch below)
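As a small sketch (assuming LR(1) states are represented as sets of items (lhs, rhs, dot, look-ahead), which is an assumption of this sketch, not the book's notation), the collapse can be expressed as grouping states by their LR(0) core and unioning their items:

# Sketch: collapsing LR(1) states into LALR(1) states by their LR(0) core.

def core(state):
    # the LR(0) core of a state: drop the look-ahead component of every item
    return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in state)

def collapse(lr1_states):
    merged = {}                                   # core -> union of states with that core
    for state in lr1_states:
        c = core(state)
        merged[c] = merged.get(c, frozenset()) | frozenset(state)
    return list(merged.values())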

Concluding remarks on LR / bottom-up parsing

• all constructions (here) based on BNF (not EBNF)
• conflicts (for instance due to ambiguity) can be solved by
  – reformulating the grammar, but generating the same language14
  – using directives in parser generator tools like yacc, CUP, bison (precedence, assoc.)
  – or (not yet discussed): solving them later via semantic analysis
  – NB: not all conflicts are solvable, also not in LR(1) (remember ambiguous languages)

LR/bottom-up parsing overview

         advantages                                    remarks
LR(0)    defines states also used by SLR and LALR      not really used, many conflicts, very weak
SLR(1)   clear improvement over LR(0) in expressive-   weaker than LALR(1), but often good enough.
         ness, even if using the same number of        Ok for hand-made parsers for small grammars
         states. Table typically with 50K entries
LALR(1)  almost as expressive as LR(1), but number     method of choice for most generated
         of states as LR(0)!                           LR-parsers
LR(1)    the method covering all bottom-up,            large number of states (typically 11M of
         one-look-ahead parseable grammars             entries), mostly LALR(1) preferred

Remember: once the table specific for LR(0), . . . is set up, the parsing algorithms all work the same

Error handling

Minimal requirement

Upon "stumbling over" an error (= deviation from the grammar): give a reasonable & understandable error message, indicating also the error location. Potentially stop parsing.

14 If designing a new language, there's also the option to massage the language itself. Note also: there are inherently ambiguous languages for which there is no unambiguous grammar.


• for parse error recovery
  – one cannot really recover from the fact that the program has an error (a syntax error is a syntax error), but
  – after giving a decent error message:
    ∗ move on, potentially jump over some subsequent code,
    ∗ until the parser can pick up normal parsing again
    ∗ so: meaningful checking of code even following a first error
  – avoid: reporting an avalanche of subsequent spurious errors (those just "caused" by the first error)
  – "picking up" again after semantic errors: easier than for syntactic errors

Error messages

• important:
  – avoid error messages that only occur because of an already reported error!
  – report an error as early as possible, if possible at the first point where the program cannot be extended to a correct program.
  – make sure that, after an error, one doesn't end up in an infinite loop without reading any input symbols.
• What's a good error message?
  – assume: the method factor() chooses the alternative ( exp ) but, when control returns from method exp(), it does not find a )
  – one could report: right parenthesis missing
  – but this may often be confusing, e.g., if the program text is: ( a + b c )
  – here the exp() method will terminate after ( a + b, as c cannot extend the expression. You should therefore rather give the message error in expression or right parenthesis missing.

Error recovery in bottom-up parsing

• panic recovery in LR-parsing
  – simple form
  – the only one we shortly look at
• upon error: recovery ⇒
  – pop parts of the stack
  – ignore parts of the input
• until "on track again"
• but: how to do that?
• additional problem: non-determinism
  – table: constructed conflict-free under normal operation
  – upon error (and clearing parts of the stack + input): no guarantee it's clear how to continue
  ⇒ heuristic needed (like panic mode recovery)

Panic mode idea

• try a fresh start
• a promising "fresh start" is: a possible goto action
• thus: back off and take the next such goto-opportunity


Possible error situation

   parse stack              input            action
1  $0 a1 b2 c3 (4 d5 e6     f ) gh . . . $   no entry for f
2  $0 a1 b2 c3 B v          gh . . . $       back to normal
3  $0 a1 b2 c3 B v g7       h . . . $        . . .

state   input                        goto
        . . .   )   f   g   . . .    . . .   A   B   . . .
. . .
3                                            u   v
4               −   −   −
5               −   −   −
6               −   −   −   −
. . .
u               −   −   reduce . . .
v               −   −   shift: 7
. . .

Panic mode recovery

Algo

1. Pop states from the stack until a state is found with non-empty goto entries
2. • If there is a legal action on the current input token from one of the goto-states: push that goto state on the stack, restart the parse.
   • If there are several such states: prefer a shift to a reduce
   • Among possible reduce actions: prefer one whose associated non-terminal is least general
3. If there is no legal action on the current input token from one of the goto-states: advance the input until there is a legal action (or until the end of input is reached) (see the sketch below)
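The following sketch mirrors those three steps (it is not the recovery code of any particular tool; the table-access functions goto_entries and action, and the representation of actions as tuples like ('shift', s) or ('reduce', p), are assumptions of this sketch):

# Sketch of panic-mode recovery for an LR parser.
# goto_entries(state): dict non-terminal -> goto state   (assumed helper)
# action(state, token): parse action tuple or None       (assumed helper)

def panic_recover(stack, tokens, pos, goto_entries, action):
    # 1. pop states until one with non-empty goto entries is found
    while stack and not goto_entries(stack[-1]):
        stack.pop()
    if not stack:
        return None                                   # give up
    # 2./3. skip input until some goto-state has a legal action on the current token
    while pos < len(tokens):
        candidates = []
        for nonterm, target in goto_entries(stack[-1]).items():
            act = action(target, tokens[pos])
            if act is not None:
                candidates.append((nonterm, target, act))
        if candidates:
            # prefer a shift over a reduce (picking the "least general"
            # non-terminal among several reduces is not modelled here)
            candidates.sort(key=lambda c: 0 if c[2][0] == 'shift' else 1)
            nonterm, target, act = candidates[0]
            stack.append(target)          # resume in that goto state
            return pos                    # normal parsing continues from here
        pos += 1                          # jump over the offending input symbol
    return None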

Example again

   parse stack              input            action
1  $0 a1 b2 c3 (4 d5 e6     f ) gh . . . $   no entry for f
2  $0 a1 b2 c3 B v          gh . . . $       back to normal
3  $0 a1 b2 c3 B v g7       h . . . $        . . .

• first pop, until in state 3
• then jump over input
  – until the next input g
  – since f and ) cannot be treated
• choose to goto v (shift in that state)

Panic mode may loop forever

   parse stack        input        action
1  $0                 ( n n ) $
2  $0 (6              n n ) $
3  $0 (6 n5           n ) $
4  $0 (6 factor4      n ) $
6  $0 (6 term3        n ) $
7  $0 (6 exp10        n ) $        panic!
8  $0 (6 factor4      n ) $        been there before: stage 4!


Panicking and looping

   parse stack        input        action
1  $0                 ( n n ) $
2  $0 (6              n n ) $
3  $0 (6 n5           n ) $
4  $0 (6 factor4      n ) $
6  $0 (6 term3        n ) $
7  $0 (6 exp10        n ) $        panic!
8  $0 (6 factor4      n ) $        been there before: stage 4!

• error raised in stage 7, no action possible
• panic:
  1. pop-off exp10
  2. state 6: 3 goto's

                                    exp    term        factor
     goto to                        10     3           4
     with n next: action there      —      reduce r4   reduce r6

  3. no shift, so we need to decide between the two reduces
  4. factor: less general, we take that one

How to deal with looping panic?

• make sure to detect loops (i.e., previous "configurations")
• if a loop is detected: don't repeat, but do something special, for instance
  – pop-off more from the stack, and try again
  – pop-off and insist that a shift is part of the options

Left out (from the book and the pensum)

• more info on error recovery
• especially: more on yacc error recovery
• it's not pensum, and for the oblig: one needs to deal with CUP-specifics (not classic yacc specifics, even if similar) anyhow, and error recovery is not part of the oblig (halfway decent error handling is).


Chapter 5
Semantic analysis

What is it about?
Learning Targets of this Chapter
1. "attributes"
2. attribute grammars
3. synthesized and inherited attributes
4. various applications of attribute grammars

Contents
5.1 Introduction
5.2 Attribute grammars

5.1 Introduction

Semantic analysis in general

Semantic analysis or static analysis is a very broad and diverse topic. The lecture concentrates on a few, but crucial aspects. This particular chapter here is concerned with attribute grammars. It's a generic or general "framework" for "semantic analysis". Later chapters also deal with semantic analysis, namely the one about symbol tables and the one about type checking. In the context of the lecture, those chapters all work basically on (abstract syntax) trees (except that for the symbol tables and for the type system, it's not so visible). The fact that it's a mechanism to "analyze trees" is most visible for attribute grammars: context-free grammars describe trees, and the semantic rules (see later) added to the grammar specify how to analyze the resulting trees.

Wrt. the general placement of semantic analysis in a compiler: First, not all semantic analyses are "tree analyses". Data flow analysis (which we touch upon later) often works on graphs (typically control flow graphs). Furthermore, it's not the case that semantic analysis is restricted to being done directly after parsing. There are many semantic analyses that are done at later stages (and on other representations). In particular, it could be that a later intermediate representation uses a different form of syntax, closer to machine code (often called intermediate code). That syntax could also be given by a grammar, meaning that a program in that syntax corresponds to a tree of that syntax. As a result, one can apply techniques like attribute grammars also at that level (maybe thereby using it on the AST, and later differently on some intermediate code).

Overview

On a very high level, the attribute grammar format does the following: it enhances a given grammar by additional, so-called semantic rules, which specify how trees conforming to the grammar should be analysed.

Two points might be noted here. First, the AG formalism adds rules on top of context-free grammars, but the intention is to specify analyses on trees formed according to the given grammar. Secondly, it's a specification of such tree analyses. The AG format is quite general, meaning that it allows to express all kinds of ways attributes should be evaluated. If not constrained in some way, the AG formalism can be seen as too expressive, in that it allows specifications that contradict themselves or do not lead to a proper implementation.

Part of the chapter therefore will be concerned with such restrictions. Semantic analysis receives from the parser an abstract syntax tree, and then "analyses" it. On a very high level, as far as attribute grammars are concerned, semantic analysis is about "tree algorithms".1 Attribute grammars is a formalism that takes context-free grammars and adds so-called "semantical rules" to it. AGs in their general form can be seen as a "specification" formalism for attributes in a grammar.

Side remark: XML

As a side remark, and not part of the technical content of the lecture: XML is an "exchange format" or markup language built around "trees". "Markup" is kind of like the opposite of "mark-down" (tongue-in-cheek): mark-down allows easy textual representation, optimized for "human consumption". Mark-up offers easy consumption for "machines" (easy, unique parsing, easy exchange of "texts"). That's why XML reads so horribly to the naked eye.2 Anyway, since pieces of XML-data are trees, there is also the notion of grammars according to which such trees are considered well-formed. In the XML terminology, that corresponds basically to schemas.3 That being so, there are tools that check whether a tree adheres to a given schema, a problem that in that form does not present itself in parsing: the parser process generates only trees in the AST format. Since XML processing is concerned with "tree processing" (checking, transformation etc.), there are some similarities with attribute grammars and some XML-related technologies. We don't go deeper than that here.

Overview over the chapter resp. SA in general

• semantic analysis in general
• attribute grammars (AGs)
• symbol tables (not today)
• data types and type checking (not today)

What do we get from the parser?

• output of the parser: (abstract) syntax tree
• often: in anticipation, nodes in the tree contain "space" to be filled out by SA
• examples:
  – for expression nodes: types
  – for identifier/name nodes: reference or pointer to the declaration

1 It should be noted that semantic analysis is not restricted to analysing abstract syntax trees that come out of the parser. That's, however, the placement in the lecture. Semantic analysis may also be applied to intermediate representations other than abstract syntax trees, one example being control flow graphs.
2 The build.xml from the oblig is some example of an xml-kind of file, used for "building" a project with ant.
3 In a UML context, the role of a grammar is taken by something with the slightly confusing title "meta-model".


[Figure: abstract syntax tree of the assignment a[index] = 2 + 4, shown twice: once as a plain AST (assign-expr, subscript-expr with identifiers a and index, additive-expr with numbers 2 and 4), and once with the "space" in the nodes filled with type annotations such as :int and :array of int.]

By "space", one might think of fields or instance variables in an object-oriented setting. Fields can be seen as one way to implement "attributes". When introducing attribute grammars, the notion of attribute will be a specific concept, namely the specific form of attributes in an attribute grammar. But very generally, an "attribute" means just a "property attached to some element". Typically here, attached to syntactic representations of the language, in particular to nodes in the abstract syntax tree. Since the notion of attribute is so general, it can take very different forms (like types, data flow information, all kinds of extra information). Also, attributes in that sense need not be "attached" to abstract syntax trees only. For instance, data flow information is extra information (calculated by data flow analysis) attached not to a syntax tree, but to something called a control-flow graph. Since such graphs are not described by context-free grammars, data flow analyses will not be described by attribute grammars.4

General: semantic (or static) analysis

Rule of thumb

Check everything which is possible before executing (run-time vs. compile-time), but which cannot already be done during lexing/parsing (syntactic vs. semantic analysis)

Rest

• Goal: fill out "semantic" info (typically in the AST)
• typically:
  – all names declared? (somewhere/uniquely/before use)
  – typing:
    ∗ is the declared type consistent with the use?
    ∗ are the types of (sub-)expressions consistent with the used operations?
• border between semantic vs. syntactic checking not always 100% clear
  – if a then ...: checked for syntax (and semantics)
  – if a + b then ...: semantic aspects as well?

4 Besides the reason mentioned —data-flow analyses typically operate on graphs, not trees— there is a second (but closely related) reason why DFA will in general not be done with AGs; the evaluation of AGs on a concrete tree explicitly disallows cycles in the dependency graph (see later). DFA in its general form definitely will have to handle cyclic situations.

SA is necessarily approximative

• note: not all can (precisely) be checked at compile-time
  – division by zero?
  – "array out of bounds"
  – "null pointer deref" (like r.a, if r is null)
• but note also: exact type cannot be determined statically either

if x then 1 else "abc"

• statically: ill-typed5
• dynamically ("run-time type"): string or int, or run-time type error, if x turns out not to be a boolean, or if it's null

The fact that one cannot precisely check everything at compile-time is due to fundamental reasons. It's fundamentally impossible to predict the behavior of a program (provided the programming language is expressive enough = Turing complete, which can be taken for granted for all general programming languages). The "fundamental reasons" mentioned above basically are a result of the famous halting problem. The particular version here is a consequence of that halting problem and is known as Rice's theorem. Actually it's more pessimistic than the sentence on the slide: Rice stipulates: all non-trivial semantic problems of a programming language are undecidable. If it were otherwise, the halting problem would be decidable as well (which it isn't, end-of-proof). Note that approximative checking is doable, resp. that's what the SA is doing anyhow.

As for type checking: the footnote refers to something which is a form of polymorphism, which is a form of "laxness" or "liberality" of the type system, allowing that some element of the language can have more than one type. In the particular example, it would be a specific form of polymorphism, namely (operator) overloading, in that + is used for addition as well as string concatenation. Additionally, in this particular situation, 1 is not just an integer, but also a string. The type checker may allow that, but if so, the later phases of the compiler must arrange it so that 1 is actually converted to a string (integers and strings are not represented uniformly, in that "42" and 42 typically do not have the same bit-level representation, so the compiler has to arrange something here).

An unrealistic dream

Spec. of the language's static semantics  →  "semantical yacc"  →  static semantic checker

• no standard description language
• no standard "theory"
  – part of SA may seem ad-hoc, more "art" than "engineering", complex
• but: well-established/well-founded (and non-ad-hoc) fields do exist
  – type systems, type checking
  – data-flow analysis . . .
• in general
  – semantic "rules" must be individually specified and implemented per language
  – rules: defined based on trees (for the AST): often straightforward to implement
  – clean language design includes clean semantic rules

5 Unless some fancy behind-the-scenes type conversions are done by the language (the compiler). Perhaps print(if x then 1 else "abc") is accepted, and the integer 1 is implicitly converted to "1".

When saying that there is no general standard theory: of course there would be the notion of context-sensitive grammars, a class of grammars more expressive than context-free grammars, while not yet as expressive as Turing machines (= full computation power). The notion of context-sensitive languages is surely well-defined, but as a formalism, it's too general, too unstructured to give much guiding light when it comes to the concrete problems being analysed. Context-sensitive grammars as such are not on the pensum.

5.2 Attribute grammars

Attributes

Attribute

• a "property" or characteristic feature of something
• here: of language "constructs"; more specifically in this chapter:
• of syntactic elements, i.e., of non-terminal and terminal nodes in syntax trees

Static vs. dynamic

• distinction between static and dynamic attributes
• association attribute ↔ element: binding
• static attributes: possible to determine at/determined at compile time
• dynamic attributes: the others . . .

With the concept of attribute so general, very many things can be subsumed under being an attribute of "something". After having a look at how attribute grammars are used for "attribution" (or "binding" of values of some attribute to a syntactic element), we will normally be concerned with more concrete attributes, like the type of something, or the value (and there are many other examples). In this very general use of the words, "attribute" and "attribution" (the act of attributing something to something) are almost synonymous with "analysis" (here semantic analysis). The analysis is concerned with figuring out the value of some attribute one is interested in, for instance the type of a syntactic construct. After having done so, the result of the analysis is typically remembered (as opposed to being calculated over and over again), but that's for efficiency reasons. One way of remembering attributes is in a specific data structure; for attributes of "symbols", that kind of data structure is known as the symbol table.

Examples in our context

• data type of a variable: static/dynamic
• value of an expression: dynamic (but seldom static as well)
• location of a variable in memory: typically dynamic (but in old FORTRAN: static)
• object code: static (but also: dynamic loading possible)


The value of an expression, as stated, is typically not a static "attribute" (for reasons which I hope are clear). Later in this chapter, we will actually use values of expressions as attributes. That can be done, for instance, if there are no variables mentioned in the expressions. The values of such variables typically are not known at compile-time and would not allow calculating the value of the expression at compile time. However, having no variables is exactly the situation we will see later.

As a side remark: even with variables, sometimes the compiler can figure out that, in some situations, the value of a variable at some point is known in advance. In that case, an optimization could be to precompute the value and use that instead. Figuring out whether or not that is the case is typically done via data-flow analysis, which operates on control-flow graphs (not trees). That is therefore not done via attribute grammars in general.

Attribute grammar in a nutshell

• AG: general formalism to bind “attributes to trees” (where trees are given by a CFG)6

• two potential ways to calculate “properties” of nodes in a tree:

“Synthesize” properties

define/calculate prop’s bottom-up

“Inherit” properties

define/calculate prop’s top-down

• allows both at the same time

Attribute grammar

CFG + attributes on grammar symbols + rules specifying, for each production, how to determine the attributes

• evaluation of attributes: requires some thought, more complex if mixing bottom-up + top-down dependencies

Example: evaluation of numerical expressions

Expression grammar (similar as seen before)

exp    → exp + term | exp − term | term
term   → term ∗ factor | factor
factor → ( exp ) | n

• goal now: evaluate a given expression, i.e., the syntax tree of an expression, resp:

6Attributes in AG’s: static, obviously.


more concrete goal

Specify, in terms of the grammar, how expressions are evaluated

• grammar: describes the "format" or "shape" of (syntax) trees
• syntax-directedness
• value of (sub-)expressions: attribute here

As stated earlier: values of syntactic entities are generally dynamic attributes and cannot therefore be treated by an AG. In this simplistic example of expressions, it's statically doable (because there are no variables and no state-change etc.).

Expression evaluation: how to do it on one's own?

• simple problem, easily solvable without having heard of AGs
• given an expression, in the form of a syntax tree
• evaluation:
  – simple bottom-up calculation of values
  – the value of a compound expression (parent node) is determined by the values of its subnodes
  – realizable, for example, by a simple recursive procedure

Connection to AG’s

• AGs: basically a formalism to specify things like that
• however: general AGs will allow more complex calculations:
  – not just bottom-up calculations like here, but also
  – top-down, including both at the same time

When talking about recursive procedures, we mean not just direct recursion. Often a number of mutually recursive procedures is needed, for example, one for factors, one for terms, etc. See the next slide. The use of such a recursive arrangement may remind us of the sections about top-down parsing.

As mentioned, AGs make use of more complex "strategies", not just pure bottom-up or pure top-down; even mixed ones exist. To evaluate the simple expressions here, a pure bottom-up evaluation strategy works well.

Pseudo code for evaluation

eval_exp(e) =
  case
  :: e matches PLUSnode ->
        return eval_exp(e.left) + eval_term(e.right)
  :: e matches MINUSnode ->
        return eval_exp(e.left) - eval_term(e.right)
  ...
  end case
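A possible concrete version of this pseudo code, written here as a single recursive function over a made-up AST representation rather than one procedure per non-terminal (the node classes are assumptions of this sketch):

# Sketch: bottom-up evaluation of expression trees by structural recursion.

class Num:
    def __init__(self, value): self.value = value

class Plus:
    def __init__(self, left, right): self.left, self.right = left, right

class Minus:
    def __init__(self, left, right): self.left, self.right = left, right

def eval_exp(e):
    if isinstance(e, Num):
        return e.value
    if isinstance(e, Plus):
        return eval_exp(e.left) + eval_exp(e.right)
    if isinstance(e, Minus):
        return eval_exp(e.left) - eval_exp(e.right)
    raise ValueError("unknown node")

print(eval_exp(Minus(Plus(Num(3), Num(4)), Num(2))))   # (3 + 4) - 2 = 5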


AG for expression evaluation

   productions/grammar rules      semantic rules
1  exp1 → exp2 + term             exp1.val = exp2.val + term.val
2  exp1 → exp2 − term             exp1.val = exp2.val − term.val
3  exp → term                     exp.val = term.val
4  term1 → term2 ∗ factor         term1.val = term2.val ∗ factor.val
5  term → factor                  term.val = factor.val
6  factor → ( exp )               factor.val = exp.val
7  factor → n                     factor.val = n.val

• specific for this example is:
  – only one attribute (for all nodes), in general: different ones possible
  – (related to that): only one semantic rule per production
  – as mentioned: rules here define values of attributes "bottom-up" only
• note: subscripts on the symbols for disambiguation (where needed)

Attributed parse tree

The attribute grammar (being purely synthesized = bottom-up) is very simple and hence, the values in the attribute val should be self-explanatory.


Possible dependencies

Possible dependencies (> 1 rule per production possible)

• parent attribute on children attributes
• attribute in a node dependent on other attributes of the same node
• child attribute on parent attribute
• sibling attribute on sibling attribute
• mixture of all of the above at the same time
• but: no immediate dependence across generations

The way that attribute grammars specify dependencies, namely taking a grammar and adding semantic rules on top of the given productions, puts those restrictions on the (direct) dependencies. It's quite natural. One cannot specify a dependency between attributes making use of two or more productions. Of course, there can be an indirect dependency.

Attribute dependence graph

• dependencies ultimately between attributes in a syntax tree (instances), not between grammar symbols as such
⇒ attribute dependence graph (per syntax tree)
• complex dependencies possible:
  – evaluation complex
  – invalid dependencies possible, if not careful (especially cyclic ones)

Sample dependence graph (for later example)

The graph belongs to an example we will revisit later. The dashed lines represent the AST, the bold arrows the dependence graph. Later, we will classify the attributes; in this example, base (at least for the non-terminal num) is inherited ("top-down"), whereas val is synthesized ("bottom-up").

We will later have a closer look at what synthesized and inherited mean. As we see in the example already here, being synthesized is (in its more general form) not as simplistic as "dependence only on attributes of children". In the example, the synthesized attribute val depends on its inherited "sister attribute" base in most nodes. So, synthesized is not only "strictly bottom-up", it also goes "sideways" (from base to val). Now, this "sideways" dependence goes from inherited to synthesized only, but never the other way around. That's fortunate, because in this way it's immediately clear that there are no cycles in the dependence graph. An evaluation (see later) following this form of dependence is "down-up", i.e., first top-down, and afterwards bottom-up (but not then down again etc.; the evaluation does not go into cycles).

Two-phase evaluation

Perhaps a too fine point concerning evaluation in the example. The above explanation highlighted that the evaluation is "phased" into first a top-down phase and afterwards a bottom-up phase. Conceptually, that is correct and gives a good intuition about the design of the dependencies of the attributes. Two "refinements" of that picture may be in order, though. First, as explained later, a dependence graph does not represent one possible evaluation (so it makes no real sense to speak of "the" evaluation of the given graph, if we think of the edges as individual steps). The graph denotes which values need to be present before another value can be determined. Secondly, and related to that: if we take that view seriously, it's not strictly true that all inherited dependencies are evaluated before all synthesized ones. "Conceptually" they are, in a way, but there is an amount of "independence" or "parallelism" possible. Looking at the following picture, which depicts one of many possible evaluation orders, shows that, for example, step 8 is filling an inherited attribute, and that comes after step 6, which deals with a synthesized one. But both steps are independent, so they could as well be done the other way around.

So, the picture "first top-down, then bottom-up" is conceptually correct and a good intuition, but it needs some fine-tuning when talking about when an individual step-by-step evaluation is done.

Possible evaluation order

The numbers in the picture give one possible evaluation order. As mentioned earlier, there is in general more than one possible way to evaluate a dependence graph, in particular when dealing with a syntax tree, and not with the degenerate case of a "syntax list" (considering lists as a degenerate form of trees). Generally, the rules that say when an AG is properly done assure that all possible evaluations give a unique value for all attributes, and the order of evaluation does not matter. Those conditions assure that each attribute instance gets a value exactly once (which also implies there are no cycles in the dependence graph).


Restricting dependencies

• general AGs allow basically any kind of dependencies7
• complex/impossible to meaningfully evaluate (or understand)
• typically: restrictions, disallowing "mixtures" of dependencies
  – fine-grained: per attribute
  – or coarse-grained: for the whole attribute grammar

Synthesized attributes

bottom-up dependencies only (same-node dependency allowed).

Inherited attributes

top-down dependencies only (same-node and sibling dependencies allowed)

The classification into inherited = top-down and synthesized = bottom-up is a general guiding light. The discussion about the previous figures showed that there may be some refinements, like that "sideways" dependencies are acceptable, not only strictly bottom-up dependencies.

Synthesized attributes (simple)

Synthesized attribute

A synthesized attribute is defined wholly in terms of the node's own attributes, and those of its children (or constants).

Rule format for synth. attributes

For a synthesized attribute s of non-terminal A, all semantic rules with A.s on the left-hand side must be of the form

A.s = f(X1.b1, . . . Xn.bk) (5.1)

and where the semantic rule belongs to production A→ X1 . . . Xn

• Slight simplification in the formula.

The "simplification" here is that we ignore the fact that one symbol can in general have many attributes. So, we just write X1.b1 instead of X1.b1,1 . . . X1.b1,k1, which would more "correctly" cover the situation in all generality, but doing so would not make the points any clearer.

7Apart from immediate cross-generation dependencies.


S-attributed grammar:

all attributes are synthesized

The simplification mentioned is to make the rules more readable, to avoid all the subscripts, while keeping the spirit. The simplification is that we consider only one attribute per symbol. In general, instead of depending on A.a only, dependencies on A.a1, . . ., A.al are possible. Similarly for the rest of the formula.

Remarks on the definition of synthesized attributes

• Note the following aspects
  1. a synthesized attribute in a symbol: cannot at the same time also be "inherited".
  2. a synthesized attribute:
     – depends on attributes of children (and other attributes of the same node) only. However:
     – those attributes need not themselves be synthesized (see also the next slide)
• in Louden:
  – he does not allow "intra-node" dependencies
  – he assumes (in his wording): attributes are "globally unique"

Unfortunately, depending on the text-book, the exact definitions (or the way they are formulated) of synthesized and inherited slightly deviate. But in spirit, of course, they all agree in principle. Without going into detail, one may find different opinions on the question: can a synthesized attribute be inherited at the same time? AGs would allow that an attribute (in a non-terminal) is dependent on parents at the same time as on children. That's generally not useful, so most books define being synthesized as depending only on children (and perhaps siblings) and rule out the non-useful case. Basically interpreting "synthesized" as "synthesized-only", so to say. Others choose the words differently, and then say that attributes (to be usefully evaluated) must not be both synthesized and inherited at the same time. That confusion may be frustrating, but it's a matter of terminology, not substance. Both terminological camps would agree: a double dependency from both above in the tree and, at the same time for the same attribute, from below in the tree, must be avoided, to allow evaluation.

The lecture is not so much concerned with the super-fine print in definitions or the best terminology, more with questions like "given the following problem, write an AG", and the conceptual picture of synthesized (bottom-up and a bit of sideways) and inherited (top-down and perhaps a bit of sideways) helps in thinking about that problem. Of course, all books agree: cycles must be avoided and all attributes need to be uniquely defined. The concepts of synthesized and inherited attributes thereby help to clarify thinking about those problems. For instance, the "phased" evaluation discussed earlier (first down with the inherited attributes, then up with the synthesized ones) makes clear: there can't be a cycle.

Don’t forget the purpose of the restriction

• ultimately: calculate values of the attributes
• thus: avoid cyclic dependencies
• one single synthesized attribute alone does not help much


S-attributed grammar

• restriction on the grammar, not just 1 attribute of one non-terminal
• simple form of grammar
• remember the expression evaluation example

S-attributed grammar:

all attributes are synthesized

Alternative, more complex variant

“Transitive” definition (A→ X1 . . . Xn)

A.s = f(A.i1, . . . , A.im, X1.s1, . . . Xn.sk)

• in the rule: the Xi.sj's are synthesized, the A.ij's inherited
• interpret the rule carefully: it says:
  – it's allowed to have synthesized & inherited attributes for A
  – it does not say: attributes in A have to be inherited
  – it says: in an A-node in the tree, a synthesized attribute
    ∗ can depend on inherited attributes in the same node and
    ∗ on synthesized attributes of A-children-nodes

Pictorial representation

Conventional depiction

General synthesized attributes

Note that the previous example discussing the dependence graph with attributes base and val was of this format and followed the convention: show the inherited base on the left, the synthesized val on the right.


Inherited attributes

• in Louden’s simpler setting: inherited = non-synthesized

Inherited attribute

An inherited attribute is defined wholly in terms of the node's own attributes, and those of its siblings or its parent node (or constants).

Rule format

Rule format for inh. attributes

For an inherited attribute i of a symbol X, all semantic rules mentioning X.i on the left-hand side must be of the form

X.i = f(A.a, X1.b1, . . . , X, . . .Xn.bk)

and where the semantic rule belongs to production A→ X1 . . . X, . . .Xn

• note: the mentioning of "all rules" is to avoid conflicts.

Alternative definition (“transitive”)

Rule format

For an inherited attribute i of a symbol X, all semantic rules mentioning X.i on the left-hand side must be of the form

X.i = f(A.i′, X1.b1, . . . , X.b, . . . Xn.bk)

and where the semantic rule belongs to production A→ X1 . . . X . . .Xn

• additional requirement: A.i′ inherited
• rest of the attributes: inherited or synthesized


Simplistic example (normally done by the scanner)

CFG

number → number digit | digit
digit  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Attributes (just synthesized)

number      val
digit       val
terminals   [none]

We will look at an AG solution. In practice, this conversion is typically done by the scanner already, and the way it's normally done relies on provided functions of the implementing programming language (all languages will support such conversion functions, either built-in or in some libraries). For instance in Java, one could use the method valueOf(String s), for instance used as the static method Integer.valueOf("900") of the class of integers. Obviously, not everything done by an AG can be done already by the scanner. But this particular example, used as a warm-up, is so simple that it could be done by the scanner, and that's where it's mostly done anyway.

Numbers: Attribute grammar and attributed tree

A-grammar


attributed tree

Attribute evaluation: works on trees

i.e.: works equally well for

• abstract syntax trees
• ambiguous grammars

Seriously ambiguous expression grammar

exp → exp + exp | exp− exp | exp ∗ exp | ( exp ) | n

Alternatively: the grammar is not meant as describing the syntax for the parser, it's meant as a grammar describing nice and clean ASTs for an underlying, potentially less nice grammar used for parsing. Remember: grammars describe trees, and one can use EBNF to describe ASTs.

Evaluation: Attribute grammar and attributed tree

A-grammar


Attributed tree

Expressions: generating ASTs

Expression grammar with precedences & assoc.

exp    → exp + term | exp − term | term
term   → term ∗ factor | factor
factor → ( exp ) | n

Attributes (just synthesized)

exp, term, factor   tree
n                   lexval

Expressions: Attribute grammar and attributed tree

A-grammar


A-tree

The AST looks a bit bloated. That's because the grammar was massaged in such a way that precedences and associativities are dealt with properly during parsing. Thus the grammar describes more a parse tree rather than an AST, which would often be less verbose. But the AG formalism itself does not care about what the grammar describes (a grammar used for parsing or a grammar describing the abstract syntax); it does especially not care if the grammar is ambiguous.

Example: type declarations for variable lists

CFG

decl      → type var-list
type      → int
type      → float
var-list1 → id , var-list2
var-list  → id

• Goal: attribute type information to the syntax tree
• attribute: dtype (with values integer and real)
• complication: "top-down" information flow: the type declared for a list of vars ⇒ inherited to the elements of the list

Concerning dtype: there are thus 2 different attribute values. We don't mean "the attribute dtype has integer values", like 0, 1, 2, . . .


Types and variable lists: inherited attributes

grammar productions             semantic rules
decl → type var-list            var-list.dtype = type.dtype
type → int                      type.dtype = integer
type → float                    type.dtype = real
var-list1 → id , var-list2      id.dtype = var-list1.dtype
                                var-list2.dtype = var-list1.dtype
var-list → id                   id.dtype = var-list.dtype

• inherited: attribute for id and var-list
• but also a synthesized use of attribute dtype: for type.dtype8

The dependencies are (especially for the variable lists) such that the attribute of a later element depends on an earlier one; in other words, the type information propagates from left to right through the "list". Seen as a tree, that means the information propagates top-down in the tree. That can be seen in the next (quite small) example: the type information (there float) propagates down the right branch of the tree, which corresponds to the list of two variables x and y.
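A tiny sketch of that left-to-right propagation, assuming a made-up representation of a declaration as a pair of a type name and a list of identifiers (not the AST of the lecture):

# Sketch: propagating the inherited attribute dtype through a variable list.

def attribute_decl(decl):
    type_name, var_list = decl                               # e.g. ('float', ['x', 'y'])
    dtype = {'int': 'integer', 'float': 'real'}[type_name]   # type.dtype
    return {ident: dtype for ident in var_list}              # id.dtype = var-list.dtype

print(attribute_decl(('float', ['x', 'y'])))   # {'x': 'real', 'y': 'real'}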

Types & var lists: after evaluating the semantic rules

float id(x), id(y)

Attributed parse tree

Dependence graph

8 Actually, it's conceptually better not to think of it as "the attribute dtype"; it's better to think of it as "the attribute dtype of non-terminal type" (written type.dtype) etc. Note further: type.dtype is not yet what we called an instance of an attribute.


Example: Based numbers (octal & decimal)

Explanations

The based numbers are a rather well-known example for illustrating synthesized and inherited attributes. Well-known insofar as they are covered in many text-books talking about AGs. The fact that both inherited and synthesized attributes are involved can easily be seen intuitively: if one wants to evaluate such a number, one would do that left-to-right (which corresponds to top-down); however, the evaluation does not yet know how to calculate until it sees the last piece of information, the specification of what number system to use (decimal or octal). That piece of information has to be calculated resp. carried along "in the opposite direction".

In a way, the notation is designed in a silly way: it's like having a compressed or encrypted file, and then putting the kind of meta-information on how to interpret the data not into the header, where it would belong, but at the end. . .

• remember: grammar for numbers (in decimal notation)
• evaluation: synthesized attributes
• now: generalization to numbers with decimal and octal notation

Context-free grammar

based-num → num base-char
base-char → o
base-char → d
num       → num digit
num       → digit
digit     → 0
digit     → 1
. . .
digit     → 7
digit     → 8
digit     → 9

Based numbers: attributes

Attributes

• based-num.val: synthesized
• base-char.base: synthesized
• for num:
  – num.val: synthesized
  – num.base: inherited
• digit.val: synthesized

• 9 is not an octal character ⇒ attribute val may get the value "error"!


Based numbers: a-grammar

The attribute grammar should be rather straightforward, and the next slides will shed light on the dependencies and the evaluation. That illustrates the synthesized vs. the inherited parts perhaps more clearly than the equations of the semantic rules. As mentioned in the slides: the evaluation can lead to errors insofar as for base-8 numbers, the characters 8 and 9 are not allowed. Technically, to be a proper attribute grammar, a value needs to be attached to each attribute instance for each tree. If we take that seriously, it requires that we give back an "error" value, as can be seen in the code of the semantic rules. If we take that even more seriously, it would mean that the "type" of the val attribute is not just integers, but integers or an error value.

In a practical implementation, one would probably rather operate with exceptions to achieve the same. Technically, an exception is not an ordinary value which is given back, but interrupts the standard control-flow as well. That kind of programming convenience is outside the (purely functional/equational) framework of AGs, and therefore the given semantic rules deal with the extra error value explicitly and the evaluation propagates errors explicitly; since the errors occur during the "calculation phase", i.e., when dealing with the synthesized attribute, an error is propagated upwards in the tree.
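For illustration, a small sketch that evaluates a based number directly from its textual form (instead of over an explicit syntax tree, so it only mimics the attribute grammar): the inherited base must be known before the digit values can be combined, and an error value is propagated as in the semantic rules.

# Sketch: evaluating based numbers such as "345o" (octal) or "345d" (decimal).

def eval_based_num(text):
    base = {'o': 8, 'd': 10}[text[-1]]     # base-char.base, found at the end
    val = 0
    for ch in text[:-1]:                   # left to right through num
        digit = int(ch)
        if digit >= base:                  # e.g. 8 or 9 in an octal number
            return 'error'
        val = val * base + digit           # num1.val = num2.val * base + digit.val
    return val

print(eval_based_num("345o"))   # 229
print(eval_based_num("345d"))   # 345
print(eval_based_num("389o"))   # error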


Based numbers: after eval of the semantic rules

Attributed syntax tree

Based nums: Dependence graph & possible evaluation order


Dependence graph & evaluation

• evaluation order must respect the edges in the dependence graph
• cycles must be avoided!
• directed acyclic graph (DAG)
• dependence graph ∼ partial order
• topological sorting: turning a partial order into a total/linear order (which is consistent with the PO)
• roots in the dependence graph (not the root of the syntax tree): their values must come "from outside" (or be constant)
• often (and sometimes required): terminals in the syntax tree:
  – terminals synthesized / not inherited
  ⇒ terminals: roots of the dependence graph
  ⇒ get their value from the parser (token value)

A DAG is not a tree, but a generalization thereof. It may have more than one "root" (like a forest). Also: "shared descendants" are allowed. But no cycles.

As for the treatment of terminals, resp. the restrictions some books require: an alternative view is that terminals get token values "from outside", namely from the lexer. They are as if they were synthesized, except that the value comes "from outside" the grammar.

Evaluation: parse tree method

For acyclic dependence graphs: possible “naive” approach

Parse tree method

Linearize the given partial order into a total order (topological sorting), and then simply evaluate the equations following that order.
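A sketch of that method, assuming the dependence graph is given as a mapping from attribute instances to the instances they depend on, together with one equation per instance (all names here are made up for illustration, not the book's notation):

# Sketch: "parse tree method" = evaluate attribute instances in a topological
# order of the dependence graph; each instance is evaluated exactly once.

def evaluate(deps, equations):
    values, done = {}, set()
    remaining = set(deps)
    while remaining:
        ready = [a for a in remaining if deps[a] <= done]   # all dependencies known
        if not ready:
            raise ValueError("cyclic dependence graph")
        for a in ready:
            values[a] = equations[a](values)
            done.add(a)
            remaining.discard(a)
    return values

# tiny usage example with three attribute instances:
deps = {'digit.val': set(), 'num.base': set(),
        'num.val': {'digit.val', 'num.base'}}
eqs  = {'digit.val': lambda v: 7,
        'num.base':  lambda v: 8,
        'num.val':   lambda v: v['digit.val'] if v['digit.val'] < v['num.base'] else 'error'}
print(evaluate(deps, eqs))   # num.val evaluates to 7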

Rest

• works only if all dependence graphs of the AG are acyclic
• acyclicity of the dependence graphs?


– decidable for given AG, but computationally expensive9

– don’t use general AGs but: restrict yourself to subclasses

• disadvantage of the parse tree method: also, a not very efficient check per parse tree

Observation on the example: Is evaluation (uniquely) possible?

• all attributes: either inherited or synthesized10
• all attributes: must actually be defined (by some rule)
• guaranteed in that, for every production:
  – all synthesized attributes (on the left) are defined
  – all inherited attributes (on the right) are defined
  – local loops forbidden
• since all attributes are either inherited or synthesized: each attribute in any parse tree is defined, and defined only one time (i.e., uniquely defined)

Loops

• loops intolerable for evaluation
• difficult to check (exponential complexity).

Acyclicity checking for a given dependence graph: not so hard (e.g., using topological sorting). Here the question is: for all syntax trees.

Variable lists (repeated)

Attributed parse tree

Dependence graph

9 On the other hand: the check needs to be done only once.
10 base-char.base (synthesized) is considered different from num.base (inherited)


Typing for variable lists

• the code assumes: the tree is given

The assumption that the tree is given is reasonable if dealing with ASTs. For parse trees, the attribution of types must deal with the fact that the parse tree is being built during parsing. It also means: it typically "blurs" the border between context-free and context-sensitive analysis.

L-attributed grammars

• goal: AG suitable for "on-the-fly" attribution
• all parsing works left-to-right.

L-attributed grammar

An attribute grammar for attributes a1, . . . , ak is L-attributed, if for each inherited attribute aj and each grammar rule

X0 → X1 X2 . . . Xn ,

the associated equations for aj are all of the form

Xi.aj = fij(X0.~a, X1.~a, . . . , Xi−1.~a) ,

where additionally for X0.~a, only inherited attributes are allowed.


Rest

• X.~a: short-hand for X.a1 . . . X.ak

• Note: S-attributed grammar ⇒ L-attributed grammar

Nowadays, doing it on-the-fly is perhaps not the most important design criterion.

“Attribution” and LR-parsing

• easy (and typical) case: synthesized attributes
• for inherited attributes:
  – not quite so easy
  – perhaps better: not "on-the-fly", i.e.,
  – better postponed to a later phase, when the AST is available.
• implementation: additional value stack for synthesized attributes, maintained "besides" the parse stack

Example: value stack for synth. attributes

Sample action

E : E + E { $$ = $1 + $3 ; }

in (classic) yacc notation

Value stack manipulation: that's what's going on behind the scenes
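Schematically (not tied to yacc or CUP; the function names here are made up), the effect of the action above on the value stack during a reduce by E → E + E looks like this:

# Sketch: a value stack maintained besides the parse stack. On a reduce by
# E -> E + E the right-hand side values are popped and $$ = $1 + $3 is pushed.

value_stack = []

def shift(token_value):
    value_stack.append(token_value)

def reduce_E_plus_E():
    right = value_stack.pop()          # $3
    value_stack.pop()                  # $2: the '+' token carries no useful value
    left = value_stack.pop()           # $1
    value_stack.append(left + right)   # $$

shift(3); shift('+'); shift(4)
reduce_E_plus_E()
print(value_stack)   # [7]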


Chapter 6
Symbol tables

What is it about?
Learning Targets of this Chapter
1. symbol table data structure
2. design and implementation choices
3. how to deal with scopes
4. connection to attribute grammars

Contents
6.1 Introduction
6.2 Symbol table design and interface
6.3 Implementing symbol tables
6.4 Block-structure, scoping, binding, name-space organization
6.5 Symbol tables as attributes in an AG

6.1 Introduction

Symbol tables, in general

• central data structure
• "data base" or repository associating properties with "names" (identifiers, symbols)1
• declarations
  – constants
  – type declarations
  – variable declarations
  – procedure declarations
  – class declarations
  – . . .
• declaring occurrences vs. use occurrences of names (e.g. variables)
• goal: associate attributes (properties) to syntactic elements (names/symbols)
• storing once calculated (costs memory) ↔ recalculating on demand (costs time)
• most often: storing preferred
• but: can't I store it in the nodes of the AST?
  – remember: attribute grammars
  – however, fancy attribute grammars with many rules and complex synthesized/inherited attributes (whose evaluation traverses up and down and across the tree):
    ∗ might be intransparent
    ∗ storing info in the tree: might not be efficient
  ⇒ central repository (= symbol table) better

1Remember the (general) notion of “attribute”.


So: do I need a symbol table?

In theory, alternatives exist; in practice, yes, symbol tables are the way to go; most compilers do use symbol tables.

Most often (and in our course), the symbol table is set up once, containing all the symbols that occur in a given program, and then the semantic analyses (type checking, etc.) update the table accordingly. Implicit in that is that the symbol table is "static" (i.e., part of the static phase of the compiler). There are also some languages which allow "manipulation" of symbol tables at run time (Racket is one (formerly PLT Scheme)).

The slides make a point that basically every compiler has a symbol table (or even more than one). You find statements on the internet that symbol tables are not needed or even to be avoided. For instance, the stack overflow wisdom "no symbol tables in Go" claims that there are no symbol tables in Go (and in functional languages). It's not clear how reliable that information is, because here's a link https://golang.org/pkg/debug/gosym/ to the official Go implementation, referring to symbol tables.

6.2 Symbol table design and interface

Symbol table as abstract data type

• separate interface from implementation
• ST: "nothing else" than a lookup table or dictionary
• associating "keys" with "values"
• here: keys = names (id's, symbols), values = the attribute(s)

Schematic interface: two core functions (+ more)

• insert: add new binding
• lookup: retrieve

besides the core functionality:

• structure of (different?) name spaces in the implemented language, scoping rules
• typically: not one single "flat" namespace ⇒ typically not one big flat look-up table
⇒ influence on the design/interface of the ST (and indirectly the choice of implementation)
• necessary to "delete" or "hide" information (delete)

A symbol table is, typically, not just a "flat" dictionary, neither conceptually nor in the way it's implemented. Scoping typically is something that often complicates the design of the symbol table.
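To make the interface concrete, here is a small sketch of a symbol table with nested scopes, kept as a stack of per-scope dictionaries (the method names insert, lookup, enter_scope, exit_scope are chosen for this sketch, not prescribed by the book):

# Sketch: symbol table as a stack of per-scope dictionaries.
# insert adds to the innermost scope; lookup searches from the innermost
# scope outwards; entering/leaving a block pushes/pops a scope.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]              # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, attributes):
        self.scopes[-1][name] = attributes

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

st = SymbolTable()
st.insert('i', 'int')       # global i
st.enter_scope()
st.insert('i', 'char')      # local i shadows the global one
print(st.lookup('i'))       # char
st.exit_scope()
print(st.lookup('i'))       # int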

It should also be clear from the context of discussion: when we speak of the value of an attribute, we typically don't mean the semantic value of the symbol, like the integer value of an expression. The value of an attribute is meant in the "meta"-way, the value that the analysis attaches to the entity, for instance its type, its address, etc. (and only in rather rare cases, its programming-language-level value). The situation is the same as for attribute grammars, and indeed, symbol tables can be seen as a data structure realizing "attributes". See also the next slide, contrasting two ways of attaching "attributes" to entities in a (syntax) tree: "internal", as part of the nodes, or external, in a separate repository (known as symbol table).


Two main philosophies

traditional table(s)

• central repository, separate from AST
• interface
  – lookup(name),
  – insert(name, decl),
  – delete(name)
• last 2: update ST for declarations and when entering/exiting blocks

decls. in the AST nodes

• do look-up ⇒ tree-search
• insert/delete: implicit, depending on relative positioning in the tree
• look-up:
  – efficiency?
  – however: optimizations exist, e.g. a "redundant" extra table (similar to the traditional ST)

Here, for concreteness, declarations are the attributes stored in the ST. In general, they are not the only possible stored attributes. Also, there may be more than one ST.

Languages often have different "name spaces". Even a relatively old-school language like C has 4 different name spaces for identifiers. There are different kinds of identifiers, and different rules (for instance wrt. scoping) apply to them. One way to arrange them could be to have different symbol tables, one specifically for each name space. Later we will also have the situation (but not caused by different kinds of identifiers) where the symbol table is arranged in such a way that smaller symbol tables (per scope) are linked together, where a symbol table of a "surrounding" scope points to a symbol table representing a scope nested deeper. One might see that as having "many" symbol tables, but maybe that's misleading. It's more an internal representation with a linked structure, but that data structure containing many individual tables is better seen conceptually as one symbol table (the symbol table of the language), but one with a complex behavior reflecting the lexical scoping of the language. Actually, whether one implements it by chaining up a bunch of individual hash tables or similar structures or by a different representation is a design choice, both realizing the same external behavior at the interface. In that spirit, also the remark that C has 4 different name spaces (which is true) and therefore a C compiler may make use of 4 symbol tables is a matter of how one sees (and implements) it: one may as well see and implement it as one symbol table (with 4 different kinds of identifiers which are treated differently).

A cautionary note: you may find the statement that C (being old-fashioned) does not feature name spaces. The discussion here was about the internal organization and scoping rules for various classes of identifiers in C, which internally form 4 different name spaces. But C does not have elaborate user-level mechanisms to introduce name spaces; therefore, one may stumble upon statements like "C does not support name spaces". . .

6.3 Implementing symbol tables

Data structures to implement a symbol table

• different ways to implement dictionaries (or look-up tables etc.)
  – simple (association) lists
  – trees
    ∗ balanced (AVL, B, red-black, binary-search trees)
  – hash tables, often the method of choice
  – functional vs. imperative implementation
• careful choice influences efficiency
• influenced also by the language being implemented
• in particular, by its scoping rules (or the structure of the name space in general) etc.2

Nested block / lexical scope

for instance: C

{ int i ; . . . ; double d ;
  void p ( . . . ) ;
  {
    int i ;
    . . .
  }
  int j ;
  . . .

more later

Blocks in other languages

TeX

\def\x{a}
{
\def\x{b}
\x
}
\x
\bye

LaTeX

\documentclass{article}
\newcommand{\x}{a}
\begin{document}
\x
{\renewcommand{\x}{b}
\x}
\x
\end{document}

But remember: static vs. dynamic binding (see later)

LaTeX and TeX are chosen so one can easily try out the result oneself (assuming that most people have access to LaTeX and, by implication, TeX). TeX is the underlying "core" on top of which LaTeX is put. There are other formats on top of TeX (texi is another one; texi is involved, for instance, in typesetting the pdf version of the Compila language specification).

2Also the language used for implementation (and the availability of libraries therein) may play a role


Hash tables

• classical and common implementation for STs
• "hash table":
  – a generic term itself, different general forms of HTs exist
  – e.g. separate chaining vs. open addressing

There exists alternative terminology (cf. INF2220 in the older numbering scheme, it's the algo & data structures lecture), under which separate chaining is also known as open hashing. The open addressing methods are also called closed hashing. It's confusing, but that's how it is, and it's just words.

Separate chaining

Code snippet

{ int temp ;
  int j ;
  real i ;
  void size (....) {
    {
      ....
    }
  }
}

Block structures in programming languages

• almost no language has one global namespace (at least not for variables)
• pretty old concept, seriously started with ALGOL60

Block

• “region” in the program code
• delimited often by { and } or BEGIN and END or similar
• organizes the scope of declarations (i.e., the name space)
• can be nested


Block-structured scopes (in C)

int i , j ;

int f ( int size )
{ char i , temp ;
  ...
  { double j ;
    ..
  }
  ...
  { char *j ;
    ...
  }
}

Nested procedures in Pascal

program Ex ;
var i , j : integer

function f ( size : integer ) : integer ;
var i , temp : char ;

  procedure g ;
  var j : real ;
  begin
    . . .
  end ;

  procedure h ;
  var j : ^char ;
  begin
    . . .
  end ;

begin (∗ f's body ∗)
  . . .
end ;

begin (∗ main program ∗)
  . . .
end .

The Pascal example shows a feature of Pascal which is not supported by C, namely nested declarations of functions or procedures. As far as scoping and the discussion at the current point in the lecture are concerned, that's not a big issue: concerning names for variables, both C and Pascal allow nested blocks, but for names representing functions or procedures, Pascal offers more freedom.

Block-structured scoping via stack-organized separate chaining

C code snippet


int i , j ;

int f ( int size )
{ char i , temp ;
  ...
  { double j ;
    ..
  }
  ...
  { char *j ;
    ...
  }
}

“Evolution” of the hash table

The 3 pictures (shown on the right-hand side of the slide version) correspond to three “points” inside the C program. The first one is after entering the scope of function f. Inside the body of the function (immediately after entering), the two local variables are available, and of course also the formal parameter size, which can be seen as a local variable as well. At that point, the global variable i of type int is no longer “visible” or accessible; any reference to i will refer to the local variable i at that point.

Upon entering the first nested local scope, a second variable j is entered (making the global variable j inaccessible). That situation is not shown in the pictures. Now, when leaving the mentioned scope, one way of dealing with the situation is that the additional second j of type double is removed from the hash table again (shortening the corresponding linked chain). What is shown is a situation inside the second nested scope with another variable j (now a char pointer). Since the first nested local scope has been left at that point, the corresponding j “has become history”, and the hash table of the third picture only contains the global j variable (which is inaccessible) and the now relevant second local j variable.
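A sketch of that behavior in C (assuming the separate-chaining table from above; all helper names are made up): every entry additionally records the nesting level of its declaration, a nested declaration shadows an outer one simply by sitting earlier in its chain, and on leaving a block all entries of that level are popped off again.

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

typedef struct entry {
    char         *name;
    int           level;             /* nesting level of the declaration */
    struct entry *next;
} entry;

static entry *bucket[NBUCKETS];      /* the chains; zero-initialized */
static int    current_level = 0;     /* level of the block we are currently in */

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

void enter_block(void) { current_level++; }

/* a declaration shadows outer ones by sitting earlier in its chain */
void declare(const char *name) {
    unsigned i = hash(name);
    entry *e = malloc(sizeof(entry));
    e->name  = strdup(name);
    e->level = current_level;
    e->next  = bucket[i];
    bucket[i] = e;
}

/* on block exit: pop all entries belonging to the level we are leaving */
void exit_block(void) {
    for (int i = 0; i < NBUCKETS; i++)
        while (bucket[i] != NULL && bucket[i]->level == current_level) {
            entry *e = bucket[i];
            bucket[i] = e->next;
            free(e->name);
            free(e);
        }
    current_level--;
}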

Using the syntax tree for lookup following (static links)

lookup (string n) {
  k = current, surrounding block
  do  // search for n in decl for block k ;
    k = k.sl    // one nesting level up
  until found or k == none
}

The notion of static link will be discussed later, in connection with the so-called run-time system and the run-time stack. There we go into more details, but the idea is the same as here: find a way to “locate” the relevant scope. If scopes are nested, connect them via some “parent pointer”, and that pointer is known as a static link (again, different names exist for that, unfortunately).

Alternative representation:

• arrangement different from 1 table with stack-organized external chaining
• each block with its own hash table
• standard hashing within each block


• static links to link the block levels ⇒ “tree-of-hashtables”
• AKA: sheaf-of-tables or chained symbol tables representation

Note that the top-most scope is at the right-hand side of the table, and the static-link alwayspoints to the (uniquely determined) surrounding scope.

One may more generally say: one symbol table per block, as this form of organization can generally be done for symbol-table data structures (where hash tables are just one of many possible data structures to implement look-up tables).

6.4 Block-structure, scoping, binding, name-space organization

Block-structured scoping with chained symbol tables

• remember the interface
• look-up: following the static link (as seen)3
• Enter a block
  – create new (empty) symbol table
  – set static link from there to the “old” (= previously current) one
  – set the current block to the newly created one
• at exit
  – move the current block one level up
  – note: no deletion of bindings, just made inaccessible
  – (a small code sketch of this interface follows below)
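A minimal sketch of that interface in C, with one table per block chained via static links (names are invented; per-block lookup is shown as a plain list instead of a hash table to keep it short):

#include <stdlib.h>
#include <string.h>

typedef struct binding {
    char           *name;
    struct binding *next;
} binding;

/* one symbol table per block, linked to the surrounding block */
typedef struct scope {
    binding      *bindings;     /* declarations of this block */
    struct scope *static_link;  /* the (uniquely determined) surrounding scope */
} scope;

static scope *current = NULL;

/* enter a block: create a new empty table, link it to the old current one */
void enter_block(void) {
    scope *s = malloc(sizeof(scope));
    s->bindings = NULL;
    s->static_link = current;
    current = s;
}

/* exit a block: just move one level up; bindings are not deleted,
   they merely become inaccessible */
void exit_block(void) { current = current->static_link; }

void declare(const char *name) {
    binding *b = malloc(sizeof(binding));
    b->name = strdup(name);
    b->next = current->bindings;
    current->bindings = b;
}

/* lookup: search the current block, then follow the static links outwards */
binding *lookup(const char *name) {
    for (scope *s = current; s != NULL; s = s->static_link)
        for (binding *b = s->bindings; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0)
                return b;
    return NULL;
}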

Lexical scoping & beyond

• block-structured lexical scoping: central in programming languages (ever since ALGOL60. . . )

• but: other scoping mechanisms exist (and exist side-by-side)
• example: C++
  – member functions declared inside a class
  – defined outside
• still: method supposed to be able to access names defined in the scope of the class definition (i.e., other members, e.g. using this)

3 The notion of static links will be encountered later again when dealing with run-time environments (and for analogous purposes: identifying scopes in “block-structured” languages).


C++ class and member function

class A {
  ... int f ( ) ; ...   // member function
}

A::f ( ) {}             // def. of f ``in'' A

Java analogon

class A {
  int f ( ) { ... } ;
  boolean b ;
  void h ( ) { ... } ;
}

Scope resolution in C++

• class name introduces a name for the scope4 (not only in C++)
• scope resolution operator ::
• allows to explicitly refer to a “scope”
• to implement
  – such flexibility,
  – also for remote access like a.f()
• declarations are kept separately for each block (e.g. one hash table per class, record, etc., appropriately chained up)

Same-level declarations

Same level

typedef int i
int i ;

• often forbidden (e.g. in C)
• insert: requires check (= lookup) first

4Besides that, class names themselves are subject to scoping themselves, of course . . .


Sequential vs. “collateral” declarations

Sequential in C

int i = 1 ;
void f (void)
{ int i = 2 , j = i+1 ,
  ...
}

Collateral in ocaml/ML/Lisp

let i = 1 ;;
let i = 2 and y = i+1 ;;

print_int ( y ) ;;

I think the name “collateral” is unfortunate. A better word, in my eyes, would be simultaneous(or parallel).

Recursive declarations/definitions

• for instance for functions/procedures
• also classes and their members

Direct recursion

int gcd ( int n , int m ) {
  if (m == 0) return n ;
  else return gcd ( m , n % m ) ;
}

Indirect recursion/mutual recursive def’s

void f (void) { ... g () ... }

void g (void) { ... f () ... }

Before treating the body, the parser must add gcd into the symbol table (similarly for the other example).

Mutual recursive definitions

void g (void) ;    /∗ function prototype decl. ∗/

void f (void) { ... g () ... }

void g (void) { ... f () ... }

• different solutions possible


• Pascal: forward declarations
• or: treat all function definitions (within a block or similar) as mutually recursive
• or: special grouping syntax

Example syntax-es for mutual recursion

ocaml

let rec f ( x : int ) : int =
  g (x+1)
and g ( x : int ) : int =
  f (x+1) ;;

Go

func f ( x int ) ( int ) {
  return g ( x ) + 1
}

func g ( x int ) ( int ) {
  return f ( x ) - 1
}

Static vs dynamic scope

• concentration so far on:
  – lexical scoping/block structure, static binding
  – some minor complications/adaptations (recursion, duplicate declarations, . . . )
• big variation: dynamic binding / dynamic scope
• for variables: static binding / lexical scoping is the norm
• however: cf. late-bound methods in OO

Static scoping in C

Code snippet

#include <stdio.h>

int i = 1 ;
void f (void) {
  printf ( "%d\n" , i ) ;
}

void main (void) {
  int i = 2 ;
  f ( ) ;
  return 0 ;
}

which value of i is printed then?


Dynamic binding example

1   void Y () {
2     int i ;
3     void P () {
4       int i ;
5       ... ;
6       Q () ;
7     }
8     void Q () {
9       ... ;
10      i = 5 ;    // which i is meant?
11    }
12    ... ;
13
14    P () ;
15    ... ;
16  }

for dynamic binding: the one from line 4

Static or dynamic?

TeX

\def\astring{a1}
\def\x{\astring}
\x
{
\def\astring{a2}
\x
}
\x
\bye

LaTeX

\documentclass{article}
\newcommand{\astring}{a1}
\newcommand{\x}{\astring}
\begin{document}
\x
{
\renewcommand{\astring}{a2}
\x
}
\x
\end{document}

emacs lisp (not Scheme)

(setq astring "a1")       ;; ``assignment''
(defun x () astring)      ;; define ``variable x''
(x)                       ;; read value
(let ((astring "a2"))
  (x))


Again, it's very easy to check by invoking TeX or LaTeX, or firing off emacs and evaluating the lisp snippet in a buffer, for instance. As for Scheme: Scheme is a Lisp dialect, actually the first (or at least the first significant and most prominent) one to adopt static or lexical scoping. Originally “Lisp” used dynamic binding. Lisp was way ahead of its time in some ways, actually revolutionary (higher-order functions, reflection, garbage collection); one should not forget that it was conceived (and implemented!) in the 50ies (at MIT). Now, the resources needed for Lisp stretched the hardware of the day. Note that the very earliest machines did not even have hardware support for stack pointers (Burroughs machines at the beginning of the 60ies were the first to pioneer that), which made even recursion (which uses a stack) a costly luxury. And Lisp supported higher-order functions from the start. It took some time (and conceptual and hardware advances) until a major lexically scoped variant of Lisp could establish itself (known as Scheme). Scheme also supports dynamic scoping (though it frowns upon it). More “classic” Lisp dialects (like Common Lisp) also support lexical scoping besides dynamic scoping. Emacs Lisp is one well-known Lisp dialect based on dynamic scoping, though as of emacs version 24, lexical scoping is also supported. It may or may not be a coincidence that the key person behind emacs is Richard Stallman, sometimes described as the “last true hacker” from the MIT school of hackers. McCarthy and Minsky were also at MIT (earlier pioneers than Stallman); McCarthy is the central figure behind Lisp and actually also coined the term AI (MIT was a focal point of early AI). For emacs, Stallman is central in kicking it off, hacking its initial versions (with others), mentoring it through many years and giving it a spiritual (or ideological?) background as part of a larger free software movement.

Static binding is not about “value”

• the “static” in static binding is about
  – binding to the declaration / memory location,
  – not about the value
• nested functions used in the example (Go)
• g declared inside f

package main
import ( "fmt" )

var f = func ( ) {
  var x = 0
  var g = func ( ) { fmt.Printf ( "x = %v" , x ) }
  x = x + 1
  {
    var x = 40    // local variable
    g ( )
    fmt.Printf ( "x = %v" , x )
  }
}
func main ( ) {
  f ( )
}

Static binding can become tricky

package main
import ( "fmt" )

var f = func ( ) ( func ( int ) int ) {
  var x = 40                      // local variable
  var g = func ( y int ) int {    // nested function
    return x + 1
  }
  x = x + 1                       // update x
  return g                        // function as return value
}

func main ( ) {
  var x = 0
  var h = f ( )
  fmt.Println ( x )
  var r = h ( 1 )
  fmt.Printf ( "r = %v" , r )
}

• example uses higher-order functions

As said, the example uses higher-order functions. In particular, the function f gives back some function, namely the function g, and not only that: function g is defined inside f, in particular, g is defined inside the scope of f. And finally, the nested function g refers to x, which is also defined inside f. Now the problem is that g (and with it the variables of f it refers to) lives longer than the activation of f itself. We come to that problem also later, when dealing with run-time environments. In many languages, one important part of the RTE is the run-time stack, or call stack. It turns out that in situations like the ones illustrated here, a stack is no longer good enough for providing lexical scoping.

6.5 Symbol tables as attributes in an AG

Nested lets in ocaml

let x = 2 and y = 3 in
  ( let x = x+2 and y =
      ( let z = 4 in x+y+z )
    in print_int (x+y) )

• simple grammar (using , for “collateral” = simultaneous declarations)

S → exp
exp → ( exp ) | exp + exp | id | num | let dec-list in exp
dec-list → dec-list , decl | decl
decl → id = exp

1. no identical names in the same let-block
2. used names must be declared
3. most-closely nested binding counts
4. sequential (non-simultaneous) declaration (≠ ocaml/ML/Haskell . . . )

let x = 2 , x = 3 in x + 1            (∗ no, duplicate ∗)

let x = 2 in x+y                      (∗ no, y unbound ∗)

let x = 2 in ( let x = 3 in x )       (∗ decl. with 3 counts ∗)

let x = 2 , y = x+1                   (∗ one after the other ∗)
in ( let x = x+y ,
         y = x+y
     in y )


Goal

Design an attribute grammar (using a symbol table) specifying those rules. Focus on: error attribute.

Attributes and ST interface

symbol            attributes    kind
exp               symtab        inherited
                  nestlevel     inherited
                  err           synthesized
dec-list, decl    intab         inherited
                  outtab        synthesized
                  nestlevel     inherited
id                name          injected by scanner

Symbol table functions

• insert(tab,name,lev): returns a new table• isin(tab,name): boolean check• lookup(tab,name): gives back level• emptytable: you have to start somewhere• errtab: error from declaration (but not stored as attribute)

As for the information stored and especially for the look-up function: realistically, more info would be stored as well, for instance types etc.
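A sketch of how such an interface could look in C, with insert returning a new table and leaving the old one untouched (so an inherited table can be extended for a nested scope without destroying the outer one). The function names follow the slide; the representation and everything else is invented for illustration.

#include <stdlib.h>
#include <string.h>

typedef struct table {
    const char         *name;
    int                 lev;     /* nesting level of the declaration */
    const struct table *rest;
} table;

static const table *emptytable = NULL;   /* you have to start somewhere */

/* insert(tab,name,lev): returns a NEW table; tab itself is left unchanged */
const table *insert(const table *tab, const char *name, int lev) {
    table *t = malloc(sizeof(table));
    t->name = strdup(name);
    t->lev  = lev;
    t->rest = tab;
    return t;
}

/* isin(tab,name): boolean check */
int isin(const table *tab, const char *name) {
    for (; tab != NULL; tab = tab->rest)
        if (strcmp(tab->name, name) == 0) return 1;
    return 0;
}

/* lookup(tab,name): gives back the level of the most recent declaration,
   or -1 if the name is unbound (realistically, more info would be stored) */
int lookup(const table *tab, const char *name) {
    for (; tab != NULL; tab = tab->rest)
        if (strcmp(tab->name, name) == 0) return tab->lev;
    return -1;
}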

Attribute grammar (1): expressions


• note: expressions in let's can introduce scopes themselves!
• interpretation of nesting level: expressions vs. declarations5

Attribute grammar (2): declarations

Final remarks concerning symbol tables

• strings as symbols, i.e., as keys in the ST: might be improved
• name spaces can get complex in modern languages,
• more than one “hierarchy”
  – lexical blocks
  – inheritance or similar
  – (nested) modules
• not all bindings (of course) can be solved at compile time: dynamic binding
• can e.g. variables and types have the same name (and still be distinguished)
• overloading (see next slide)

Final remarks: name resolution via overloading

• corresponds to “in abuse of notation” in textbooks
• disambiguation not by name, but differently, especially by “argument types” etc.
• variants:
  – method or function overloading
  – operator overloading
  – user defined?

5I would not have recommended doing it like that (though it works)


i + j      // integer addition
r + s      // real addition

void f ( int i )
void f ( int i , int j )
void f ( double r )


7 Types and type checking

What is it about? Learning targets of this chapter:
1. the concept of types
2. specific common types
3. type safety
4. type checking
5. polymorphism, subtyping and other complications

Contents:
7.1 Introduction
7.2 Various types and their representation
7.3 Equality of types
7.4 Type checking

7.1 Introduction

This chapter deals with “types”. Since the material is presented as part of the static analysis (orsemantic analysis) phase of the compiler, we are dealing mostly with static aspects of types (i.e.,static typing).

The notion of “type” is very broad and has many different aspects. The study of “types” is aresearch field in itself (“type theory”). In some way, types and type checking is the very essenceof semantic analysis, insofar that types can be very “expressive” and can be used to representvastly many different aspects of the behavior of a program. By “more expressive” I mean typesthat express much more complex properties or attributes than the ones standard programmers arefamiliar with: booleans, integers, structured types, etc. When increasing the “expressivity”, typesmight not only capture more complex situations (like types for higher-order functions), but alsoaspects, not normally connected with types, like for instance: bounds on memory usage, guaranteesof termination, assertions about secure information flow (like no information leakage), and manymore.

This year (2020), there seem to be some Haskell-disciples in the course. Haskell's type system is rather expressive even in its core version. Language extensions allow serious steps in the direction of type-level programming and programming with dependent types. This leads to systems where type inference and other questions become undecidable and the type system starts resembling a specification of the program behavior (expressing invariants etc.). Indeed, a type system fully embracing dependent types is a form of combining computation (for programming) and logic (for specification) in a common framework.

As a final random example: a language like Rust is known for its non-standard form of memory management based on the notion of ownership of a piece of data. Ownership tells who has the right to access the data when and how, and that's important to know, as simultaneous write access leads to trouble. Regulating ownership can be and has been formulated in corresponding “ownership type systems”, where the type expresses properties concerning ownership.

That should give a feeling that, with the notion of types being that general, the situation is a bit as with “attributes” and attribute grammars: “everything” may be an attribute, since an attribute is nothing else than a “property”. The same holds for types. With a loose interpretation like that, types may represent basically all kinds of concepts: like, when interested in property “A”, let's introduce the notion of “A”-types (with “A” standing for memory consumption, ownership, and what not). But still: studying type systems and their expressivity and application to programming languages seems a much broader and deeper (and more practical) field than the study of attribute grammars. By more practical, I mean: while attribute grammars certainly have useful applications, stretching them to new “non-standard” applications may be possible, but it's, well, stretching it.1 Type systems, on the other hand, span more easily from very simple and practical usages to very expressive and foundational logical systems.

In this lecture, we keep it more grounded and mostly deal with concrete, standard (i.e., not veryesoteric) types. Simple or “complicated” types, there are at least two aspects of a type. One is,what a user or programmer sees or is exposed to. The second one is the inside view of the compilerwriter. The user may be informed that it’s allowed to write x + y where x and y are both integers(carrying the type int), or both strings, in which case + represents string addition. Or perhapsthe language even allows that one variable contains a string and the other an integer, in whichcase the + is still string concatenation, where the integer valued operand has to be converted toits string representation. The compiler writer needs then to find representations in memory forthose data types (ultimately in binary form) that actually realize the operations described aboveon an abstract level. That means choosing an appropriate encoding, choosing the right amount ofmemory (long ints need more space than short ints, etc, perhaps even depending on the platform),and making sure that needed conversions (like from integers to string) actually are done in thecompiled code (most likely arranged statically). Of course, the programmer does not want to knowthose details, he typically could not care less, for instance, whether the machine architecture is“little-endian” or “big-endian” (see https://en.wikipedia.org/wiki/Endianness). Butthe compiler writer will have to care when writing the compiler itself to represent or encode whatthe programmer calls “an integer” or “a string”. So, apart from the more esoteric and advanced rolestypes play in programming languages, perhaps the most fundamental role is that of abstraction:to shield the programmer from the dirty details of the actual representation.

Types are a central abstraction for programmers.

Abstraction in the sense of hiding underlying representional details.2

The lecture will have some look at both aspects of type systems. One is the representationalaspect. That one is more felt in languages like C, which is closer to the operating system andto memory in hardware than languages that came later. Besides that, we will also more look attype system as specification of what is allowed at the programmer’s level (“is it allowed to do a +on an a value of integer type and of string type?”), i.e., how to specify a type system in aprogramming language independent from the question how to choose proper lower-level encodingsthat the abstraction specified in the type system.

General remarks and overview

• Goal here:
  – what are types?
  – static vs. dynamic typing
  – how to describe types syntactically?
  – how to represent and use types in a compiler?
• coverage of various types
  – basic types (often predefined/built-in)
  – type constructors
  – values of a type
  – type operators
  – representation at run-time
  – run-time tests and special problems (array, union, record, pointers)
• specification and implementation of type systems/type checkers
• advanced concepts

1 That's at least my slightly biased opinion.
2 Beside that practical representational aspect, types are also an abstraction in the sense that they can be viewed as the “set” of all the values of that given type. Like int represents the set of all integers. Both views are consistent, as all members of the “set” int are consistently represented in memory and consistently treated by functions operating on them. That “consistency” allows us as programmers to think of them as integers and forget about details of their representation, and it's the task of the compiler writer to reconcile those two views: the low-level encoding must maintain the high-level abstraction.

Why types?

• crucial, user-visible abstraction describing program behavior
• one view: type describes a set of (mostly related) values
• static typing: checking/enforcing a type discipline at compile time
• dynamic typing: same at run-time, mixtures possible
• completely untyped languages: very rare to non-existent, types were part of PLs from the start.

Milner’s dictum (“type safety”)

Well-typed programs cannot go wrong!

• strong typing:3 rigorously prevent “misuse” of data
• types useful for later phases and optimizations
• documentation and partial specification

In contrast to (standard) types: many other abstractions in SA (like the control-flow graph or data flow analysis and others) are not directly visible in the source code. However, in the light of the introductory remarks that “types” can capture a very broad spectrum of semantic properties of a language if one just makes the notion of type general enough (“ownership”, “memory consumption”), it should come as no surprise that one can capture data flow in appropriately complex type systems, as well. . .

Besides that: there are not really any truly untyped languages around, there is always somediscipline (beyond syntax) on what a programmer is allowed to do and what not. Probably theanarchistic recipe of “anything (syntactically correct) goes” tends to lead to disaster. Note that“dynamically typed” or “weakly typed” is not the same as “untyped”.

Types: in first approximation

Conceptually

• semantic view: set of values plus a set of corresponding operations
• syntactic view: notation to construct basic elements of the type (its values) plus “procedures” operating on them

3 Terminology rather fuzzy, and perhaps changed a bit over time.


• compiler implementor's view: data of the same type have the same underlying memory representation

further classification:

• built-in/predefined vs. user-defined types
• basic/base/elementary/primitive types vs. compound types
• type constructors: building more complex types from simpler ones
• reference vs. value types

7.2 Various types and their representation

Some typical base types

base types

  int     0, 1, . . .      +, −, ∗, /          integers
  real    5.05E4 . . .     +, -, *             real numbers
  bool    true, false      and, or (|), . . .  booleans
  char    'a'                                  characters
  . . .

• often HW support for some of those (including some of the op's)
• mostly: elements of int are not exactly mathematical integers, same for real
• often variations offered: int32, int64
• often implicit conversions and relations between basic types
  – which the type system has to specify/check for legality
  – which the compiler has to implement

Some compound types

compound types

  array[0..9] of real     a[i+1]
  list                    [], [1;2;3]      concat
  string                  "text"           concat . . .
  struct / record         r.x
  . . .

• mostly reference types
• when built in, special “easy syntax” (same for basic built-in types)
  – 4 + 5 as opposed to plus(4,5)
  – a[6] as opposed to array_access(a, 6) . . .
• parser/lexer aware of built-in types/operators (special precedences, associativity, etc.)
• cf. functionality “built-in/predefined” via libraries

Being a “conceptual” view means it's about the “interface”; it's an abstract view of how one can make use of members of a type. It is not about implementation details, like “integers are 2 byte words in such-and-such representation”. See also the notion of abstract data type on the next slide.


Abstract data types

• unit of data together with functions/procedures/operations . . . operating on them
• encapsulation + interface
• often: separation between exported and internal operations
  – for instance public, private . . .
  – or via separate interfaces

• (static) classes in Java: may be used/seen as ADTs, methods are then the “operations”

ADT begin
  integer i ;
  real x ;
  int proc total ( int a ) {
    return i ∗ x + a    // or: ``total = i ∗ x + a''
  }
end

Type constructors: building new types

• array type
• record type (also known as struct-types)
• union type
• pair/tuple type
• pointer type
  – explicit as in C
  – implicit distinction between reference and value types, hidden from programmers (e.g. Java)
• signatures (specifying methods / procedures / subroutines / functions) as type
• function type constructor, incl. higher-order types (in functional languages)
• (names of) classes and subclasses
• . . .

Basically all languages support building more complex types from the basic ones, and provide ways to use and check them. Sometimes it's not even very visible; for instance, one may already see strings as compound. For instance C, which takes a very implementation-centric view on types, explains strings as

one-dimensional array of characters terminated by a null character ’\0’

Of course, there is special syntax to build values of type string, writing "abc" as opposed tostring-cons(’a, string_cons(’b, ...)) or similar. . . This smooth support of workingwith strings may make them feel as if being primitive.

In the following we will have a look at a few composed types in programming languages. The Compila language of this year's oblig supports records but also “names” of records. We will also discuss the issue of “types as such” vs. “names of types” later (for instance in connection with the question of how to compare types: when are they equal or compatible, what about subtyping? etc.).

Arrays

Array type

array [< indextype >] of <component type>


• elements (arrays) = (finite) functions from index-type to component type
• allowed index-types:
  – non-negative (unsigned) integers?, from ... to ...?
  – other types?: enumerated types, characters
• things to keep in mind:
  – indexing outside the array bounds?
  – are the array bounds (statically) known to the compiler?
  – dynamic arrays (extensible at run-time)?

Integer-indexed arrays are typically a very efficient data structure, as they mirror the layout of standard random access memory and customary hardware.4 Indeed, contiguous random-access memory can be seen as one big array of “cells” or “words”, and standard hardware supports fast access to those cells by indirect addressing modes (like making use of an offset from a base address, possibly multiplied by a factor which represents the size of the entries). In the later chapters about code generation, we will look a bit into different addressing modes of machine instructions.
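As a small C illustration of the base-plus-offset view (standard C semantics, nothing course-specific): the index expression is just an offset from the base address, scaled by the element size.

#include <stdio.h>

int main(void) {
    double a[10];
    int i = 3;

    /* a[i] is by definition *(a + i): base address plus i elements,
       i.e. a byte offset of i * sizeof(double) from &a[0] */
    a[i] = 1.0;
    printf("%d\n", a[i] == *(a + i));                        /* prints 1 */
    printf("%lu\n",
           (unsigned long)((char *)&a[i] - (char *)&a[0]));  /* i * sizeof(double) */
    return 0;
}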

One and more-dimensional arrays

• one-dimensional: efficiently implementable in standard hardware (relative memory addressing, known offset)
• two or more dimensions

array [ 1..4 ] of array [ 1..3 ] of real
array [ 1..4 , 1..3 ] of real

• one can see it as “array of arrays” (Java), an array is typically a reference type
• conceptually “two-dimensional”, linear layout in memory (language dependent)
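For the two-dimensional case, a sketch of the usual linear layout in C (row-major; other languages, e.g. Fortran, lay arrays out column-major): in a 4 × 3 array, element a[i][j] sits at linear offset i*3 + j.

#include <stdio.h>

int main(void) {
    int a[4][3];
    int *flat = &a[0][0];    /* view the same memory as one linear array */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            a[i][j] = 10 * i + j;

    /* row-major: a[i][j] is at linear index i*3 + j */
    printf("%d %d\n", a[2][1], flat[2 * 3 + 1]);   /* prints the same value twice */
    return 0;
}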

Records (“structs”)

struct {
  real r ;
  int i ;
}

• values: “labelled tuples” (real × int)
• constructing elements, e.g.

struct point { int x ; int y ; } ;
struct point pt = { 300 , 42 } ;

• access (read or update): dot-notation x.i
• implementation: linear memory layout given by the (types of the) attributes
• attributes accessible by statically fixed offsets

4There exists unconventional hardware memory architectures which are not accessed via addresses, likecontent-addressable memory. Those don’t resemble “arrays”. They are a specialist niche, but haveapplications.


• fast access
• cf. objects as in Java

Structs in C

The following is really not important, just some side remarks on esoteric, i.e., weird, aspects ofstructs in C: The definition, declaration etc. of struct types and structs in C is slightly confusing.

struct foo {    // foo is called a ``tag''
  real r ;
  int i

The foo is a tag, which is almost like a type, but not quite, at least as far as C is concerned (i.e., the definition of C distinguishes the two, even if it is not so clear why). Technically, for instance, the name space for tags is different from that for types. Ignoring details, one can make use of the tag almost as if it were a type; for instance,

struct foo b

declares the structure b to adhere to the struct type tagged by foo. Since foo is not a proper type, what is illegal is a declaration such as foo b. In general, the question whether one should use typedef in combination with struct tags (or only typedef, leaving out the tag) seems a matter of debate. In general, the separation between tags and types (resp. type names) is a messy, ill-considered design. One should do better these days.
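For illustration, one common idiom that side-steps the tag vs. type-name confusion (standard C; the names foo and foo_t are made up):

/* introduce a proper type name for the struct in one go;
   the tag (foo) may even be omitted if it is not needed, e.g., for recursion */
typedef struct foo {
    double r;     /* the "real" component */
    int    i;
} foo_t;

foo_t b;          /* now legal without writing "struct foo b" */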

Tuple/product types

• T1 × T2 (or in ascii T_1 * T_2)
• elements are tuples: for instance: (1, "text") is element of int * string
• generalization to n-tuples:

  value                    type
  (1, "text", true)        int * string * bool
  (1, ("text", true))      int * (string * bool)

• structs can be seen as “labeled tuples”, resp. tuples as “anonymous structs”
• tuple types: common in functional languages,
• in C/Java-like languages: n-ary tuple types often only implicit as input types for procedures/methods (part of the “signature”)

The two “triples” and their types touch upon an issue discussed later, namely when two types are equal (and, related to that, whether or not the corresponding values (here the “triples”) are equal).

Union types (C-style again)

union {
  real r ;
  int i
}


• related to sum types (outside C)
• (more or less) represents disjoint union of values of “participating” types
• access in C (confusingly enough): dot-notation u.i

Union types in C and type safety

• union types in C: bad example for (safe) type disciplines, as it's simply type-unsafe, basically an unsafe hack . . .

Union type (in C):

• nothing much more than a directive to allocate enough memory to hold largest member ofthe union.

• in the example: real takes more space than int

• implementor's (= low level) focus and memory allocation, not “proper usage focus” or assuring strong typing
⇒ bad example of modern use of types
• better (type-safe) implementations known since
⇒ variant record (“tagged”/“discriminated” union) or even inductive data types

Inductive types are basically: union types done right, plus the possibility of “recursion”. On the next slide, we discuss variant records from Pascal. They try to remedy (partly) the deficiency of C-style unions by adding some “discriminator” as an additional component. This possibility for enhanced security goes only half way; it's still possible to subvert the type system. Inductive data types also allow recursive definitions, and can be used for pattern matching, an elegant form of “case-switching”.

Variant records from Pascal

record case isReal : boolean of
  true :  ( r : real ) ;
  false : ( i : integer ) ;

• “variant record”
• non-overlapping memory layout5
• programmer responsible to set and check the “discriminator” themselves
• enforcing type-safety-wise: not really an improvement :-(
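In C itself, the “discriminated union” idea is typically emulated by hand with a struct carrying an explicit tag next to the union; as with Pascal's variant records, nothing forces the programmer to check the tag, so it is a convention, not enforced type safety. A sketch (names invented, not from the lecture):

#include <stdio.h>

typedef struct {
    enum { IS_REAL, IS_INT } tag;   /* the discriminator, set by hand */
    union {
        double r;
        int    i;
    } u;
} number;

double as_double(number n) {
    /* the programmer must remember to branch on the tag */
    return (n.tag == IS_REAL) ? n.u.r : (double)n.u.i;
}

int main(void) {
    number n = { .tag = IS_INT, .u.i = 42 };
    printf("%f\n", as_double(n));
    return 0;
}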

Inductive types in ML and similar

• type-safe and powerful
• allows pattern matching

IsReal of real | IsInteger of int

• allows recursive definitions ⇒ inductive data types:

5 Again, that's an implementor-centric view, not a user-centric one.


type int_bintree =
    Node of int * int_bintree * int_bintree
  | Nil

• Node, Nil, IsReal: constructors (cf. languages like Java)
• constructors used as discriminators in “union” types

type exp =
    Plus of exp * exp
  | Minus of exp * exp
  | Number of int
  | Var of string

Recursive data types in C

does not work

struct intBST {
  int val ;
  int isNull ;
  struct intBST left , right ;
}

“indirect” recursion

struct intBST {
  int val ;
  struct intBST *left , *right ;
} ;

In Java: references implicit

class BSTnode {
  int val ;
  BSTnode left , right ;

• note: implementation in ML: also uses “pointers” (but hidden from the user)
• no nil-pointers in ML (and Nil is not a nil-pointer, it's a constructor)

Pointer types

• pointer type: notation in C: int*
• “ * ”: can be seen as type constructor

int* p ;

• random other languages: ^integer in Pascal, int ref in ML
• value: address of (or reference/pointer to) values of the underlying type


• operations: dereferencing and determining the address of a data item (and C allows “pointer arithmetic”)

var a : ^integer    (∗ pointer to an integer ∗)
var b : integer
. . .
a := &i             (∗ i an int var ∗)
                    (∗ a := new integer ok too ∗)
b := ^a + b

Implicit dereferencing

• many languages: more or less hide existence of pointers
• cf. reference vs. value types; often: automatic/implicit dereferencing

C r ;
C r = new C ( ) ;

• “sloppily” speaking: “r is an object (which is an instance of class C / which is of type C)”,
• slightly more precise: “variable r contains an object . . . ”
• precise: “variable r will contain a reference to an object”
• r.field corresponds to something like (*r).field, similar in Simula

Programming with pointers

• “popular” source of errors
• test for non-null-ness often required
• explicit pointers: can lead to problems in block-structured languages (when handled non-expertly)
• watch out for parameter passing
• aliasing
• null-pointers: “the billion-dollar-mistake”
• take care of concurrency

Null pointers are generally attributed (actually including self-attributed) to Tony Hoare, famous for many landmark contributions. He himself refers to the introduction of null pointers or null references (1965 for ALGOL-W) as his billion dollar mistake. See also here; the video seems no longer to work, but there are some notes or a rudimentary transcript. One can also consult Hoare's Turing Award lecture (1980), where he talks about similar topics. The text of the lecture is also available on the net. In the lecture, he interestingly mentions as the first and foremost design principle for the design of ALGOL resp. the corresponding compiler: security. So it's not that the intention was to say “to hell with security, speed comes first”. From the text, though, it seems that he speaks about “security” of the compiler itself, in that it should never crash (= “. . . no core dumps should ever be necessary”).

Function variables

The following shows problems in situations where one can refer to more “powerful” things than “dead data”. So far, the data was all passive, but of course functions or procedures also need to be stored somewhere; ultimately they are also just blocks of bits. Often, in traditional layouts, one thinks of function code residing in one portion of the memory and data in another (though in a shared address space, in the traditional von Neumann architecture; in the so-called Harvard architecture, the separation would be stricter). Anyway, there is no reason in principle why variables should not refer to functions as well. That goes in the direction of higher-order functions, where the distinction between data and code is completely blurred.

The example here is not based on higher-order programming, but uses just Pascal. What one can do in Pascal (as opposed to C) is nested function declarations and “returning” variables “containing” functions (referring to them). The problem illustrated here (“escaping”) is something that one also has to deal with for higher-order functions. In a way, the lesson from this example is: Pascal had this facility, but somehow did not deal with it properly. Dealing properly with it would have required closures, but Pascal did not do that.

program Funcvar ;
var pv : Procedure ( x : integer ) ;   (∗ procedure var ∗)

Procedure Q ( ) ;
var
  a : integer ;
  Procedure P ( i : integer ) ;
  begin
    a := a+i ;                         (∗ a def'ed outside ∗)
  end ;
begin
  pv := @P ;                           (∗ ``return'' P (as side effect) ∗)
end ;                                  (∗ "@" dependent on dialect ∗)
begin                                  (∗ here: free Pascal ∗)
  Q ( ) ;
  pv ( 1 ) ;
end .

Function variables and nested scopes

• tricky part here: nested scope + function definition escaping surrounding function/scope.
• here: inner procedure “returned” via assignment to function variable
• stack discipline of dynamic memory management?
• related also: functions allowed as return value?
  – Pascal: not directly possible (unless one “returns” them via function-typed reference variables like here)
  – C: possible, but nested function definitions not allowed
• combination of nested function definitions and functions as official return values (and arguments): higher-order functions
• Note: functions as arguments less problematic than as return values.

For the sake of the lecture: Let’s not distinguish conceptually between functions and procedures.But in Pascal, a procedure does not return a value, functions do.

Function signatures

• define the “header” (also “signature”) of a function6

• in the discussion we mostly don't distinguish: functions, procedures, methods, subroutines.
• functional type (independent of the name f): int → int

6 Actually, an identifier of the function is mentioned as well.


Modula-2

var f : procedure ( integer ) : integer ;

C

int (∗ f ) ( int )

• values: all functions (procedures . . . ) with the given signature
• problems with block structure and free use of procedure variables.

Escaping

1   program Funcvar ;
2   var pv : Procedure ( x : integer ) ;  (∗ procedure var ∗)
3
4   Procedure Q ( ) ;
5   var
6     a : integer ;
7     Procedure P ( i : integer ) ;
8     begin
9       a := a+i ;                        (∗ a def'ed outside ∗)
10    end ;
11  begin
12    pv := @P ;                          (∗ ``return'' P (as side effect) ∗)
13  end ;                                 (∗ "@" dependent on dialect ∗)
14  begin                                 (∗ here: free Pascal ∗)
15    Q ( ) ;
16    pv ( 1 ) ;
17  end .

• at the end of line 15: variable a no longer exists
• possible safe usage: only assign to such variables (here pv) a new value (= function) at the same block level the variable is declared

As mentioned before, functions as parameters are less problematic than returning them (as with function variables); the reason is that the stack discipline is, in that case, still doable.

Classes and subclasses

Parent class

class A {
  int i ;
  void f ( ) { ... }
}


Subclass B

class B extends A {
  int i ;
  void f ( ) { ... }
}

Subclass C

class C extends A {
  int i ;
  void f ( ) { ... }
}

• classes resemble records, and subclasses variant types, but additionally
  – visibility: local methods possible (besides fields)
  – subclasses
  – objects mostly created dynamically, no references into the stack
  – subtyping and polymorphism (subtype polymorphism): a reference typed by A can also point to B or C objects
• special problems: not really many, nil-pointer still possible

The three classes from above illustrate subclassing (and, in many object-oriented languages, connected to that, subtyping). Note that the classes are also names of types. What is also illustrated is overriding, as far as f is concerned. Inheritance is actually not illustrated, insofar as f, the only method involved, is overridden, not inherited, in both B and C. The methods f and the instance variables i are treated differently as far as binding is concerned. That will be discussed next. In the slides we use rA to refer to a variable of static type/class A.

Access to object members: late binding

• notation rA.i or rA.f()
• dynamic binding, late binding, virtual access, dynamic dispatch . . . : all mean roughly the same
• central mechanism in many OO languages, in connection with inheritance

Virtual access rA.f() (methods)

the “deepest” f in the run-time class of the object that rA points to

• remember: “most-closely nested” access of variables in nested lexical blocks
• Java:
  – methods “in” objects are only dynamically bound (but there are class methods too)
  – instance variables not, neither static methods “in” classes.


Example: fields and methods

public class Shadow {
  public static void main ( String [ ] args ) {
    C2 c2 = new C2 ( ) ;
    c2.n ( ) ;
  }
}

class C1 {
  String s = "C1" ;
  void m ( ) { System.out.print ( this.s ) ; }
}

class C2 extends C1 {
  String s = "C2" ;
  void n ( ) { this.m ( ) ; }
}

The code is compilable Java code and can thus be tested. It is supposed to illustrate the discusseddifference in the treatment of fields and methods, as far as binding is concerned. While themechanism for methods (which are late or dynamically bound) is called overriding, the similar(but of course not same) situation for fields (which are statically bound) is called shadowing. Onemay also see it like that: fields are treated as if they were static methods.

Diverse notions

• Overloading
  – common for (at least) standard, built-in operations
  – also possible for user defined functions/methods . . .
  – disambiguation not by name, but differently, especially by “argument types” etc.
  – “ad-hoc” polymorphism
  – implementation:
    ∗ put types of parameters as “part” of the name
    ∗ look-up gives back a set of alternatives
• type-conversions: can be problematic in connection with overloading
• (generic) polymorphism

swap(var x,y: anytype)

7.3 Equality of types

Classes as types

• classes = types? Not so fast
• more precise view:
  – design decision in Java and similar languages (but not all / even not all class-based OOLs): class names are used in the role of (names of) types.
• other roles of classes (in class-based OOLs)
  – generator of objects (via constructor, again with the same name)7
  – containing code that implements the instances

7 Not for Java's static classes etc., obviously.

C x = new C()

Example with interfaces

interface I1 { int m ( int x ) ; }
interface I2 { int m ( int x ) ; }
class C1 implements I1 {
  public int m ( int y ) { return y++; }
}
class C2 implements I2 {
  public int m ( int y ) { return y++; }
}

public class Noduck1 {
  public static void main ( String [ ] arg ) {
    I1 x1 = new C1 ( ) ;   // I2 not possible
    I2 x2 = new C2 ( ) ;
    x1 = x2 ;              // ???
  }
}

Analogous when using classes in their roles as types

When are 2 types “equal”?

• type equivalence
• surprisingly many different answers possible
• implementor's focus (deprecated): type int and short are equal, because they “are” both 2 byte
• type checker must often decide such equivalences
• related to a more fundamental question: what's a type?

Example: pairs of integers

type pair_of_ints = int * int ;;
let x : pair_of_ints = (1 , 4) ;;

Questions

• Is “the” type of (values of) x pair_of_ints, or
• the product type int * int, or
• both, as they are equal, i.e., pair_of_ints is an abbreviation of the product type (type synonym)?

For this particular language (ocaml), the piece of code is correct: the pair (1,4) is of type int* int and of type pair_of_ints.


Structural vs. nominal equality

a, b

var a , b : record
              int i ;
              double d
            end

c

var c : record
          int i ;
          double d
        end

typedef

typedef idRecord : record
                     int i ;
                     double d
                   end

var d : idRecord ;
var e : idRecord ;

what's possible?

a := c ;
a := d ;
a := b ;
d := e ;

Types in the AST

• types are part of the syntax, as well
• represent: either in a separate symbol table, or as part of the AST

Record type

record
  x : pointer to real ;
  y : array [10] of int
end


Procedure header

proc ( bool ,
       union a : real ; b : char end ,
       int ) : void
end

Structured types without names

var-decls → var-decls ; var-decl | var-decl
var-decl → id : type-exp
type-exp → simple-type | structured-type
simple-type → int | bool | real | char | void
structured-type → array [ num ] : type-exp
                | record var-decls end
                | union var-decls end
                | pointerto type-exp
                | proc ( type-exps ) type-exp
type-exps → type-exps , type-exp | type-exp


Structural equality

Types with names

var-decls → var-decls ; var-decl | var-decl
var-decl → id : simple-type-exp
type-decls → type-decls ; type-decl | type-decl
type-decl → id = type-exp
type-exp → simple-type-exp | structured-type
simple-type-exp → simple-type | id          (identifiers)
simple-type → int | bool | real | char | void
structured-type → array [ num ] : simple-type-exp
                | record var-decls end
                | union var-decls end
                | pointerto simple-type-exp
                | proc ( type-exps ) simple-type-exp
type-exps → type-exps , simple-type-exp
          | simple-type-exp

Name equality

• all types have “names”, and two types are equal iff their names are equal
• type equality checking: obviously simpler


• of course: type names may have scopes. . . .

Type aliases

• languages with type aliases (type synonyms): C, Pascal, ML . . .
• often very convenient (type Coordinate = float * float)
• light-weight mechanism

type alias; make t1 known also under name t2

t2 = t1    // t2 is the ``same type''.

• also here: different choices wrt. type equality

Type aliases: different choices

Alias, for simple types

t1 = int ;
t2 = int ;

• often: t1 and t2 are the “same” type

Alias of structured types

t1 = array [10] of int ;
t2 = array [10] of int ;
t3 = t2

• mostly t3 ≠ t1 ≠ t2

The upshot of the “example” is: even within one language, it may be that different rules apply when it comes to different kinds of types. Perhaps for synonyms of basic types (like int) the equality “carries over”, but for more complex ones (like arrays in the illustration) it does not.


7.4 Type checking

Type checking of expressions (and statements)

• types of subexpressions must “fit” the expected types the constructs can operate on
• type checking: top-down and bottom-up task ⇒ synthesized attributes, when using AGs
• Here: using an attribute grammar specification of the type checker
  – type checking conceptually done while parsing (as actions of the parser)
  – more common: type checker operates on the AST after the parser has done its job
• type system vs. type checker
  – type system: specification of the rules governing the use of types in a language, type discipline
  – type checker: algorithmic formulation of the type system (resp. implementation thereof)

Synthesized attributes

When drawing the parallel that type checking is a bottom-up (“synthesized”) task, that is only half of the picture. The slide focuses on type checking of expressions (and statements). When it comes to declarations (i.e., declaring a type for a variable, for instance), that part corresponds more to “inherited” attributes. Remember that one standard way of implementing the association of variables (“symbols”) with (here) types (which can be seen as an “attribute”) are symbol tables.

Overloading

In case of (operator) overloading: that may complicate the picture slightly. Operators are selecteddepending on the type of the subexpressions. There will be some remarks concerning overloadinglater.

As said on the slides, the type checker mostly nowadays would work after the parser is finished,that means on the abstract syntax tree. One can, however, use grammars as specification of thatabstract syntax tree as well, i.e., as a “second” grammar besides the grammar for concrete parsing,and that’s then the grammar the type checker works on.

Grammar for statements and expressions

program → var-decls ; stmts
var-decls → var-decls ; var-decl | var-decl
var-decl → id : type-exp
type-exp → int | bool | array [ num ] : type-exp
stmts → stmts ; stmt | stmt
stmt → if exp then stmt | id := exp
exp → exp + exp | exp or exp | exp [ exp ]


Type checking as semantic rules

More “modern” presentation

• representation as derivation rules
• Γ: notation for the symbol table
  – Γ(x): look-up
  – Γ, x : T : insert
• more compact representation
• one reason: “errors” left implicit.

The following formalizes basically the same type system as the one before with attribute grammars. It uses a style of representation which borrows from “logics”, capturing the type system as a set of derivation rules. It's a form of presentation often employed when specifying type systems of a complex nature. It's not a coincidence that such presentations resemble logical derivations. There are deep connections between (mostly intuitionistic or constructive) logics and type systems, but that goes beyond this lecture.

The rules are to be read as follows: There are premises (above the horizontal line) and one conclusion (below the horizontal line). The derivable “assertions” are of the form Γ ⊢ p : T (those are sometimes also called judgements), and they are to be read as follows: given the content of Γ as assumptions or hypotheses, the program p is of type T. So, the rules specify how one can derive such judgements from other judgements. That may directly be translated into an algorithm, or may not be usable as an algorithm directly, depending on the way the rules are formulated. Typically, when the language and the type system are complex, one may specify well-typedness in such a manner without the rules being immediately translatable into a type checker, or maybe not at all, insofar as one may have specified an undecidable typing relation. A problematic thing may be polymorphism. A derivation system simply says: p is of a type T if, with the given rules, one can derive the corresponding judgment, i.e., if there exists a derivation. It does not per se require that the derivation is unique, or that p may not have other types (in which case the type system is polymorphic). But that's fine; it's not directly an algorithm, it may be seen as a specification.

Note: the way we presented the attribute grammars, we can't allow ourselves such a relaxed attitude, being happy if there is one solution among different possible ones. Attribute grammars require one definite solution; no non-determinism, cycles, or undefined situations are allowed. That (among other reasons) makes them often less straightforward to use for specifying a type system. One aspect where that is also visible: in the attribute grammar, we explicitly had to specify (in a not too elegant way) error situations. The rules here don't do that. For instance, in the treatment of the conditionals, it's required that the expression is a boolean. If it should be the case that it's not a boolean, there is no rule that covers that situation, which means the well-typedness judgment for a program containing such a situation is not derivable. Which means it's not well-typed and thereby contains a type error.

A concrete type checker would have to produce a meaningful type error message in that alternative scenario, but that's suppressed. The core of the type system focuses on the positive cases, leaving the type errors implicit and leaving it up to the implementor to figure out how to deal with uncovered situations. Similar relaxedness applies to rules that would include non-determinism: the implementor has to figure out how to deal with it, i.e., how to turn the specification into an algorithm. The concrete type system here is so simple (monomorphic) that the rules are basically an algorithm already.

Type checking (expressions)

TE-Id:
      Γ(x) = T
    --------------
      Γ ⊢ x : T

TE-True:
    ------------------
     Γ ⊢ true : bool

TE-False:
    -------------------
     Γ ⊢ false : bool

TE-Num:
    ---------------
     Γ ⊢ n : int

TE-Array:
     Γ ⊢ exp2 : array_of T      Γ ⊢ exp3 : int
    ---------------------------------------------
     Γ ⊢ exp2 [ exp3 ] : T

TE-Or:
     Γ ⊢ exp1 : bool      Γ ⊢ exp2 : bool
    ----------------------------------------
     Γ ⊢ exp1 or exp2 : bool

TE-Plus:
     Γ ⊢ exp1 : int      Γ ⊢ exp2 : int
    --------------------------------------
     Γ ⊢ exp1 + exp2 : int


Declarations and statements

TD-Int:
     Γ, x : int ⊢ rest : ok
    ----------------------------
     Γ ⊢ x : int ; rest : ok

TD-Bool:
     Γ, x : bool ⊢ rest : ok
    -----------------------------
     Γ ⊢ x : bool ; rest : ok

TD-Array:
     Γ ⊢ num : int      Γ(type-exp) = T      Γ, x : array num of T ⊢ rest : ok
    ------------------------------------------------------------------------------
     Γ ⊢ x : array [ num ] : type-exp ; rest : ok

TS-Assign:
     Γ ⊢ x : T      Γ ⊢ exp : T
    -------------------------------
     Γ ⊢ x := exp : ok

TS-If:
     Γ ⊢ exp : bool      Γ ⊢ stmt : ok
    -------------------------------------
     Γ ⊢ if exp then stmt : ok

TS-Seq:
     Γ ⊢ stmt1 : ok      Γ ⊢ stmt2 : ok
    ---------------------------------------
     Γ ⊢ stmt1 ; stmt2 : ok
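To connect the rules to an implementation: a compact C sketch of a checker for the expression rules above, working on an invented AST representation and a symbol table Γ (all names and types are made up for illustration; the error handling is exactly the part the rules leave implicit).

#include <stddef.h>
#include <string.h>

/* types of the little language */
typedef enum { T_INT, T_BOOL, T_ARRAY_INT, T_ARRAY_BOOL, T_ERROR } type;

/* AST for expressions: id, num, true/false, e1+e2, e1 or e2, e1[e2] */
typedef enum { E_ID, E_NUM, E_TRUE, E_FALSE, E_PLUS, E_OR, E_INDEX } etag;

typedef struct exp {
    etag tag;
    const char *name;              /* for E_ID */
    struct exp *left, *right;      /* for the binary forms */
} exp;

/* Gamma: the symbol table, here a simple association list */
typedef struct env {
    const char *name;
    type t;
    struct env *rest;
} env;

static type lookup(const env *g, const char *x) {      /* rule TE-Id */
    for (; g != NULL; g = g->rest)
        if (strcmp(g->name, x) == 0) return g->t;
    return T_ERROR;                                     /* x not declared */
}

/* element type of an array type, used by rule TE-Array */
static type elem(type t) {
    if (t == T_ARRAY_INT)  return T_INT;
    if (t == T_ARRAY_BOOL) return T_BOOL;
    return T_ERROR;
}

type check(const env *g, const exp *e) {
    switch (e->tag) {
    case E_ID:    return lookup(g, e->name);            /* TE-Id   */
    case E_NUM:   return T_INT;                         /* TE-Num  */
    case E_TRUE:
    case E_FALSE: return T_BOOL;                        /* TE-True, TE-False */
    case E_PLUS:                                        /* TE-Plus */
        return (check(g, e->left) == T_INT &&
                check(g, e->right) == T_INT) ? T_INT : T_ERROR;
    case E_OR:                                          /* TE-Or   */
        return (check(g, e->left) == T_BOOL &&
                check(g, e->right) == T_BOOL) ? T_BOOL : T_ERROR;
    case E_INDEX:                                       /* TE-Array */
        return (check(g, e->right) == T_INT)
                 ? elem(check(g, e->left)) : T_ERROR;
    }
    return T_ERROR;
}

The T_ERROR result plays the role of the “uncovered situations” mentioned above: where no rule applies, the checker returns an error type instead of getting stuck.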


8 Run-time environments

What is it about? Learning targets of this chapter:
1. memory management
2. run-time environment
3. run-time stack
4. stack frames and their layout
5. heap

Contents:
8.1 Intro
8.2 Different layouts
8.3 Static layout
8.4 Stack-based runtime environments
8.5 Stack-based RTE with nested procedures
8.6 Functions as parameters
8.7 Parameter passing
8.8 Virtual methods in OO
8.9 Garbage collection

8.1 Intro

The chapter covers different aspects of the run-time environment of a language. The RTE refers to the design, organization and implementation of basically how to arrange the memory and how to access it at run-time. It's basically there to maintain the abstractions offered by the implemented programming language: the language speaks about variables and scopes, but ultimately, when running, the data is words or sequences of bits somewhere in the memory, and the data must be addressed adequately. “Abstractions” that need to be taken care of (i.e., code must be generated for that) include variables inside scopes, static and dynamic memory allocation, parameter passing, and garbage collection. The most important control abstraction in languages is that of a “procedure”. Connected to that is the run-time stack.


Static & dynamic memory layout at runtime

[Figure "Memory": code area | global/static area | stack | free space | heap]

typical memory layout for languages (as nowadays basically all) with

• static memory
• dynamic memory:
  – stack
  – heap

The picture represents schematically a typical layout of the memory associated with one (single-threaded) program under execution. At the highest level, there is a separation between the "control" and the "data" of the program. The "control" of a program is the program code itself, in compiled form of course, i.e., the machine code. The rest is the "data" the code operates on. Often, a strict separation between the two parts is enforced, even with the help of the hardware and/or the operating system. In principle, of course, the machine code is ultimately also "just bits", so conceptually the running program could modify the code section as well, leading to "self-modifying" code. That's seen as a no-no, and, as said, measures are taken so that this does not happen. The generated code is not only kept immutable, it's also treated mostly as static (for instance as indicated in the picture): the compiler generates the code and decides how to arrange the different parts of the code, i.e., decides which code for which function comes where. Typically, as indicated in the picture, all code is grouped together into one big adjacent block of memory, which is called the code area.

The above discussion about the code area mentions that the control part of a program is structured into procedures (or functions, methods, subroutines . . . ; generally one may use the term callable unit). That's a reminder that perhaps the single most important abstraction (as far as the control flow goes) of all but the lowest-level languages is function abstraction: the ability to build "callable units" that can be reused at various points in a program, in different contexts, and with different arguments. Of course they may be reused not just at various points in one compiled program, but by different programs (maybe even at the same time, in a multi-process environment). A collection of such callable units, arranged coherently and in a proper manner, is, of course, a library.

The static placement of callable units in the code segment may remind us of functions as an abstraction and programming mechanism, but it's not all that's needed to actually provide, i.e., implement, that mechanism. At run-time, making use of a procedure means calling it and, when the procedure's code has executed to completion, returning from it. Returning means that control continues at the point where the call originated (maybe not exactly at that point, but "immediately afterwards"). This call-and-return behavior is at the core of realizing the procedure abstraction. Calling a procedure can be seen as a jump (JMP) and likewise the return is nothing else than executing an according jump instruction. Executing a jump does nothing else than setting the program counter to the address given as argument of the instruction (which in the typical arrangement from the picture is supposed to be an address in the code segment). Jumps are therefore rather simple


things; in particular, they are unaware of the intended call-return discipline. As a side remark: the platform may offer variations of the plain jump instruction (like jump-to-subroutine and return-from-subroutine, JTS and RTS or similar). These offer more "functionality" that helps realize the procedure call-return discipline, but ultimately they are nothing else than a slightly more fancy form of jump, and the basic story remains: on top of hardware-supported jumps, one has to arrange steps that, at run-time, realize the call and return behavior. That needs to involve the data area of the memory (since the code area is immutable). At the very least: a return from a procedure needs to know where to return to (since it's just a jump). So, when calling a function, the run-time system must arrange to remember where to return to (and then, when the time comes to actually return, look up that return address and use it for the jump back). In general, in all but the simplest languages, calls can be nested, i.e., a function being called can in turn call another function. In that nested situation procedures are executed in LIFO fashion: the procedure called last is returned from first. That means we need to arrange the remembered return addresses, one for each procedure call, in the form of a stack. The run-time stack is one key ingredient of the run-time system for many languages. It's part of the dynamic portion of the data memory and is separated in the picture from the other dynamic memory part, the heap, by a gulf of unused memory. In such an arrangement, the stack could grow "from above" and the heap "from below" (other arrangements are of course possible, for instance not having heap and stack compete for the same dynamic space, but each one living with an upper bound of its own).

So far we have discussed only the bare bones of the run-time environment to realize the procedure abstraction (the heap may be discussed later): in all but the very simplest settings, we need to arrange to maintain a stack for return addresses and manipulate the stack properly at run-time. If we had a trivial language, where function calls cannot be nested, we could do without a stack (or have a stack of maximal length 1, which is not much of a stack). In a setting without recursion (which we discuss also later), similar simplifications are possible, and one could do without an official stack (though the call/return would still be executed under a LIFO discipline, of course).

But besides that bare-bones return-address stack, the procedure abstraction has more to offer to the programmer than arranging a call/return execution of the control. What has been left out of the picture, which concentrated on control so far, is the treatment of data, in particular procedure-local data; so the question is how to realize at run-time the scoping rules that govern local data in the face of procedure calls. Related to that is the issue of procedure parameters and parameter passing. A procedure may have its own local data, but also receives data as arguments upon being called. Indeed, the real power of the procedure abstraction relies not just on code (control) being available for repeated execution; it owes its power in equal parts to the fact that it can be executed on different arguments. Just relying on global variables and on the fact that calling a function in different contexts or situations gives the procedure different states for some global values provides flexibility, but it's an undignified attempt to achieve something like parameter passing. All modern languages support syntax that allows the user to be explicit about what is considered the input of a procedure, its formal parameters. And again, arrangements have to be made such that, at run-time, the parameter passing is done properly. We will discuss different parameter-passing mechanisms later (the main ones being call-by-value, call-by-reference, and call-by-name, as well as some bastard schemes of lesser importance). Furthermore, when calling a procedure, the body may contain variables which are not local, but refer to variables defined and given values outside of the procedure (and without officially being passed as parameters). Also that needs to be arranged, and the arrangement varies depending on the scoping rules of the language (static vs. dynamic binding).

Anyway, the upshot of all of this is: we need a stack that contains more than just the return addresses; proper information pertaining to various aspects of data is needed as well. As a consequence, the individual slots in the run-time stack become more complex; they are known as activation records (since the call of a procedure is also known as its activation).


The chapter will discuss different ingredients and variations of the activation record, depending on features of the language.

Modifying the control flow

Translated program code

[Figure "Code memory": code for procedure 1 | code for procedure 2 | ... | code for procedure n]

• code segment: almost always considered as statically allocated
⇒ neither moved nor changed at runtime
• compiler aware of all addresses of "chunks" of code: entry points of the procedures
• but:
  – generated code often relocatable
  – final, absolute addresses given by linker / loader

Activation record

space for arg’s (parameters)

space for bookkeeping info, including return address

space for local data

space for local temporaries

Schematic activation record

• schematic organization of activation records/activation block/stack frame . . .
• goal: realize
  – parameter passing
  – scoping rules / local variables treatment
  – prepare for call/return behavior


• calling conventions on a platform

We will come back later to discuss possible designs for activation records in more detail, in the section about stack-based run-time environments. Activation records (also known as stack frames) are the elementary slots of call stacks, a central way to organize the dynamic memory for languages with (recursive) procedures. There are also limitations of stack-based organizations, which we also touch upon.
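As a purely illustrative sketch (slot sizes and their order are assumptions, not a fixed convention of the lecture), the schematic activation record from the slide above could be pictured as a C struct:

/* schematic activation record as a struct; real frames are laid
   out by the compiler, not declared as a data type */
struct activation_record {
  int   args[2];         /* space for arg's (parameters)              */
  void *control_link;    /* bookkeeping: caller's fp (dynamic link)   */
  void *return_addr;     /* bookkeeping: where to jump back to        */
  int   locals[4];       /* space for local data                      */
  int   temps[4];        /* space for local temporaries               */
};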

8.2 Different layouts

In the following, we cover different layouts, focusing first on the memory needs in connection with procedures (their local memory needs and other information to be maintained at run-time, to "make it work").

8.3 Static layout

Full static layout

[Figure "full static layout": code for main proc. | code for proc. 1 ... proc. n | global data area | act. record of main proc. | activation records of proc. 1 ... proc. n]

• static addresses of all of the memory known to the compiler
  – executable code
  – variables
  – all forms of auxiliary data (for instance big constants in the program, e.g., string literals)
• for instance: (old) Fortran
• nowadays rather seldom (or special applications like safety-critical embedded systems)

Fortran example


      PROGRAM TEST
      COMMON MAXSIZE
      INTEGER MAXSIZE
      REAL TABLE(10), TEMP
      MAXSIZE = 10
      READ *, TABLE(1), TABLE(2), TABLE(3)
      CALL QUADMEAN(TABLE, 3, TEMP)
      PRINT *, TEMP
      END

      SUBROUTINE QUADMEAN(A, SIZE, QMEAN)
      COMMON MAXSIZE
      INTEGER MAXSIZE, SIZE
      REAL A(SIZE), QMEAN, TEMP
      INTEGER K
      TEMP = 0.0
      IF ((SIZE .GT. MAXSIZE) .OR. (SIZE .LT. 1)) GOTO 99
      DO 10 K = 1, SIZE
        TEMP = TEMP + A(K)*A(K)
 10   CONTINUE
 99   QMEAN = SQRT(TEMP/SIZE)
      RETURN
      END

Static memory layout example/runtime environment

[Figure "static runtime environment": global area: MAXSIZE | main's act. record: TABLE(1)...(10), TEMP, 3 | act. record of QUADMEAN: A, SIZE, QMEAN, return address, TEMP, K, "scratch area"]

The details of the syntax and the exact way the program runs are not so important. Also for the layout on the next slides, the exact details don't matter too much. Important is the distinction between global variables and local ones, here those of the "subroutine" (procedure). The local part of the memory for the procedure is a first taste of an activation record. Later they will be organized in a stack, and then they are also called stack frames, but it's the same thing. It's space that will be used (at run-time) to cover the memory needs when calling the function (which is also known as "activation" of the function). That needed space involves slots used to pass arguments (parameter passing) and space for local variables. Needed also is a slot where to save the return address. Apart from the fact that exact details don't matter: what is often typical (and will also be typical in the lecture) is that the parameters are stored in slots before the return address and the local variables afterwards. In a way, it's a design choice, not a logical necessity, but it's common (also later), often arranged like that for reasons of efficiency. Later, the layout of the activation records will need some refinement, i.e., there will be more than the mentioned


information (parameters, local variables, return address) to be stored, when we have to deal withrecursion.

The back-arrows refer to parameter passing and the distinction between formal and actual param-eter. We come to parameter passing later.

Static memory layout example/runtime environment

in Fortran (here Fortran77)

• parameter passing as pointers to the actual parameters
• activation record for QUADMEAN contains space for intermediate results; the compiler calculates how much is needed
• note: one possible memory layout for FORTRAN 77; details vary, other implementations exist, as do more modern versions of Fortran

8.4 Stack-based runtime environments

Stack-based runtime environments

• so far: no(!) recursion
• everything's static, incl. placement of activation records
• ancient and restrictive arrangement of the run-time envs
• calls and returns (also without recursion) follow at runtime a LIFO (= stack-like) discipline

Stack of activation records

• procedures as abstractions with own local data
⇒ run-time memory arrangement where procedure-local data, together with other info (arrange proper returns, parameter passing), is organized as a stack

• AKA: call stack, runtime stack
• AR: exact format depends on language and platform

Situation in languages without local procedures

• recursion, but all procedures are global• C-like languages

Activation record info (besides local data, see later)

• frame pointer
• control link (or dynamic link)1
• (optional): stack pointer
• return address

1Later, we’ll encounter also static links (aka access links).


The notion of static links mentioned in the footnote is basically the same one we encountered before, when discussing the design of symbol tables, in particular how to arrange them properly for nested blocks and lexical binding. Here (resp. shortly later down the road), the static links serve the same purpose, only not linking up (parts of a) symbol table, but activation records.

Euclid’s recursive gcd algo

#include <stdio.h>

int x, y;

int gcd(int u, int v)
{ if (v == 0) return u;
  else return gcd(v, u % v);
}

int main()
{ scanf("%d%d", &x, &y);
  printf("%d\n", gcd(x, y));
  return 0;
}

Stack gcd

[Figure "Stack gcd": global/static area: x:15, y:10 | "AR of main" | a-record (1st call): x:15, y:10, control link, return address | a-record (2nd call): x:10, y:5, control link, return address | a-record (3rd call): x:5, y:0, control link (fp), return address (sp)]

• control link
  – aka: dynamic link
  – refers to caller's FP
• frame pointer FP
  – points to a fixed location in the current a-record
• stack pointer (SP)
  – border of current stack and unused memory
• return address: program address of the call site


Local and global variables and scoping

Code

int x = 2;        /* glob. var  */
void g(int);      /* prototype  */

void f(int n)
{ static int x = 1;
  g(n);
  x--;
}

void g(int m)
{ int y = m-1;
  if (y > 0)
  { f(y);
    x--;
    g(y);
  }
}

int main()
{ g(x);
  return 0;
}

• global variable x
• but: (different) x local to f
• remember C:
  – call by value
  – static lexical scoping

The code is artificial; it will later be used to illustrate the run-time stack in a simple setting. Being called with 2 initially, there are only four activations of the two functions f and g altogether.

Activation records and activation trees

• activation of a function: corresponds to a call of the function
• activation record
  – data structure for the run-time system
  – holds all relevant data for a function call and control info in "standardized" form
  – control behavior of functions: LIFO
  – if data cannot outlive the activation of a function
    ⇒ activation records can be arranged in a stack (like here)
  – in this case: activation record AKA stack frame

The two pictures illustrate the notion of activation tree (which, in the gcd case, is not much of a tree, as it's linear). An activation of gcd calls itself at most once (actually, gcd is tail-recursive).


Activation record and activation trees

[Figure "activation trees": GCD: main() → gcd(15,10) → gcd(10,5) → gcd(5,0); f and g example: main → g(2); g(2) → f(1), g(1); f(1) → g(1)]

Variable access and design of ARs

Layout g

• fp: frame pointer• m (in this example): parameter of g

Possible arrangement of g’s AR

• AR’s: structurally uniform per language (or at least compiler) / platform• different function defs, different size of AR⇒ frames on the stack differently sized• note: FP points

– not: “top” of the frame/stack, but– to a well-chosen, well-defined position in the frame


  – other local data (local vars) accessible relative to that
• conventions
  – higher addresses "higher up"
  – stack "grows" towards lower addresses

The pictures use the following convention: the shown "pointers" point to the "bottom" of the intended slot. For example, fp points to the control link, which has offset 0 from that pointer. The return address, in the slot below, has a negative offset to that pointer. Different presentations may make use of different graphical conventions. The graphical conventions are of course to be distinguished from the "calling conventions" and the design of the activation record. One agreement in this layout is: the fp points to the control link, i.e., the memory (perhaps a specific register) corresponding to the frame pointer contains the address of the control link.

Layout for arrays of statically known size

Code

void f(int x, char c)
{ int a[10];
  double y;
  ..
}

name   offset
x      +5
c      +4
a      -24
y      -32

access of c and y

c: 4(fp)
y: -32(fp)

access for a[i]

(-24 + 2*i)(fp)


Layout

The example makes some not implausible assumptions on the size of the involved data: the addresses count 4 words, the character 1, the integers 2 words, the double 8. Notation like 4(fp) is meant as ad-hoc syntax designating the memory obtained by interpreting the content of fp as an address and adding 4 words to it. We will later encounter different addressing modes (like indirect addresses etc.). Except in very early times, hardware gives support for more complex ways of accessing the memory, like support for specifying given offsets.

2 snapshots of the call stack

[Figure "snapshot 1": static: x:2, x:1 (@f) | main | g: m:2, control link, return address, y:1 | f: n:1, control link, return address | g: m:1, control link (fp), return address, y:0 (sp)]


[Figure "snapshot 2": static: x:1, x:0 (@f) | main | g: m:2, control link, return address, y:1 | g: m:1, control link (fp), return address, y:0 (sp)]

• note: call by value, x in f static

The picture on the slide refers to the simple, artificial C example involving procedures f and g,which was also used to illustrate the activation tree.

How to do the “push and pop”

• calling sequences: AKA linking conventions or calling conventions
• for RT environments: uniform design not just of
  – data structures (= ARs), but also of
  – uniform actions being taken when calling/returning from a procedure
• how to do the details of "push and pop" on the call stack

E.g: Parameter passing

• not just where (in the ARs) to find the value of the actual parameter needs to be defined, but also well-defined steps (ultimately code) that copy it there (and potentially read it from there)
• "jointly" done by compiler + OS + HW
• distribution of responsibilities between caller and callee:
  – who copies the parameter to the right place
  – who saves registers and restores them
  – . . .

Steps when calling

• For procedure call (entry)
  1. compute arguments, store them in the correct positions in the new activation record of the procedure (pushing them in order onto the runtime stack will achieve this)
  2. store (push) the fp as the control link in the new activation record
  3. change the fp, so that it points to the beginning of the new activation record. If there is an sp, copying the sp into the fp at this point will achieve this.
  4. store the return address in the new activation record, if necessary
  5. perform a jump to the code of the called procedure
  6. allocate space on the stack for local var's by appropriate adjustment of the sp

• procedure exit


  1. copy the fp to the sp (inverting 3. of the entry)
  2. load the control link into the fp
  3. perform a jump to the return address
  4. change the sp to pop the arg's
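The following toy C program simulates these entry and exit steps on an explicit stack. It is only a sketch under simplifying assumptions (one integer argument, "addresses" are array indices, fp points to the control-link slot as in the pictures); it is not generated code.

#include <stdio.h>

#define N 256
static int stk[N];
static int sp = 0;             /* index of the next free slot        */
static int fp = -1;            /* frame pointer (no frame yet)       */

static void proc_entry(int arg, int ret_addr, int n_locals) {
  stk[sp++] = arg;             /* 1. store argument in the new AR            */
  stk[sp]   = fp;              /* 2. push the old fp as control link         */
  fp = sp++;                   /* 3. fp now points to the control link       */
  stk[sp++] = ret_addr;        /* 4. store the return address                */
  sp += n_locals;              /* 6. allocate locals by adjusting sp         */
}                              /* (5., the jump to the code, is not modelled) */

static int proc_exit(int n_args) {
  int ret_addr = stk[fp + 1];  /* return address sits next to the link       */
  sp = fp;                     /* 1. copy fp to sp (locals are gone)         */
  fp = stk[sp];                /* 2. control link back into fp               */
  sp -= n_args;                /* 4. pop the arguments                       */
  return ret_addr;             /* 3. "jump" to the return address            */
}

int main(void) {
  proc_entry(2, 100, 1);       /* e.g. the call g(2) with one local y        */
  printf("fp=%d sp=%d\n", fp, sp);
  printf("returning to %d\n", proc_exit(1));
  return 0;
}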

Steps when calling g

Before call

[Figure "before call to g": rest of stack | m:2 | control link | return addr. (fp) | y:1 | ... (sp)]

Pushed m

[Figure "pushed param.": rest of stack | m:2 | control link | return addr. (fp) | y:1 | m:1 | ... (sp)]


Pushed fp

[Figure "pushed fp": rest of stack | m:2 | control link | return addr. (fp) | y:1 | m:1 | control link | ... (sp)]

Steps when calling g (cont’d)

Return pushed

[Figure "fp := sp, push return addr.": rest of stack | m:2 | control link | return addr. | y:1 | m:1 | control link | return address (fp) | ... (sp)]


local var’s pushed

[Figure "alloc. local var y": rest of stack | m:2 | control link | return addr. | y:1 | m:1 | control link | return address (fp) | y:0 | ... (sp)]


Treatment of auxiliary results: “temporaries”

Layout picture

[Figure: rest of stack | ... | control link | return addr. (fp) | ... | address of x[i] | result of i+j | result of i/k (sp) | new AR for f (about to be created)]

• calculations need memory for intermediate results
• called temporaries in ARs

x[i] = (i + j) * (i/k + f(j));

• note: x[i] represents an address or reference; i, j, k represent values
• assume a strict left-to-right evaluation (the call f(j) may change values)
• stack of temporaries
• [NB: compilers typically use registers as much as possible; what does not fit there goes into the AR.]

The array example uses arrays indexed by integers. Integers are good (efficient) for array offsets, so they act as "references". In a way, calculations like that are a form of pointer arithmetic. That, however, is not the message of the slide. The message of the slide is that the body of a procedure may involve more complex operations than elementary additions etc. The computations in the example are not really complex from a programming perspective, but they are compound. Perhaps there is hardware support for x + y, x - y, x + 1 etc., but compound expressions are of course not natively supported. They have to be broken down into elementary calculations, and the intermediate results need to be stored somewhere. The memory entities for those intermediate results are called temporaries. We will encounter them when talking about code generation (where we need to generate code that breaks down compound expressions into individual steps). That comes later; for the run-time environment, the design of the activation record must provide enough space to be able to locally store those results.

The side remark says that one often tries to avoid putting all local temporaries inside the activation record; as much as possible, one would like to use registers for that.
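As an illustration (assuming i, j, k, x and f are declared somewhere else; the temporary names t1..t3 are invented), the compound assignment from the slide could be broken into elementary steps roughly like this, with the address of x[i] computed first under left-to-right evaluation:

int f(int);                  /* assumed: some function that may change i, j, k */
extern int i, j, k, x[10];

void assign(void) {
  int *addr = &x[i];         /* temporary: address of x[i]           */
  int t1 = i + j;            /* temporary: result of i + j           */
  int t2 = i / k;            /* temporary: result of i / k           */
  int t3 = f(j);             /* the call may change i, j, k          */
  *addr = t1 * (t2 + t3);
}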


Variable-length data

Ada code

type Int_Vector is
  array (INTEGER range <>) of INTEGER;

procedure Sum(low, high : INTEGER;
              A : Int_Vector) return INTEGER
is
  i : integer
begin
  ...
end Sum;

• Ada example
• assume: array passed by value ("copying")
• A[i]: calculated as @6(fp) + 2*i
• in Java and other languages: arrays passed by reference
• note: space for A (as ref) and size of A is fixed-size (as well as low and high)

Layout picture

[Figure "AR of call to SUM": rest of stack | low | high | A | size of A: 10 | control link | return addr. (fp) | i | A[9] ... A[0] | ... (sp)]

The picture and the slide simply say: if an array passed as argument is allowed to have a non-fixed size, that's fine. When passing the array, the size is known; just store the size at one particular, agreed-upon place in the activation record (here offset 6), and then use that value for your calculation when accessing a slot. So, compared to the previous handling of arrays, there is just one extra layer of indirection involved.


Nested declarations (“compound statements”)

C Code

void p(int x, double y)
{ char a;
  int i;
  ...;
 A: { double x;
      int j;
      ...;
    }
  ...;
 B: { char *a;
      int k;
      ...;
    };
  ...;
}

Nested blocks layout (1)

[Figure "area for block A allocated": rest of stack | x | y | control link | return addr. (fp) | a | i | x | j | ... (sp)]


Nested blocks layout (2)

[Figure "area for block B allocated": rest of stack | x | y | control link | return addr. (fp) | a | i | a | k | ... (sp)]

Explanations

The terminology of compound statements seems not widely used, at least not in the sense used here. The gist of the example is: if one has local scopes of that kind (here called A and B), there is no need to allocate space for both (in that way it's treated in the same spirit as union types). The space for the local variables from the first scope may be reused for the needs of the second. There is also no need to officially "push" and "pop" activation records following the calling conventions (though nested scopes do follow a stack discipline and they could be treated as "inlined" calls to anonymous, parameterless procedures).

8.5 Stack-based RTE with nested procedures

What follows in this section (illustrated with Pascal) is to relax one restriction we had so far wrt. the nature of variables. It may not have been obvious, but it should become so now: we were operating with a C-like language, by which one means: lexical scoping and non-nested functions or procedures. That means there are only two "kinds" of variables: global ones (which are static) and local ones (which are in the current stack frame). The local ones can be accessed by offsets from the frame pointer.

Now, with nested procedures (and still lexical scoping), there are variables that are neither static nor residing in the current stack frame. So we need a way to access those at run-time. That will be done (in a Pascal-like language) by introducing static links.


Nested procedures in Pascal

The code is in some form of Pascal. The comments after the begin and end statements indicate to which procedure that part belongs. Since q is nested in p, and since p has a local variable n in the same scope, this local variable n is accessible inside q. At run-time, in a call to q, the corresponding activation record will reside on the run-time stack. If the body of q makes use of n (not explicitly shown in the skeletal code), it needs a way to locate the content. From the perspective of q, the variable is neither local to q nor global. It's of course local to . . .

program nonLocalRef;
procedure p;
var n : integer;

  procedure q;
  begin
    (* a ref to n is now
       non-local, non-global *)
  end; (* q *)

  procedure r(n : integer);
  begin
    q;
  end; (* r *)

begin (* p *)
  n := 1;
  r(2);
end; (* p *)

begin (* main *)
  p;
end.

• proc. p contains q and r nested
• also "nested" (i.e., local) in p: integer n
  – in scope for q and r but
  – neither global nor local to q and r

Accessing non-local var’s

Stack layout

[Figure "calls m → p → r → q": vars of main | p: control link, return addr., n:1 | r: n:2, control link, return addr. | q: control link (fp), return addr. (sp)]

• n in q: under lexical scoping, the n declared in procedure p is meant
• this is not reflected in the stack (of course), as this stack represents the run-time call stack
• remember: static links (or access links) in connection with symbol tables


Symbol tables

• “name-addressable” mapping• access at compile time• cf. scope tree

Dynamic memory

• "address-addressable" mapping
• access at run time
• stack-organized, reflecting paths in the call graph
• cf. activation tree

Access link as part of the AR

Stack layout

[Figure "calls m → p → r → q": vars of main (no access link) | p: control link, return addr., n:1 | r: n:2, access link, control link, return addr. | q: access link, control link (fp), return addr. (sp)]

• access link (or static link): part of the AR (at a fixed position)
• points to the stack frame representing the current AR of the statically enclosing "procedural" scope

Example with multiple levels

program chain;

procedure p;
var x : integer;

  procedure q;
    procedure r;
    begin
      x := 2;
      ...;
      if ... then p;
    end; (* r *)
  begin
    r;
  end; (* q *)

begin
  q;
end; (* p *)

begin (* main *)
  p;
end.

Access chaining

Layout

[Figure "calls m → p → q → r": AR of main (no access link) | p: control link, return addr., x:1 | q: access link, control link, return addr. | r: access link, control link (fp), return addr. (sp)]

• program chain
• access (conceptual): fp.al.al.x
• access link slot: fixed "offset" inside AR (but: AR's differently sized)
• "distance" from current AR to place of x
  – not fixed, i.e.
  – statically unknown!
• However: number of access link dereferences statically known
• lexical nesting level

Implementing access chaining

As example:

fp.al.al.al. ... al.x

• access needs to be fast => use registers
• assume: fp is in a dedicated register

4(fp)  -> reg   // 1
4(reg) -> reg   // 2
...
4(reg) -> reg   // n = difference in nesting levels
6(reg)          // access content of x


• often: not so many block levels / access chains necessary
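A sketch of access chaining in C (the frame layout, with the access link at a fixed position, and the names are assumptions for illustration): the number of link dereferences n_levels is statically known, while the actual chasing happens at run-time.

struct frame {
  struct frame *access_link;    /* static link to the defining scope      */
  struct frame *control_link;   /* dynamic link (caller's frame)          */
  int           locals[4];      /* local variables of this activation     */
};

/* follow n_levels access links, then read the variable at the given offset */
int access_nonlocal(struct frame *fp, int n_levels, int offset) {
  struct frame *f = fp;
  for (int i = 0; i < n_levels; i++)   /* n_levels known at compile time   */
    f = f->access_link;
  return f->locals[offset];
}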

Calling sequence

• For procedure call (entry)
  1. compute arguments, store them in the correct positions in the new activation record of the procedure (pushing them in order onto the runtime stack will achieve this)
  2. push the access link, with its value calculated via link chaining ("fp.al.al...."); then store (push) the fp as the control link in the new AR
  3. change the fp, so that it points to the "beginning" of the new AR. If there is an sp, copying the sp into the fp at this point will achieve this.
  4. store the return address in the new AR, if necessary
  5. perform a jump to the code of the called procedure
  6. allocate space on the stack for local var's by appropriate adjustment of the sp
• procedure exit
  1. copy the fp to the sp
  2. load the control link into the fp
  3. perform a jump to the return address
  4. change the sp to pop the arg's and the access link

Calling sequence: with access links

Layout

[Figure "after 2nd call to r": AR of main (no access link) | p: control link, return addr., x | q: access link, control link, return addr. | r: access link, control link, return addr. | p: no access link, control link, return addr., x | q: access link, control link, return addr. | r: access link, control link (fp), return addr. (sp)]

• main → p → q → r → p → q → r
• calling sequence: actions to do the "push & pop"
• distribution of responsibilities between caller and callee
• generate an appropriate access chain; chain length statically determined
• actual computation (of course) done at run-time


8.6 Functions as parameters

Nested procedures in Pascal

Access link (again)

Procedures as parameter

program closureex(output);

procedure p(procedure a);
begin
  a;
end;

procedure q;
var x : integer;

  procedure r;
  begin
    writeln(x);   (* ``non-local'' *)
  end;

begin
  x := 2;
  p(r);
end; (* q *)

begin (* main *)
  q;
end.

Procedures as parameters, same example in Go

package mainimport ( " fmt " )

var p = func ( a ( func ( ) ( ) ) ) { // ( u n i t −> u n i t ) −> u n i ta ( )

}

var q = func ( ) {var x = 0var r = func ( ) {fmt . P r i n t f ( " x = %v " , x )}x = 2p ( r ) // r as argument

}

func main ( ) {q ( ) ;

}

Procedures as parameters, same example in ocaml

let p (a : unit -> unit) : unit = a ();;

let q () =
  let x : int ref = ref 1
  in let r = function () -> (print_int !x)   (* deref *)
  in
    x := 2;          (* assignment to ref-typed var *)
    p (r);;

q ();;               (* ``body of main'' *)

Page 307: CourseScript - uio.no

304 8 Run-time environments8.6 Functions as parameters

Closures and the design of ARs

• [9] rather "implementation centric"
• closure there:
  – restricted setting
  – specific way to achieve closures
  – specific semantics of non-local vars ("by reference")
• higher-order functions:
  – functions as arguments and return values
  – nested function declaration
• similar problems with: "function variables"
• Example shown: only procedures as parameters, not returned

Closures, schematically

• independent from concrete design of the RTE/ARs:• what do we need to execute the body of a procedure?

Closure (abstractly)

A closure is a function body² together with the values for all its variables, including the non-local ones.

• individual AR not enough for all variables used (non-local vars)
• in stack-organized RTE's:
  – fortunately, ARs are stack-allocated
  → with clever use of "links" (access/static links): possible to access variables that are "nested further out" / deeper in the stack (following links)

Organize access with procedure parameters

• when calling p: allocate a stack frame
• executing p calls a => another stack frame
• number of parameters etc.: knowable from the type of a
• but 2 problems

“control-flow” problem

currently only an RTE question, but: how can the compiler arrange that p calls a (and allocates a frame for a) if a is not known yet?

data problem

How can one statically arrange that a will be able to access non-local variables if statically it’s not knownwhat a will be?

• solution: for a procedure variable (like a), store in the AR
  – a reference to the code of the argument (as representation of the function body)
  – a reference to the frame, i.e., the relevant frame pointer (here: to the frame of q where r is defined)
• this pair = closure!

2Resp.: at least the possibility to locate them.
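The following self-contained C sketch mimics the Pascal/Go/OCaml example: the closure for r is the pair of a code pointer and a pointer to (a stand-in for) q's frame. All names (q_frame, r_code, ...) are invented for illustration; they are not the course's notation.

#include <stdio.h>

struct q_frame { int x; };                   /* stand-in for q's AR          */

typedef struct {
  void (*ip)(struct q_frame *);              /* code pointer (body of r)     */
  struct q_frame *ep;                        /* environment: q's frame       */
} closure;

static void r_code(struct q_frame *env) {    /* body of r uses non-local x   */
  printf("%d\n", env->x);
}

static void p(closure a) {                   /* p(procedure a): call via the closure */
  a.ip(a.ep);
}

int main(void) {                             /* plays the role of q          */
  struct q_frame q = { 0 };
  closure r = { r_code, &q };
  q.x = 2;
  p(r);                                      /* prints 2                     */
  return 0;
}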


Closure for formal parameter a of the example

• stack after the call to p• closure 〈ip, ep〉• ep: refers to q’s frame pointer• note: distinction in calling sequence for

– calling “ordinary” proc’s and– calling procs in proc parameters (i.e., via closures)

• that may be unified (“closures” only)

After calling a (= r)

• note: static link of the new frame: used from the closure!


Making it uniform

• note: calling conventions differ– calling procedures as formal parameters– “standard” procedures (statically known)

• treatment can be made uniform

Limitations of stack-based RTEs

• procedures: central (!) control-flow abstraction in languages
• stack-based allocation: intuitive, common, and efficient (supported by HW)
• used in many languages
• procedure calls and returns: LIFO (= stack) behavior
• AR: local data for the procedure body

Underlying assumption for stack-based RTEs

The data (=AR) for a procedure cannot outlive the activation where they are declared.

• assumption can break for many reasons
  – returning references to local variables
  – higher-order functions (or function variables)
  – "undisciplined" control flow (rather deprecated; goto's can break any scoping rules, or the procedure abstraction)
  – explicit memory allocation (and deallocation), pointer arithmetic etc.

Dangling ref’s due to returning references

int *dangle(void) {
  int x;        // local var
  return &x;    // address of x
}

• similar: returning references to objects created via new• variable’s lifetime may be over, but the reference lives on . . .


Function variables

program Funcvar;
var pv : Procedure (x : integer);    (* procedure var *)

Procedure Q();
var
  a : integer;
  Procedure P(i : integer);
  begin
    a := a + i;                      (* a def'ed outside *)
  end;
begin
  pv := @P;        (* ``return'' P (as side effect) *)
end;               (* "@" dependent on dialect *)

begin               (* here: free Pascal *)
  Q();
  pv(1);
end.

funcvar
Runtime error 216 at $0000000000400233
$0000000000400233
$0000000000400268
$00000000004001E0

Functions as return values

package main

import ("fmt")

var f = func() (func(int) int) { // unit -> (int -> int)
	var x = 40                // local variable
	var g = func(y int) int { // nested function
		return x + 1
	}
	x = x + 1 // update x
	return g  // function as return value
}

func main() {
	var x = 0
	var h = f()
	fmt.Println(x)
	var r = h(1)
	fmt.Printf("r = %v", r)
}

• function g
  – defined local to f
  – uses x, non-local to g, local to f
  – is being returned from f

Fully-dynamic RTEs

• full higher-order functions = functions are "data" the same as everything else
  – functions being locally defined
  – functions as arguments to other functions
  – functions returned by functions
→ ARs cannot be stack-allocated
• closures needed, but heap-allocated (≠ Louden)
• objects (and references): heap-allocated
• less "disciplined" memory handling than stack allocation
• garbage collection
• often: stack-based allocation + fully-dynamic (= heap-based) allocation

The stack discipline can be seen as a particularly simple (and efficient) form of garbage collection: returning from a function makes it clear that the local data can be trashed.


8.7 Parameter passing

Communicating values between procedures

• procedure abstraction, modularity
• parameter passing = communication of values between procedures
• from caller to callee (and back)
• binding actual parameters
• with the help of the RTE
• formal parameters vs. actual parameters
• two modern versions
  1. call by value
  2. call by reference

CBV and CBR, roughly

Core distinction/question

on the level of caller/callee activation records (on the stack frame): how does the AR of the callee get holdof the value the caller wants to hand over?

1. callee’s AR with a copy of the value for the formal parameter2. the callee AR with a pointer to the memory slot of the actual parameter

• if one has to choose only one: it’s call-by-value• remember: non-local variables (in lexical scope), nested procedures, and even closures:

– those variables are “smuggled in” by reference– [NB: there are also by value closures]

CBV is in a way the prototypical, most dignified way of parameter passing, supporting the procedure abstraction. If one has references (explicit or implicit, typically of data on the heap), then one has call-by-value-of-references, which in some way "feels" to the programmer like call-by-reference. Some people even call that call-by-reference, even if it's technically not.

Parameter passing by-value

• in C: CBV only parameter passing method
• in some lang's: formal parameters "immutable"
• straightforward: copy actual parameters → formal parameters (in the ARs)

C examples

void inc2(int x)
{ ++x, ++x; }

void inc2(int *x)
{ ++(*x), ++(*x); }
/* call: inc2(&y) */

void init(int x[], int size) {
  int i;
  for (i = 0; i < size; ++i) x[i] = 0;
}


arrays: “by-reference” data

Call-by-reference

• hand over pointer/reference/address of the actual parameter
• useful especially for large data structures
• typically (for cbr): actual parameters must be variables
• Fortran actually allows things like P(5,b) and P(a+b,c).

void inc2(int *x)
{ ++(*x), ++(*x); }
/* call: inc2(&y) */

void P(p1, p2) {
  ..
  p1 = 3
}
var a, b, c;
P(a, c)

Call-by-value-result

• call-by-value-result can give different results from cbr
• allocated as a local variable (as cbv)
• however: copied "two-way"
  – when calling: actual → formal parameters
  – when returning: actual ← formal parameters
• aka: "copy-in-copy-out" (or "copy-restore")
• Ada's in and out parameters
• when are the values of the actual variables determined when doing "actual ← formal parameters"?
  – when calling
  – when returning
• not the cleanest parameter passing mechanism around . . .

Call-by-value-result example

void p(int x, int y)
{
  ++x;
  ++y;
}

main()
{ int a = 1;
  p(a, a);   // :-O
  return 0;
}

• C-syntax (C has cbv, not cbvr)
• note: aliasing (via the arguments, here obvious)
• cbvr: same as cbr, unless aliasing "messes it up"³
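A small C sketch that simulates copy-in-copy-out by hand (C itself has neither cbvr nor genuine cbr; the explicit copies and the pointer version below are only stand-ins) and shows how the aliasing in p(a,a) makes cbvr and cbr differ:

#include <stdio.h>

static void p_body(int *x, int *y) { ++*x; ++*y; }

int main(void) {
  int a = 1;

  /* call-by-value-result, simulated: copy-in */
  int x = a, y = a;
  p_body(&x, &y);
  /* copy-out: actual <- formal, in parameter order */
  a = x;                          /* a becomes 2        */
  a = y;                          /* a becomes 2 again  */
  printf("cbvr: a = %d\n", a);    /* prints 2           */

  /* genuine call-by-reference for comparison */
  a = 1;
  p_body(&a, &a);
  printf("cbr:  a = %d\n", a);    /* prints 3           */
  return 0;
}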

Call-by-name (C-syntax)

• most complex (or is it . . . ?)
• hand over: textual representation ("name") of the argument (substitution)
• in that respect: a bit like macro expansion (but lexically scoped)
• actual parameter not calculated before actually used!
• on the other hand: if needed more than once: recalculated over and over again
• aka: delayed evaluation
• Implementation
  – actual parameter: represented as a small procedure (thunk, suspension), if the actual parameter is an expression
  – optimization, if the actual parameter is a variable (works like call-by-reference then)

Call-by-name examples

• in (imperative) languages without procedure parameters:
  – delayed evaluation most visible when dealing with things like a[i]
  – a[i] is actually like "apply a to index i"
  – combine that with side-effects (i++) ⇒ pretty confusing

Example 1

void p(int x) { ...; ++x; }

• call as p(a[i])
• corresponds to ++(a[i])
• note:
  – ++ _ has a side effect
  – i may change in ...

Example 2

int i;
int a[10];
void p(int x) {
  ++i;
  ++x;
}

main() {
  i = 1;
  a[1] = 1;
  a[2] = 2;
  p(a[i]);
  return 0;
}

3One can ask, though, whether call-by-reference would not be messed up in the example already.
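Call-by-name can be mimicked in C by passing thunks, i.e., small functions that re-evaluate the actual parameter at every use. The sketch below (the names get_x, set_x etc. are invented) reproduces the behavior of Example 2: the ++i inside p changes which array slot the formal x denotes.

#include <stdio.h>

static int i;
static int a[10];

static int  get_x(void)  { return a[i]; }    /* thunk: read  a[i] */
static void set_x(int v) { a[i] = v; }       /* thunk: write a[i] */

static void p(int (*x_get)(void), void (*x_set)(int)) {
  ++i;
  x_set(x_get() + 1);     /* ++x under call-by-name */
}

int main(void) {
  i = 1; a[1] = 1; a[2] = 2;
  p(get_x, set_x);        /* p(a[i]): increments a[2], not a[1]           */
  printf("a[1]=%d a[2]=%d i=%d\n", a[1], a[2], i);   /* prints 1 3 2      */
  return 0;
}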


Another example: “swapping”

int i; int a[i];

swap(int a, b) {
  int i;
  i = a;
  a = b;
  b = i;
}

i = 3;
a[3] = 6;

swap(i, a[i]);

• note: local and global variable i

Call-by-name illustrations

Code

procedure P(par) : name par, int par
begin
  int x, y;
  ...
  par := x + y;     (* alternative: x := par + y *)
end;

P(v); P(r.v); P(5); P(u+v)

               v     r.v    5      u+v
par := x+y     ok    ok     error  error
x := par + y   ok    ok     ok     ok

Call by name (Algol)

begin comment Simple array example;
  procedure zero (Arr, i, j, u1, u2);
    integer Arr;
    integer i, j, u1, u2;
  begin
    for i := 1 step 1 until u1 do
      for j := 1 step 1 until u2 do
        Arr := 0
  end;

  integer array Work[1:100, 1:200];
  integer p, q, x, y, z;
  x := 100;
  y := 200;
  zero(Work[p,q], p, q, x, y);
end

Lazy evaluation

• call-by-name
  – complex & potentially confusing (in the presence of side effects)
  – not really used (there)
• declarative/functional languages: lazy evaluation
• optimization:


  – avoid recalculation of the argument
  ⇒ remember (and share) results after first calculation ("memoization")
  – works only in the absence of side effects
• most prominently: Haskell
• useful for operating on infinite data structures (for instance: streams)

Lazy evaluation / streams

magic :: Int -> Int -> [Int]
magic 0 _ = []
magic m n = m : (magic n (m+n))

getIt :: [Int] -> Int -> Int
getIt []     _ = undefined
getIt (x:xs) 1 = x
getIt (x:xs) n = getIt xs (n-1)

8.8 Virtual methods in OO

Object-orientation

• class-based/inheritance-based OO
• classes and sub-classes
• typed references to objects
• virtual and non-virtual methods

Virtual and non-virtual methods + fields

class A {
  int x, y;
  void f(s, t)         { ... F_A ... };
  virtual void g(p, q) { ... G_A ... };
};

class B extends A {
  int z;
  void f(s, t)         { ... F_B ... };
  redef void g(p, q)   { ... G_B ... };
  virtual void h(r)    { ... H_B ... };
};

class C extends B {
  int u;
  redef void h(r)      { ... H_C ... };
}


Call to virtual and non-virtual methods

non-virtual method f

call    target
rA.f    F_A
rB.f    F_B
rC.f    F_B

virtual methods g and h

call    target
rA.g    G_A or G_B
rB.g    G_B
rC.g    G_B
rA.h    illegal
rB.h    H_B or H_C
rC.h    H_C


Late binding/dynamic binding

• details very much depend on the language/flavor of OO
  – single vs. multiple inheritance?
  – method update, method extension possible?
  – how much information available (e.g., static type information)?
• simple approach: "embedding" methods (as references)
  – seldom done (but needed for updateable methods)
• using the inheritance graph
  – each object keeps a pointer to its class (to locate virtual methods)
• virtual function table
  – in static memory
  – no traversal necessary
  – class structure needs to be known at compile-time
  – C++

Virtual function table

• static check ("type check") of rX.f()
  – for virtual methods: f must be defined in X or one of its superclasses
• non-virtual binding: finalized by the compiler (static binding)
• virtual methods: enumerated (with offset) from the first class with a virtual method; redefinitions get the same "number"
• object "headers": point to the class's virtual function table
• rA.g():

call r_A.virttab[g_offset]

• compiler knows
  – g_offset = 0
  – h_offset = 1
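A hedged sketch of a virtual function table in C (the layout and all names are assumptions, in the spirit of the A/B example above, with g_offset = 0 and h_offset = 1):

#include <stdio.h>

struct object { void (**vtab)(struct object *); };   /* header: pointer to the class's vtable */

static void g_A(struct object *self) { (void)self; printf("G_A\n"); }
static void g_B(struct object *self) { (void)self; printf("G_B\n"); }
static void h_B(struct object *self) { (void)self; printf("H_B\n"); }

static void (*vtab_A[])(struct object *) = { g_A };        /* class A: only g      */
static void (*vtab_B[])(struct object *) = { g_B, h_B };   /* class B: g and h     */

int main(void) {
  struct object a = { vtab_A }, b = { vtab_B };
  struct object *rA = &b;          /* static type A, dynamic type B                */
  rA->vtab[0](rA);                 /* rA.g(): call rA.virttab[g_offset], gives G_B */
  (void)a;
  return 0;
}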


Virtual method implementation in C++

• according to [9]

class A {
public:
  double x, y;
  void f();
  virtual void g();
};

class B : public A {
public:
  double z;
  void f();
  virtual void h();
};

Untyped references to objects (e.g. Smalltalk)

• all methods virtual• problem of virtual-tables now: virtual tables need to contain all methods of all classes• additional complication: method extension, extension methods• Thus: implementation of r.g() (assume: f omitted)

– go to the object’s class


– search for g following the superclass hierarchy.

8.9 Garbage collection

Management of dynamic memory: GC & alternatives

• dynamic memory: allocation & deallocation at run-time• different alternatives

1. manual– “alloc”, “free”– error prone

2. “stack” allocated dynamic memory– typically not called GC

3. automatic reclaim of unused dynamic memory– requires extra provisions by the compiler/RTE

Heap

• "heap": unrelated to the well-known heap data structure from A&D
• part of the dynamic memory
• typically contains
  – objects, records (which are dynamically allocated)
  – often: arrays as well
  – for "expressive" languages: heap-allocated activation records
    ∗ coroutines (e.g. Simula)
    ∗ higher-order functions


[Figure "Memory": code area | global/static area | stack | free space | heap]

Problems with free use of pointers

int *dangle(void) {
  int x;        // local var
  return &x;    // address of x
}

typedef int (*proc)(void);

proc g(int x) {
  int f(void) {       /* illegal */
    return x;
  }
  return f;
}

main() {
  proc c;
  c = g(2);
  printf("%d\n", c());   /* 2? */
  return 0;
}

• as seen before: references, higher-order functions, coroutines etc. ⇒ heap-allocated ARs
• higher-order functions: typical for functional languages
• heap memory: no LIFO discipline
• unreasonable to expect the user to "clean up" AR's (already alloc and free are error-prone)
• ⇒ garbage collection (already dating back to 1958/Lisp)

Some basic design decisions

• gc is approximative, but a non-negotiable condition: never reclaim cells which may be used in the future
• one basic decision:
  1. never move "objects"
     – may lead to fragmentation
  2. move objects which are still needed
     – extra administration/information needed
     – all references to moved objects need adaptation
     – all free space collected adjacently (defragmentation)
• when to do gc?
• how to get info about definitely unused / potentially used objects?
  – "monitor" the interaction program ↔ heap while it runs, to keep "up-to-date" all the time
  – inspect (at appropriate points in time) the state of the heap

Objects here are meant as heap-allocated entities, which in OO languages includes objects, but herereferring also to other data (records, arrays, closures . . . ).


Mark (and sweep): marking phase

• observation: heap addresses are only reachable
  – directly: through variables (with references), kept in the run-time stack (or registers)
  – indirectly: following fields in reachable objects, which point to further objects . . .
• heap: graph of objects, entry points aka "roots" or root set
• mark: starting from the root set:
  – find reachable objects, mark them as (potentially) used
  – one boolean (= 1 bit of info) as mark
  – depth-first search of the graph

Marking phase: follow the pointers via DFS

• the layout (or "type") of objects needs to be known to determine where the pointers are
• food for thought: doing DFS requires a stack, in the worst case of comparable size to the heap itself

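For illustration, a minimal sketch of the marking phase in C (the object layout and the fixed field count are assumptions); note that the recursion below uses the C call stack, which is exactly the worst-case space issue mentioned above.

#include <stdbool.h>
#define MAX_FIELDS 2

struct object {
  bool marked;                          /* the 1-bit mark                    */
  struct object *fields[MAX_FIELDS];    /* pointers to further objects       */
};

static void mark(struct object *o) {
  if (o == 0 || o->marked) return;      /* null or already visited           */
  o->marked = true;
  for (int i = 0; i < MAX_FIELDS; i++)  /* follow the pointers (DFS)         */
    mark(o->fields[i]);
}

/* roots: variables on the run-time stack / in registers */
static void mark_phase(struct object **roots, int n) {
  for (int i = 0; i < n; i++)
    mark(roots[i]);
}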

Compaction

[Figure: heap after marking]


[Figure: heap after compaction]

After marking?

• known classification into "garbage" and "non-garbage"
• pool of "unmarked" objects
• however: the "free space" is not really ready at hand
• two options:
  1. sweep
     – go through the heap again, this time sequentially (no graph search)
     – collect all unmarked objects in a free list
     – objects remain in their place
     – when the RTE needs to allocate a new object: grab a free slot from the free list
  2. compaction as well:
     – avoid fragmentation
     – move non-garbage to one place, the rest is one big free space
     – when moving objects: adjust pointers

Stop-and-copy

• variation of the previous compaction
• mark & compaction can be done in one recursive pass
• space for heap management
  – split into two halves
  – only one half used at any given point in time
  – compaction by copying all non-garbage (marked) to the currently unused half


Step by step


Chapter 9
Intermediate code generation

What is it about?

Learning Targets of this Chapter
1. intermediate code
2. three-address code and P-code
3. translation to those forms
4. translation between those forms

Contents
9.1 Intro
9.2 Intermediate code
9.3 Three address (intermediate) code
9.4 P-code
9.5 Generating P-code
9.6 Generation of three address code
9.7 Basic: From P-code to 3A-Code and back: static simulation & macro expansion
9.8 More complex data types
9.9 Control statements and logical expressions

9.1 Intro

The chapter is called intermediate code generation. At the current stage in the lecture (and the current "stage" in a compiler), we have to process as input an abstract syntax tree which has been type-checked and which thus is equipped with relevant type information. As discussed, key type information is often not stored inside the AST, but associated with it via a symbol table. More precisely, the symbol table mostly stores type information for variables, identifiers, etc., not for all nodes of the AST, since that is typically sufficient. As far as code generation is concerned, we have at least gotten a feeling for certain aspects of it, without details, namely in connection with implementing high-level abstractions for data: the layout of how certain types can be implemented and how scoping, memory management etc. are arranged. As far as the control part of a program is concerned (not the data part), we also know that the run-time environment maintains a stack of return addresses to take care of the call-return behavior of the procedure abstraction. We have also seen (not in very much detail) the so-called calling conventions and calling sequences, low-level instructions that take care of the "data aspects" of maintaining the procedure abstraction (taking care of parameter passing, etc.). All of that was done, as said, not with concrete (machine) code, but by explaining what needs to be achieved and how those aspects (memory management, stack arrangement etc.) are designed.

The task of code generation is to generate instructions which are put into the code segment, which is a part of the static part of the memory. That concept was discussed in the introductory part of the chapter covering run-time environments. Basically, the task is to translate procedure bodies into sequences of instructions.

Ultimately, the generated instructions are binaries, resp. machine code, which is platform dependent. Generating platform-dependent code is the task of the back-end. However, the task of generating code is


split into first generating intermediate code and afterwards "real code". This chapter is about intermediate code generation.

Making use of intermediate code is not particular to this lecture. The use of some form of intermediate code as another intermediate representation internal to the compiler is commonplace. The intermediate code may take different forms, however, and we will encounter two flavors.

Why does one want another intermediate representation, as opposed to going all the way to machine code in one step? There are a couple of reasons for that. Code generation is not altogether trivial. Especially since the lower ends of the compiler are where one may throw many different and complex optimizations at the task, modularizing the task into smaller subphases is good design. Related to that: doing it stepwise helps with portability. The intermediate code is still largely machine independent. It may resemble the instruction set of typical hardware (or, more likely, a subset of such an instruction set, leaving out "esoteric" specialized commands some hardware may offer). But it's not the exact instruction set, also in that the IR may still rely on some abstractions which are not available in any hardware binaries. That may mean that the IC still works with variables and temporaries, whereas ultimately the real code operates on addresses and registers.

If one has some "machine-code"-resembling intermediate representation, the task of porting a compiler to a new platform is easier. Furthermore, one can start doing certain code analyses and optimizations already on the IC, thereby making optimizations available for all platform-dependent back-ends without reinventing the wheel multiple times. Of course, analyses and optimizations could and should also be done in the platform-dependent phase. For instance, of vital importance for the ultimate performance of the code is the good use of registers. That, however, is platform dependent: different chips offer different amounts of register memory and support different ways of using them, for instance for indexed access of main memory.

Also in this lecture, the chapter about intermediate code generation postpones the issue of registers to the subsequent phase and chapter.

We said that the IR is platform independent. That does not mean that it may not be "influenced" by the targeted platforms. There are different flavors of instruction sets (RISC vs. CISC, three-address code, two-address code etc.), and the intermediate code has to make a choice which flavor of instructions it plans to resemble most.

We will deal with two prominent ways. One is a three-address code, the other one is P-code (which couldbe also called 1-address code). The latter one does not resembles typical instruction sets, but is a knownIC format nonetheless. It resembles (conceptually) byte-code.

Schematic anatomy of a compiler1

1This section is based on slides from Stein Krogdahl, 2015.

Page 326: CourseScript - uio.no

9 Intermediate code generation9.1 Intro 323

• code generator:– may in itself be “phased”– using additional intermediate representation(s) (IR) and intermediate code

A closer look

Various forms of “executable” code

• different forms of code: relocatable vs. “absolute” code, relocatable code from libraries, assembler,etc.

• often: specific file extensions– Unix/Linux etc.

∗ asm: *.s∗ rel: *.o∗ rel. from library: *.a∗ abs: files without file extension (but set as executable)

– Windows:∗ abs: *.exe2

• byte code (specifically in Java)– a form of intermediate code, as well– executable on the JVM– in .NET/C]: CIL

∗ also called byte-code, but compiled further

There are many different forms of code. One big distriction is between code “natively” executable, i.e.,on a particular (HW) platform on the one hand, and “byte code” or related concepts on the other. Thelatter is a Java-centric terminology, while the underlying concept is not. It’s actually sometimes calledp-code (representing portable code or interpreter code. It’s not natively executed but run in an interpreteror virtual machine (for Java byte code, that’s of course the JVM). The terminology “byte code” refers tothe fact that the op-codes, i.e., instructions of the byte code language, are intended to be represented byone byte. That piece of information, that opcodes fit into one byte, does not give much insight, though,and there may be many different “byte code representation”. They are often intendend to be executed ona virtual machine, but of course they can also be used as another intermediate representation (in the senseof the topic of this chapter). A virtual machine is an “machine” simulated in software, and the architecturecan resemble the execution mechanism of HW, or can follow principles typically not found in HW. Forexample, one typical architecture is a stack machine. One find also virtual machines that resemble registermachines.

2.exe-files include more, and “assembly” in .NET even more

Page 327: CourseScript - uio.no

324 9 Intermediate code generation9.1 Intro

We will look into two formats, one we call p-code, one we call three-address intermediate code (3AIC). Ascan be seen from the above remarks, the terminology is a bit unclear. P-code normally stands for portablecode, but 3AIC is also portable. P-code here resemebles (at least conceptually) Java byte code, but alsothe op-code of 3AIC would fit into one byte.

As further remark concerning interpretation and “virtual machines” and virtualization in general. Thedistinction between compilation and interpretation is not a matter of black and white. Already in theintroduction chapter, there was speaking of “full interpretation” where the execution is done directly onthe user syntax is rather seldom. When saying, directly on the syntax, that can also be abstract syntax,which is seen as “basically” as the programming language syntax, just stripped from the particularitiesof concrete syntax. But rewriting directly in the character string level is unpractical mostly. Interpretinga language on a virtual machine is already quite closer to machine exectition, the vitual machine workslike a software simulated machine model, and that may be more or less low-level. On the very lowest end,there are complete virtualization, where a whole operating system is simulated (often running multipleinstances of operating system “on the cloud”). In that case, one can generate native code.

As mentioned, we will discuss 3AIC and p-code. P-code may better be called one-address-code. A goodcriterion for different ICs is the format of the instructions, a better criterion at any rate than the “size”of the op-code (“byte”) or the fact that it’s portable (p-code). By format one mainly refers to how manyarguments (most of) the instructions take. One, two, three, there is even zero-address code. So, thatkind of format is one dimension for classification of intermediate code. Another dimension is what kindof addressing modes are supported. That has to do (often) with the use if registers. Not all intermediatecodes work with the concept of registers, for instance, in this lecture, the two formats are independent fromregisters, and we also don’t go into details here of indirect addressing and similar, which are often used inconnection with registers, but can also be understood independently.

As far as the different formats go: formats like 3AC and 2AC are common for nowaday’s HW. Thatmeans, that 3AIC is a viable format (resembling current HW). 1-address code and 0-address code is notreally found as HW design, but still a viable format for intermediate code. Especially for intermediatecode run on a virtual machine. One example is JVM and Java byte code. However, historically, thereare machine designs based on such idea. One very early was the British KDF9 computer, which used azero-address format and, more widely known, some designs from the Burroughs company (like the veryunique B5000). A programming language, which gives a feeling of stack-machine programming is Forth(there is a linux/gnu version of it (gforth)). Forth, in a way, lives on in the form of the well-knownPostscript language (run on printers), at least postscript is said to be inspired by Forth.

Remarks

• https://www.iare.ac.in/sites/default/files/PPT/CO%20Lecture%20Notes.pdf: in-structions formats

Generating code: compilation to machine code

• 3 main forms or variations:1. machine code in textual assembly format (assembler can “compile” it to 2. and 3.)2. relocatable format (further processed by loader)3. binary machine code (directly executable)

• seen as different representations, but otherwise equivalent• in practice: for portability

– as another intermediate code: “platform independent” abstract machine code possible.– capture features shared roughly by many platforms

∗ e.g. there are stack frames, static links, and push and pop, but exact layout of the framesis platform dependent

– platform dependent details:∗ platform dependent code∗ filling in call-sequence / linking conventions

done in a last step

Page 328: CourseScript - uio.no

9 Intermediate code generation9.2 Intermediate code 325

Byte code generation

• semi-compiled well-defined format• platform-independent• further away from any HW, quite more high-level• for example: Java byte code (or CIL for .NET and C])

– can be interpreted, but often compiled further to machine code (“just-in-time compiler” JIT)• executed (interpreted) on a “virtual machine” (JVM)• often: stack-oriented execution code (in post-fix format)• also internal intermediate code (in compiled languages) may have stack-oriented format (“P-code”)

9.2 Intermediate code

Use of intermediate code

• two kinds of IC covered1. three-address code (3AC, 3AIC)

– generic (platform-independent) abstract machine code– new names for all intermediate results– can be seen as unbounded pool of maschine registers– advantages (portability, optimization . . . )

2. P-code (“Pascal-code”, cf. Java “byte code”)– originally proposed for interpretation– now often translated before execution (cf. JIT-compilation)– intermediate results in a stack (with postfix operations)

• many variations and elaborations for both kinds– addresses represented symbolically or as numbers (or both)– granularity/“instruction set”/level of abstraction: high-level op’s available e.g., for array-access

or: translation in more elementary op’s needed.– operands (still) typed or not– . . .

Various translations in the lecture

Text

• AST here: tree structure after semantic analysis, let’s call it AST+ or just simply AST.• translation AST ⇒ P-code: appox. as in Oblig 2• we touch upon general problems/techniques in “translations”• one (important) aspect ignored for now: register allocation

Page 329: CourseScript - uio.no

326 9 Intermediate code generation9.3 Three address (intermediate) code

Picture

AST+

TAIC p-code

9.3 Three address (intermediate) code

Three-address code

• common (form of) IR

TA: Basic format

x = y op z

• x, y, z: names, constants, temporaries . . .• some operations need fewer arguments

• example of a (common) linear IR• linear IR: ops include control-flow instructions (like jumps)• alternative linear IRs (on a similar level of abstraction): 1-address (or even 0) code (stack-machine

code), 2 address code• well-suited for optimizations• modern architectures often have 3-address code like instruction sets (RISC-architectures)

3AC example (expression)

2*a+(b-3)

+

*

2 a

-

b 3

Page 330: CourseScript - uio.no

9 Intermediate code generation9.3 Three address (intermediate) code 327

Three-address code

t1 = 2 ∗ at2 = b − 3t3 = t1 − t2

alternative sequence

t1 = b − 3t2 = 2 ∗ at3 = t2 − t1

We encountered the notion of temporaries already in connection with the activation records. There, theactivation records for some function needs space for various things, like parameters, local variables, returnaddresses etc., but also for intermediate results. That’s the temporary variables of the intermediate codeor temporaries for short, which we talk about here. The slide shows two versions that do the same thing.This is not a very deep difference between the two versions. It captures that the fact order of evaluationdoes not matter. For the people that like to split hairs: it does not matter under the assumpion that thereare no “exceptions”, for instance that 2 * a does not lead to a numerical overflow. If additionally a andb refer to the same content, then it could be that the first code faults, whereas the second version maycalculate properly (since a = b is decreased first before the multiplication.

In our code examples, though, the convention is: different variable names mean different memory loca-tions, so by writing a and b, there is no aliasing. Of course, if the 3AIC uses references (resp. indirectaddressing), then different variable names don’t guarantee absence of aliasing. A related remark concernsthe temporaries. The example uses three different ones t1, t2, and t3. Using different names for the tem-porary indicate that they are all different. However, that may look like a waste of memory: One couldhave “optimized” it by perhaps avoiding t3 and reuse t2 or t3. One could indeed, but the code generationat the current stage does not try to cut down on the use of temporaries. For each intermediate result, ituses just a new, fresh temporary. It is the task of later stages, to do something about it, like minimizingthe number of temporaries (and put as many of them into registers). However, the amount of registersis typically only known at the platform-dependent stage. Most intermediate code formats (like ours) areunaware of registers or, in other words, assume a (abstract) machine model without registers.

Using a fresh temporary each time we need one means, each temporary is assigned-to only once (at leastif we ignore loops). That restriction is sometimes called static single assignment. Static means, there isonly one line in the code (“statically”) where a variable is assigned to. Dynamically, because of loops orsubroutines, a variable may be assigned to more than once. Note that that SSA restriction applies totemporaries only, user-level variables may be assigned to multiple times.

There is also the possibility, to make also the standard variables to follow the SSA regime. This is popularand has advantages concerning subsequent semantic analyses and optimization. In its generality, SSA abit more complex than just using new variables all the time. Therefore we won’t go into that.

The terminology of pseudo instruction comes from the fact that there is no real instruction connected toit. It’s just a way to refer to the corresponding line number a bit more abstractly. So, in a similar way thattemporaries are a representation of abstraction at the current of memory locations (ultimately addresses inmain memory if registers cannot be used), labels are an representation of addresses, ultimately translatedto relocatable addresses and ultimately to addresses in the code segment.

3AIC instruction set

• basic format: x = y op z• but also:

– x = op z– x = y

• operators: +,-,*,/, <, >, and, or• readx, writex• labelL (sometimes called a “pseudo-instruction”)• conditional jumps: if_false x goto L

Page 331: CourseScript - uio.no

328 9 Intermediate code generation9.3 Three address (intermediate) code

• t1, t2, t3 . . . . (or t1, t2, t3, . . . ): temporaries (or temporary variables)– assumed: unbounded reservoir of those– note: “non-destructive” assignments (single-assignment)

Illustration: translation to 3AIC

Source

read x ; { input an integer }i f 0<x then

f a c t := 1 ;repeat

f a c t := f a c t ∗ x ;x := x −1

until x = 0 ;w r i t e f a c t { output :

f a c t o r i a l of x }end

Target: 3AIC

read xt1 = x > 0i f _ f a l s e t1 goto L1f a c t = 1label L2t2 = f a c t ∗ xf a c t = t2t3 = x − 1x = t3t4 = x == 0i f _ f a l s e t4 goto L2write f a c tlabel L1halt

Variations in the design of TA-code

• provide operators for int, long, float . . . .?• how to represent program variables

– names/symbols– pointers to the declaration in the symbol table?– (abstract) machine address?

• how to store/represent TA instructions?– quadruples: 3 “addresses” + the op– triple possible (if target-address (left-hand side) is always a new temporary)

Quadruple-representation for 3AIC (in C)

typedef enum {rd , gr , i f _ f , asn , lab , mul ,sub , eq , wri , halt , . . . } OpKind ;

typedef enum {Empty , IntConst , S t r i n g } AddrKind ;

typedef struct {AddrKind kind ;union {

int v a l ;char ∗ name ;

} c o n t e n t s ;} Address ;

typedef struct {OpKind op ;Address addr1 , addr2 , addr3 ;

} Quad

Page 332: CourseScript - uio.no

9 Intermediate code generation9.4 P-code 329

A 3A(I)C has three addresses and one piece of information to specify the instruction itself. That makes4 pieces of information, a quadruple. The code illustrate how one could represent it in C. It wouldlook analogous to some extent in other languages. As a reminder of the typing section: we see how therepresentation uses the (not-so-type-safe) union type of C, to squeeze a few bits. We also see the use ofso-called enum type for finite enumerations.

The code is meant as illustration of how it can be done, but it depends obviously on details of thespecification of the intermediate code and the supported types (here called kinds in the code).

9.4 P-code

As mentioned, one of the two formats covered in the lection could be called p-code. We also said thatthe terminolgy is not so informative. Perhaps a better name would be one-address code. There is evenzero-address code (which works similarly), but we don’t cover it. Both one-address code and zero-addresscode have in common that they rely heavily on stack-manipulations. Very roughly, where 3AIC usestemporaries to store intermediate results, p-code stores those on the stack. We will see details for bothlater, when we look how to compile to either intermediate code format.

So we cover 3AIC and “1AIC” (p-code), there is also 2AC / 2AIC, which we will not cover, at least notin this chapter. For the real code generation, we may have a look at the problem: how to generate 2ACfrom 3AIC, in particular how to deal with registers (assuming a 2AC hardware platform)

P-code

• different common intermediate code / IR• aka “one-address code”3 or stack-machine code• used prominently for Pascal• remember: post-fix printing of syntax trees (for expressions) and “reverse polish notation”

P-code is an abbreviation for portable code. Some people also connect it to Pascal (like p stands forPascal). Many Pascal compilers were based in p-code for reasons of portability. Pascal was influentialsome time ago, especially for computer science curricula. The so-called p-code machine was not inventedfor Pascal or by the Pascal-people, but perhaps Pascal was the most prominent language “run” on a p-codearchitecture. So, in a way, p-code was some LLVM of the 70ies. . .

Example: expression evaluation 2*a+(b-3)

ldc 2 ; load c o n s t a n t 2lod a ; load value of v a r i a b l e ampi ; i n t e g e r m u l t i p l i c a t i o nlod b ; load value of v a r i a b l e bldc 3 ; load c o n s t a n t 3sbi ; i n t e g e r s u b s t r a c t i o nadi ; i n t e g e r a d d i t i o n

The code should be clear enough (with the help of the commentaries on the right-hand column). This firstexample is concern with expression evaluation, i.e., without side effects. Those work in the mentioned “post-fix” manner. The expression is built-up from binary operators. Those work in a stack-like virtual machineas follow: both arguments have to be on top of the stack, then executing the opcode corresponding to thebinary operators takes those top to elements and removes them them from the stack (“pop”), connectsthem as argments of the operation, and the result is the the new top of the stack (“push”).

3There’s also two-address codes, but those have fallen more or less in disuse.

Page 333: CourseScript - uio.no

330 9 Intermediate code generation9.4 P-code

That pattern can be seen clearly in the code 3 times (there are three operators to be translated, addition,multiplication, and substraction). Constants and variables are pushed onto the stack by correspondingload-commands (ldo and ldc).

Loading the content of a variable with ldo, as shown in this example, is only one way to to “load avariable”, namely loading its content. There is a second way, namely loading the address of a variable.That is not needed for evaluating expression, and therefore not part of this example. The next slidetranslates an assignment to 3AIC. In the example, we see both version of the load-command.

P-code for assignments: x := y + 1

• assignments:– variables left and right: L-values and R-values– cf. also the values ↔ references/addresses/pointers

lda x ; load a d d r e s s of xlod y ; load value of yldc 1 ; load c o n s t a n t 1adi ; addsto ; s t o r e top to a d d r e s s

; below top & pop both

The message of this example concerns the treatment of variables, in particular the fact that variables onthe left-hand side of an assignment are treated differently from those on the right-hand side. For theprogrammar (in an imperative language), the distinction may not always be too visible. Of course, one isaware that in an assignement, like the one shown in the code, the variable on the left hand side is assignedto, the variable on the right-hand side is read from. Everyone knows that. We write := for assignments,to make the distinction more visible. In languages like C and Java, that is not visible, one writes = forassignment, but it’s not equality: it’s not symmetric in that a=b is not the same b=a, when = is meant asassignment.

In the generated code, we see another (related) difference, which may be less obvious. For x, the addressis loaded as part of a step, for y it’s the content. We need the address of x to store back the result at theend of the generated code.

We mentioned that the stack-machine architecture leads to a post-fix treatment of evaluation. That is trueas long as one interprets “evaluation” as determining, in a side-effect free manner the value of expression(like in the previous example). Now, in this example, there are side-effects and the strict post-fix schemadoes not work any longer: the first thing to do is load the address of x with lda, i.e., that’s not “post-fix”,that is “pre-fix” treatment.

Finally a comment to the last opcode sto: it takes arguments (on the stack), and stores, in the example,the result of the computation to the given address (which here is the address of x). Additionally, bothtop elements are popped off the stack. Consequently, the value as the result of the commputation onthe right-hand side is no longer available. So, this translation does not correspond to the semantics ofassignments in languages like C and Java. There, things like (x := y +1) + 5 are allowed, but for acompilation of a languages with this kind of semantics, the sto command, popping off both elements, isnot the best choice. We see below an alternative operation, stn, which abbreviates store non-destructively,which would be adequate if one had a semantics as in Java or C.

P-code of the faculty function

Source

read x ; { input an integer }i f 0<x then

f a c t := 1 ;repeat

f a c t := f a c t ∗ x ;x := x −1

until x = 0 ;w r i t e f a c t { output :

Page 334: CourseScript - uio.no

9 Intermediate code generation9.5 Generating P-code 331

f a c t o r i a l of x }end

P-code

9.5 Generating P-code

After having introduce the concept of p-code, including (relevant parts of) the instruction set, we have alook at code generation. Actually, it’s not very hard. We have a look at that problem from different angles:we make use of attribute grammars, look at some C-code implementation, and sketch also some code ina functional language. All three angles are basically equivalent. The focus here is on straight-line code.In other words, control-flow constructs (like conditionals and loops) are not covered right now. Those aretranslated making use of (conditional) jumps and labels. We will deal with those aspects later.

Expression grammar

Grammar

exp1 → id := exp2exp → aexp

aexp → aexp2 + factoraexp → factor

factor → ( exp )factor → numfactor → id

Page 335: CourseScript - uio.no

332 9 Intermediate code generation9.5 Generating P-code

(x:=x+3)+4

+

x:=

+

x 3

4

As mentioned, the grammar covers only expression and assignments, i.e., straight-line code, but no control-structures.

As a side remark: we said that the intermediate code generation takes typically abstract syntax. Typicalabstract syntax would not contain paretheses and the distinction between factors and terms etc. is moretypical for grammars covering concrete syntax and parsing. But the question, whether the grammardescribes typcially abstract or concrete syntax, is not too relevant for the principle of the translation here,and after all, one can use concrete syntax as abstract syntax trees, even if it often better design to makethe AST a bit more abstract. Anyway, we don’t bother to show the parentheses in the tree.

Generating p-code with A-grammars

• goal: p-code as attribute of the grammar symbols/nodes of the syntax trees• syntax-directed translation• technical task: turn the syntax tree into a linear IR (here P-code)⇒ – “linearization” of the syntactic tree structure

– while translating the nodes of the tree (the syntactical sub-expressions) one-by-one

• not recommended at any rate (for modern/reasonably complex language): code generation whileparsing4

The use of A-grammars is perhaps more a conceptual picture, In practice, one may not use a-grammarsand corresponding tools in the implementation. Remember that in many situations, the AST in a compileris a “just” a data structure programmed inside the chosen meta-language. For instance, in the compila lan-guage, most will have chosen a Java implementation making use of different abstract and concrete classes,perhaps making a visitor pattern and what not. Anyway, it’s not in a format directly represented to behandled by an attribute-grammar tool (though also that is possible). Anyway, realizing the semantic ruleswe show in a-grammar format in a programming language format, operating on the AST tree data struc-ture is not complex. In particular, since the attribute grammar is of a particularly simple format: it’s usesa synthesized attribute only (which is the simplest format). It works bottom-up or in a divide-and-conqueror compositinal manner: the code of a compound statement consist of compiling the substatements andconnecting the resulting translated code, with some additional commands. For expressions, the additionalinstructions are done at the end (“post-fix”), in more general situations, one encounters also pre-fix code(and sometimes even infix).

That captures the principle core of compilation, it better be compositional: to compile a large programmeans, to break it down into pieces, compile smaller pieces and the put the compiled pieces together forthe overall result.

The principle of compositionality or divide-and-conquer is perhaps so typical or natural for compilationin general, to appear as not even worth mentioning. That maybe so, but the principle applies only whenignoring optimization. Optimization breaks with the principle of compositionality, mostly. Taking two“optimized” pieces of generated code together in a divide-and-conquer manner will typically not result

4one can use the a-grammar formalism also to describe the treatment of ASTs, not concrete syntaxtrees/parse trees.

Page 336: CourseScript - uio.no

9 Intermediate code generation9.5 Generating P-code 333

in an optimized overall piece of code. Optimization is done more “globally”, not compositional wrt. thesyntax structure of the program. That is plausible, because optimization tries to improve the code withoutchanging it’s semantics. The improvement may refer to the execution time or memory consumption (oreven the size of the code itself, which itself is not a semantic criterion, but the optimization must preservethe semantics, of course). The remarks here about compositionality of code generation and the non-compositionality of analysis and optimization is not particular for p-code generation. The same applies to3AIC generation and actually to compilation in general. The compilation part is typically compositionaland therefore efficient. Analysis and optimization(s) are done afterwards and depending on how muchone invests afterwards in analysing the result and how aggressive the optimizations are, that part may nolonger be efficient. By efficient I basically mean: linear (or at least polynomial) in the size of the inputprogram.

When saying, analysis and optimization is not compositional (unlike code generation), that probably shouldbe understood as a qualified, not absolute statement. It’s mostly not possible to invest in an absolutelyglobal analysis, it would be too costly. It may be “compositional” in respecting the user-level syntax in thatit does analyses each procedure individually, but tries not to make a global optimization across procedurebody boundaries. Or even simpler, the optimization focuses on stretches of straight-line code. For instance,if one translates a conditional, there will be in the translation some jumps and labels, but those mark theboundaries of the optimization. In a way, the two branches of a conditional are optimized independently,in that sense the optmization is composition as far as the user-level syntax is concerned, and one doesnot attempt to see if additional gains could be achieve to analyze both branches “globally”. These issues—analysis, optimization, and various levels of “globality” for that— will be relevant in the next chapter,where we discuss the ultimate code generation, not intermediate code generation.

A-grammar for statements/expressions

• focus here on expressions/assignments: leaving out certain complications• in particular: control-flow complications

– two-armed conditionals– loops, etc.

• also: code-generation “intra-procedural” only, rest is filled in as call-sequences• A-grammar for intermediate code-gen:

– rather simple and straightforwad– only 1 synthesized attribute: pcode

As mentioned, the code generated here is for straight-line code only and relatively simply, as can be seenon the a-grammar on the next slide.

A-grammar

• “string” concatenation: ++ (construct separate instructions) and ˆ (construct one instruction)

productions/grammar rules semantic rulesexp1 → id = exp2 exp1 .pcode = ”lda”ˆid.strval ++

exp2 .pcode ++ ”stn”exp → aexp exp .pcode = aexp .pcode

aexp1 → aexp2 + factor aexp1 .pcode = aexp2 .pcode++ factor .pcode++ ”adi”

aexp → factor aexp .pcode = factor .pcodefactor → ( exp ) factor .pcode = exp .pcodefactor → num factor .pcode = ”ldc”ˆnum.strvalfactor → id factor .pcode = ”lod”ˆnum.strval

Page 337: CourseScript - uio.no

334 9 Intermediate code generation9.5 Generating P-code

The op-codes are marked in red. The generation is rather simple: it’s purely synthesized (which is arguablythe simplest form of AGs). It works purely bottom, divide and conquer. We are dealing with expressionsonly, and the code generation works similarly as the evaluation of expression (which works bottom-up).However, on the next slide we see, that it code generation works also when dealing with assignment(something that does not work any more when trying to do evaluation).

As discussed in the previous subsection, we see also the difference between l-values and r-values (lda andlod).

Linearization

Let’s address another small point here. As mentioned, we are dealing with a linear IR: like 3AIC andother formats, p-code is a linear IR. It is a language consisting of a linear sequence of simple commands(and uses jumps and labels for control, even though those parts are currently not in the focus). The taskof code generation (if one assume that one deals with control-structures as well) it to translate the non-linear tree structure into a linear one (justing jumps and labels). So, that may be called “linearization”.Since currently we don’t focus on the control-structures, the task is to translate an already linear language(“straight-line code”) to another linear arrangement, the linear P-code. We do so in the AG, assumingoperations like ˆ and ++ . The respesent appending an element to a list resp. concatenating two lists.However, strictly speaking ++ $ is a binary operation. We wrote in the semantic rules of the AG thingslike l1 ++ l2 ++ l3. We did not say how to “think” of that (like to parse it mentally). Is that left or rightassociative? Or do we mean that the reader understands that it does not really matter, as list concatenationis associative and we mean the resulting overall list, obviously. Sure, it should be clear. Note also, that++ is understood as separating two pieces of code from each other (one can think “newline” in codeexamples). Later, we show an implementation in a functional language, we use the cosntructor Seq forthat (for sequential composition). However, we don’t implement that as contatenation of list but as asimple cosntructor. Consequently, the result of that translation (which corrresponds to the AG here) isnot technically linear, it’s still a tree (of a simple structure). Therefore, in a last steps, one needs to flattenout the tree to a ultimate linear list. Why does one do so? Well, it may be more efficient that way:concatenating lists “on the fly” is typically not a tail-recursive procedure and thus not altogether cheap.So one may be better off by first doing another tree-like struction, flattened out afterward. It’s a commontechnique. And furtherore, if we would right now also consider conditionals and loops, etc. it’s harder tofind the ultimate linear sequence of commands while processing then abstract syntax. Also for that reason,one might be better off to first generate pieces of the code that are afterwards glued together in a lineararrangement.

But apart from those fine points, the implementation later reflects pretty truthfully the AG here.

(x := x + 3) + 4

Attributed tree

+

x:=

+

x 3

4

result

lod x ldc 3

lod xldc 3adi

ldc 4

lda xlod xldc 3adi 3stn

Page 338: CourseScript - uio.no

9 Intermediate code generation9.5 Generating P-code 335

“result” attr.

lda xlod xldc 3adistnldc 4adi ; +

• note: here x=x+3 has side effect and “return” value (as in C . . . ):• stn (“store non-destructively”)

– similar to sto , but non-destructive1. take top element, store it at address represented by 2nd top2. discard address, but not the top-value

The issue of the semantics of an assignment has been mentioned earlier: does it give back a result or not.Before code was generated under the assumption no value is “returned”. Here, we interpret it different, inaccordance with languages like C or Java. There, we have to use the command stn instead of sto frombefore.

Implementation in a functional language

The following slides show how the intermediate code generation resp. the AG can be implemented straight-forwardly in a functional language. Later, we will see also how the code looks in C, which is also straight-forward. Though I believe the functional code is more concise.

We start defining the two syntaxes of the two language, the source code and the target code. There aremore or less one-to-one transscripts of the grammars we have seen.

Overview: p-code data structures

Source

type symbol = s t r i n g

type expr =| Var of symbol| Num of i n t| Plus of expr ∗ expr| Assign of symbol ∗ expr

Target

type i n s t r = (∗ p−code i n s t r u c t i o n s ∗)LDC of i n t

| LOD of symbol| LDA of symbol| ADI| STN| STO

type t r e e = Onel ine of i n s t r| Seq of t r e e ∗ t r e e

type program = i n s t r l i s t

• symbols:– here: strings for simplicity– concretely, symbol table may be involved, or variable names already resolved in addresses etc.

Page 339: CourseScript - uio.no

336 9 Intermediate code generation9.5 Generating P-code

In the target syntax, there are two “stages”: a program is a linear list of instructions, but there is alsothe notion of “tree”: the leaves of the trees are “one-line” instructions and trees can be combined usingsequential composition. Consequently, the translation (on the next slide) will also have 2 stages: the firstone (which is the interesting one) generates a tree, and the second one flattens out the tree or “combs it”into a list.

Two-stage translation

v a l to_tree : A s t e x p r a s s i g n . expr −> Pcode . t r e e

v a l l i n e a r i z e : Pcode . t r e e −> Pcode . program

v a l to_program : A s t e x p r a s s i g n . expr −> Pcode . program

l e t rec to_tree ( e : expr ) =match e with| Var s −> ( Onel ine (LOD s ) )| Num n −> ( Onel ine (LDC n ) )| Plus ( e1 , e2 ) −>

Seq ( to_tree e1 ,Seq ( to_tree e2 , Onel ine ADI) )

| Assign ( x , e ) −>Seq ( Onel ine (LDA x ) ,

Seq ( to_tree e , Onel ine STN) )

l e t rec l i n e a r i z e ( t : t r e e ) : program =match t with

Onel ine i −> [ i ]| Seq ( t1 , t2 ) −> ( l i n e a r i z e t1 ) @ ( l i n e a r i z e t2 ) ; ; // l i s t concat

l e t to_program e = l i n e a r i z e ( to_tree e ) ; ;

The code makes more visible, that operations like ++ used in the AG are binary, the AG generates atree rather then a sequence. Nonetheless, flattening out the tree in a second step (linearize) is child’splay. As mentioned earlier, in connection with that AG: it would be straightforward not to have these 2stages: instead of using Seq for doing the trees first, one could use directly list-append. Appending listsin functional languages is not tail-recursive and one may be better off, efficiency-wise, to split it into twostages as shown.

Next we do the same implementation in C. We start by showing a possible way to represent ASTs. We haveseens similar representations in earlier chapters. We have also seen ways to represent such trees in Javawhere we operated with concrete classes as beeing subclasses of abstract classes. Here, the data structureuses enumeration types and structs.

Source language AST data in C

• remember though: there are more dignified ways to design ASTs . . .

Page 340: CourseScript - uio.no

9 Intermediate code generation9.5 Generating P-code 337

Code-generation via tree traversal (schematic)

procedure genCode(T: t r e e n o d e )begin

i f T $\not = $ n i lthen

`` g e n e r a t e code to prepare for code for l e f t c h i l d ' ' // p r e f i xgenCode ( l e f t c h i l d of T ) ; // p r e f i x ops`` g e n e r a t e code to prepare for code for r i g h t c h i l d ' ' // i n f i x

genCode ( r i g h t c h i l d of T ) ; // i n f i x ops`` g e n e r a t e code to implement a c t i o n ( s ) for T' ' // p o s t f i x

end ;

This sketch of a code skeleton basically says: the code generation is a recursive procedure, and it involvesprefix-actions, post-fix actions and maybe even infix-actions. By actions I mean generating or emitingp-code commands. Looking at the functional code we can see that there was no code generated in infix-position, so we can expect to see no such thing in the C-code as well. The sketched skeleton just is justgeneral, there may be other situations more complex that the ASTs covered here that would call for infixcode. We, at least don’t make use of it.

Code generation from AST+

• main “challenge”: linearization• here: relatively simple• no control-flow constructs• linearization here (see a-grammar):

– string of p-code– not necessarily the ultimate choice (p-code might still need translation to “real” executable

code)

preamble code

calc. of operand 1

fix/adapt/prepare ...

calc. of operand 2

execute operation

Code generation

The code generation works in principle the same as in the functional implementation (and the AG), ofcourse. In the functional implementation from before, we have choosen not to emit strings already. Insteadwe have chosen to construct an element of a data structure representing the instructions of the p-code (wecalled the type instr). Given the fact that we are not yet at the “real” code level, but at an intermediatestage, generating a data structure is more realistic and better than generating a string. A string wouldhave to be parsed again etc., and operating on strings is always more error prone (typos) than operatingon constructors of a data structure.

Not that reparsing strings would be hard. Also for debugging reasons a compiler could have the option toemit a “pretty-printed” version of the intermediate code (or some other external exchange format), but awell-designed internal representation is, for various reasons, the more dignified and realistic way of handingthings over to the next stage.

Page 341: CourseScript - uio.no

338 9 Intermediate code generation9.6 Generation of three address code

9.6 Generation of three address code

This section does the analogous thing we have done for p-code (one-address code).

3AC manual translation again

Source

read x ; { input an integer }i f 0<x then

f a c t := 1 ;repeat

f a c t := f a c t ∗ x ;x := x −1

until x = 0 ;w r i t e f a c t { output :

f a c t o r i a l of x }end

Target: 3AC

Page 342: CourseScript - uio.no

9 Intermediate code generation9.6 Generation of three address code 339

read xt1 = x > 0i f _ f a l s e t1 goto L1f a c t = 1label L2t2 = f a c t ∗ xf a c t = t2t3 = x − 1x = t3t4 = x == 0i f _ f a l s e t4 goto L2write f a c tlabel L1halt

In this section, as we did for the p-code, we focus on straight-line code, though the example shows alsohow conditionals and loops are treated (which we cover later). As far as the treatment for the latterconstructs is concerned, the p-code generation and the 3AIC code generation works analogously anyway.In the translated target code for the faculty, we see also here labelling commands (pseudo-instructions)and (conditional) jumps, as in the target code when translated to p-code.

Implementation in a functional language

We do the same as for the p-code and show how to realize the code generation in some functional language(ocaml). The source language, expressions in the abstract syntax tree and assignments, are unchanged(the abstract grammar was shown on page 331). In the following, we start by repeat the data structure forthe source language (which is unchanged) and showing the data structures for the target language similarwhat we did for the p-code. The data structure can be seen as “abstract syntax” for the 3AIC. One canalso see: the 3AIC data structure covers more than we (currently) actually need. There is branching andlabels. There is also something that deals with using arrays in assignment. More complex data structureslike array accesses and indexed access will be coverered later as well, but not right now. page

Three-address code data structures (some)

Data structures (source)

type symbol = s t r i n g

type expr =| Var of symbol| Num of i n t| Plus of expr ∗ expr| Assign of symbol ∗ expr

Data structures (target)

type mem =Var of symbol

| Temp of symbol| Addr of symbol (∗ &x ∗)

type operand = Const of i n t| Mem of mem

type cond = Bool of operand| Not of operand| Eq of operand ∗ operand| Leq of operand ∗ operand| Le of operand ∗ operand

type rhs = Plus of operand ∗ operand| Times of operand ∗ operand| Id of operand

type i n s t r =Read of symbol

| Write of symbol

Page 343: CourseScript - uio.no

340 9 Intermediate code generation9.6 Generation of three address code

| Lab of symbol (∗ pseudo i n s t r u c t i o n ∗)| Assign of symbol ∗ rhs| AssignRI of operand ∗ operand ∗ operand (∗ a := b [ i ] ∗)| AssignLI of operand ∗ operand ∗ operand (∗ a [ i ] := b ∗)| BranchComp of cond ∗ l a b e l| Halt| Nop

type t r e e = Onel ine of i n s t r| Seq of t r e e ∗ t r e e

type program = i n s t r l i s t

• symbols: again strings for simplicity• again “trees” not really needed (for simple language without more challenging control flow)

The data structure for the target language does the same two layers we used for the p-code. One “tree”representation that connects single-line instructions using Seq, and a linear list of instructions as the finalrepresentation.

Translation to three-address code

l e t rec to_tree ( e : expr ) : t r e e ∗ temp =match e with

Var s −> ( Onel ine Nop , s )| Num i −> ( Onel ine Nop , s t r i n g _ o f _ i n t i )| Ast . Plus ( e1 , e2 ) −>

(match ( to_tree e1 , to_tree e2 ) with( ( c1 , t1 ) , ( c2 , t2 ) ) −>

l e t t = newtemp ( ) in( Seq ( Seq ( c1 , c2 ) ,

Onel ine (Assign ( t ,

Plus (Mem(Temp( t1 ) ) ,Mem(Temp( t2 ) ) ) ) ) ) ,t ) )

| Ast . Assign ( s ' , e ' ) −>l e t ( c , t2 ) = to_tree ( e ' )in ( Seq ( c ,

Onel ine ( Assign ( s ' ,Id (Mem(Temp( t2 ) ) ) ) ) ) ,

t2 )

For the code generation, we focus on the translation of the part we are currently interested in, assignmentsand expressions, leaving out the other complications. We see the generation of new temporaries using afunction newtemp. The implementation is not shown, but is easy enough (simply using a counter thatgenerates a new number at each invokation and returning a correspinding temporary). Strictly speaking,such a counter is not purely functional. That’s not a problem, must functional languages are not purelydeclarative, and one can implement such a generating function and other imperative things. Later, we lookat a corresponding AG. Normally, an attribute grammar (as a theoretical construct) is purely declarativeor functional, which means no side-effect. Still, we will allow ourselves in the AG a function like newtempfor convenience.

In principle, one could do a fully functional representation (here in the code as well as in the AG later),simply adding an additional argument, for instance a integer counter that is appropriately handed over.That does not add to the clarity to the code, so a generator like newtemp is more concise, it would seem.

An interesting aspect of the code generator is it’s type, resp. it’s return type. It returns, obviously, 3AIC,more precisely a “tree” of 3AIC instructions. However, it also returns an element of type temp. Thisone is needed, because in order to generate code for compound statements, one needs to know where tofind the results of the translation of the sub-expressions. That can be seen, for instance, in the case foraddition.

The two recursive calls on the subexpressions of the addition give back a tuple each, i.e., one has two pairsof information; see the correponding match-expression in the code. The resulting code is constructed astrees, and the result is given back in temporaries t1 and t2 (or t1 and t2 in the code). Then the last 3AICline generated in the addition-case is t := t1 + t2, where t is a new temporary, and the function return thepair of the code together with this freshly generated t.

Page 344: CourseScript - uio.no

9 Intermediate code generation9.6 Generation of three address code 341

Three-address code by synthesized attributes

• similar to the representation for p-code• again: purely synthesized• semantics of executing expressions/assignments5

– side-effect plus also– value

• two attributes (before: only 1)– tacode: instructions (as before, as string), potentially empty– name: “name” of variable or tempary, where result resides6

• evaluation of expressions: left-to-right (as before)

A-grammar

productions/grammar rules semantic rulesexp1 → id = exp2 exp1 .name = exp2 .name

exp1 .tacode = exp2 .tacode ++id.strvalˆ”=”ˆ exp2 .name

exp → aexp exp .name = aexp .nameexp .tacode = aexp .tacode

aexp1 → aexp2 + factor aexp1 .name = newtemp()aexp1 .tacode = aexp2 .tacode ++ factor .tacode ++

aexp1 .nameˆ”=”ˆ aexp2 .nameˆ”+”ˆ factor .name

aexp → factor aexp .name = factor .nameaexp .tacode = factor .tacode

factor → ( exp ) factor .name = exp .namefactor .tacode = exp .tacode

factor → num factor .name = num.strvalfactor .tacode = ””

factor → id factor .name = num.strvalfactor .tacode = ””

As mentioned, we allow ourselves here a function newtemp() to generate a new temporary in the case ofaddition, even if, super-strictly speaking, that’s not covered by AGs which are introduced as declarative,side-effect free formalism. But doing it purely functional (which is possible) would not add to understandhow 3AIC is generated.

Another sketch of TA-code generation

switch kind {case OpKind :

switch op {case Plus : {

tempname = new temorary name ;varname_1 = r e c u r s i v e c a l l on l e f t subt ree ;varname_2 = r e c u r s i v e c a l l on r i g h t subt ree ;emit ( " tempname = varname_1 + varname_2 " ) ;return ( tempname ) ; }

5That’s one possibility of a semantics of assignments (C, Java).6In the p-code, the result of evaluating expression (also assignments) ends up in the stack (at the top).Thus, one does not need to capture it in an attribute.

Page 345: CourseScript - uio.no

342 9 Intermediate code generation9.6 Generation of three address code

case Assign : {varname = id . for va r i ab l e on l h s ( in the node ) ;varname 1 = r e cu r s i v e c a l l in l e f t subt ree ;emit ( " varname = opname" ) ;return ( varname ) ; }

}case ConstKind ; { return ( constant−s t r i n g ) ; } // emit nothingcase IdKind : { return ( i d e n t i f i e r ) ; } // emit nothing

}

• “return” of the two attributes– name of the variable (a temporary): officially returned– the code: via emit

• note: postfix emission only (in the shown cases)

Generating code as AST methods

• possible: add genCode as method to the nodes of the AST• e.g.: define an abstract method String genCodeTA() in the Exp class (or Node, in general all

AST nodes where needed)

St r ing genCodeTA ( ) { St r ing s1 , s2 ; S t r ing t = NewTemp ( ) ;s1 = l e f t .GenCodeTA ( ) ;s2 = r i gh t .GenCodeTA ( ) ;emit ( t + "=" + s1 + op + s2 ) ;return t

}

ASTs are trees, of course, and we have seen how one can realize the AST data structure in object-oriented,class-based languages, like Java etc., and probably most have chosen a corresponding reprentation in oblig1. Of course, recursion over such data structure can be done straightforward, by adding a correspondingmethod. That’s object-orientation “101”: one adds a corresponding method to the classes, whose instancesrepresent different nodes in the trees, and then calls them recursively, as shown in the code sketch.

Whether it is a good design from the perspective of modular compiler architecture and code maintenance,to clutter the AST with methods for code generation and god knows what else, e.g. type checking, prettyprinting, optimization . . . , is a different question.

A better design, many would posit, is in this situation to separate the functionality from the tree structure,i.e., to separate the “algorithm” from the “data structure”, not embedd the algorithm. Such a separationcan be achieved in Java-like OO languages but a design-pattern called visitor. It allows to iterate overrecurive stuctures “from the outside”. It’s a better design in our context of compilers; it allows to separatedifferent modules from the central data structure and intermediate representation of ASTs (and might beuseful for other intermediate representations as well). Since this is not a lecture about Java or C++ designpatterns, but about (principles of) compilers, so we leave it like at that, especially since the “embeddedsolution” shown on the slide works ok as well. Some groups for oblig 1 (2020, and previous years), however,actually did the effort to realize the print-function as visitor.

Page 346: CourseScript - uio.no

9 Intermediate code generation9.7 Basic: From P-code to 3A-Code and back: static simulation & macro expansion 343

Attributed tree (x:=x+3) + 4

• note: room for optimization

To conclude this section, here the generated code for the example we have seen before, presented asattributes from the AG.

9.7 Basic: From P-code to 3A-Code and back: static simulation& macro expansion

In this intermezzo we shortly have a look how to translater back and forth between the two differentintermediate code formats, 1-address-code and 3AC. We do that mainly to touch upon two concepts,macro-expansion and static simulation. The first is one rather straightforward, the static simulation is amore complex topic.

Apart from the fact that those mentioned concepts are interesting also in contexts different from the onewhere they are discussing here, one may still ask: why would one want to translate 1AIC to 3AIC andback (beyond using the translations as illustrating some concepts)?

Well, notions of 1AC and 3AC exist also independent from their use as intermediate code. In particular,hardware may offer an instruction set in 3A-format, or at least partly in 3A-format (or 2A-format). 1A-hardware, though, is nowadays non-existant (there had been attemps for that in the past). So, if one hasan intermediate representation like the p-code or 1AIC as presented here, then generating code for a 3AChardware faces problems as discussed here. Final code generation faces additional problems (like platform-dependent optimization, and register allocation, which will not enter the picture here. For the ultimatecode generation, we will probably translated from 3AIC to 2AC machine code, which is not directly coveredin this section here, but anyway, our focus later will be on the register allocation anyway.

“Static simulation”

• illustrated by transforming p-code ⇒ 3AC• restricted setting: straight-line code• cf. also basic blocks (or elementary blocks)

– code without branching or other control-flow complications (jumps/conditional jumps. . . )– often considered as basic building block for static/semantic analyses,– e.g. basic blocks as nodes in control-flow graphs, the “non-semicolon” control flow constructs

result in the edges• terminology: static simulation seems not widely established• cf. abstract interpretation, symbolic execution, etc.

Page 347: CourseScript - uio.no

344 9 Intermediate code generation9.7 Basic: From P-code to 3A-Code and back: static simulation & macro expansion

The term “static simulation” seems like an oxymoron, a contradicton in itself. Simulation sounds likerunning a program, and static means, at compile time, before running a program. And, due to fundamentallimitation (undecidablity of the halting problem), the compiler in general cannot simulate a program (forreasons of analysis or, here specifically, for translating it to a different representation). However, here weare in the quite restricted situation: straight-line code (especially no loops), which means the programterminates anyway, actually, the number of steps it does is known, it’s the number of lines. So it’s a finiteproblem, there are no issues with undecidability. Being finite, one can execute “mentally” one commandafter the other and know what will happen when running the program. That’s what the compiler does forthe translation and one can call it static simulation.

P-code ⇒ 3AIC via “static simulation”

• difference:– p-code operates on the stack– leaves the needed “temporary memory” implicit

• given the (straight-line) p-code:– traverse the code = list of instructions from beginning to end– seen as “simulation”

∗ conceptually at least, but also∗ concretely: the translation can make use of an actual stack

From P-code ⇒ 3AIC: illustration

The slide illustrates the concept on a simple example x := (x+3) + 4 (which we have seen before). Thecode on the top of the left-hand side is the target code, the p-code instructions. the right-hand side showsthe evolution of the abstract p-code machine, when executing the p-code on the left. In particular, thestack as the crucial part is shown in its evolution, not after every single line having been executed, but atcrucial intermediate stages. One such stages is after having done adi, for instance the first such instance.As discussed, the stack machine uses the stack for intermediate results, that’s exactly what happens whenexecuting adi (or similar operations): the operands are popped of the stack, and the intermediate result isstored on the stack (“push”). Without stack, the 3AIC needs to store that intermediate result somewhereelse, and that’s of course a (new) temporary. Note also: the semantics of the abstract syntax is assumedto be that an assignment (like x := x +3 in the example) gives back a value, like on C or Java. That isreflected in the p-code by using stn, the non-destructive storing, as discussed earlier. In the translationto 3AIC, the right-hand side is stored in t1, and that is used in the last line t2 := t1 + 3.

P-code ⇐ 3AIC: macro expansion

• also here: simplification, illustrating the general technique, only• main simplification:

– register allocation

Page 348: CourseScript - uio.no

9 Intermediate code generation9.7 Basic: From P-code to 3A-Code and back: static simulation & macro expansion 345

– but: better done in just another optmization “phase”

The inverse direction of the translation is simpler, at least when doing it in a simple way. It does not needany static simulation of the architecture, i.e., considering the program’s semantic, it can work simply onthe syntactic structure of the input program. It simple expands each line by a corresponding sequence ofp-code instructions. The is illustrated on the basic 3AIC instruction on the next slide and afterwards onthe previous example.

Macro for general 3AIC instruction: a := b + c

lda alod b ; or `` ldc b ' ' i f b i s a constlod c : or `` ldc c ' ' i f c i s a constadisto

Example: P-code ⇐ 3AIC ((x:=x+3)+4)

There are two different p-codes shown, translated in different ways. One indirectly, via the 3AIC, whichis macro-expanded as illustrated. The second p-code is generated directly from the abstract syntax code.Clearly, the directly translated code is quite much shorter (and more efficient). One important factorin that “loss” in the indirect translation is that the macro-expansion is “brainless”. That’s makes theexpansion simple and efficient, but at the price is that the resulting code is not efficient when beingexecuted. We will, in the following at least hint how to do it better. In general, however, generatingefficiently non-efficient (but correct) code that is afterwards optimized is not per se a bad idea. Thatcommon place in many compilers (even if compilers might not compiler back-and-forth 1AIC and 3AIC).Anyway, the “better” translation we will look at improves on one piece of inefficiency (in the example).The 3AIC contains a line x = t1. After that x and t1 contain obviously the same value. The macroexpansion “mindlessly” expands this line, even though one does not need to have two copies of the valuearound. More generally, the translation does not keep track of which values are stored where, it workspurely line-by-line and syntactically. That can be improved, in “static-simulation” style.

In a preview of code generation in the last chapter: similar information, which value is stored where, inparticular in which register and which main-memory address, that style of information tracking will beemployed in that context later as well.

source 3AI-code

t1 = x + 3x = t1t2 = t1 + 4

Direct p-code

lda xlod xldc 3adistnldc 4adi ; +

Page 349: CourseScript - uio.no

346 9 Intermediate code generation9.7 Basic: From P-code to 3A-Code and back: static simulation & macro expansion

P-code via 3A-code by macro exp.

;--- t1 = x + 3
lda t1
lod x
ldc 3
adi
sto
;--- x = t1
lda x
lod t1
sto
;--- t2 = t1 + 4
lda t2
lod t1
ldc 4
adi
sto

cf. indirect 13 instructions vs. direct: 7 instructions

Indirect code gen: source code ⇒ 3AIC ⇒ p-code

• as seen: detour via 3AIC leads to sub-optimal results (code size, also efficiency)
• basic deficiency: too many temporaries, memory traffic etc.
• several possibilities

  – avoid it altogether, of course (but remember JIT in Java)
  – chance for code optimization phase
  – here: more clever “macro expansion” (but sketch only)

the more clever macro expansion: some form of static simulation again

• don’t macro-expand the linear 3AIC

  – brainlessly into another linear structure (p-code), but
  – “statically simulate” it into a more fancy structure (a tree)

“Static simulation” into tree form (sketch)

• more fancy form of “static simulation” of 3AIC
• result: tree labelled with

  – operator, together with
  – variables/temporaries containing the results

Source

t1 = x + 3
x = t1
t2 = t1 + 4


Tree

[Tree figure: root node + (value held in t2), with left child + (value held in x and t1), itself with leaves x and 3, and right leaf 4.]

note: instruction x = t1 from 3AIC: does not lead to more nodes in the tree
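To make the “static simulation” into a tree a bit more concrete, here is a minimal hypothetical C sketch (Node, simAdd, simCopy and the small env map are invented names for this illustration only). Run on the three 3AIC lines of the example, it builds the tree shown above and records at each node which variables/temporaries hold its value; note how the copy x = t1 merely adds a label to an existing node, as remarked.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical tree node: operator (or leaf label) plus the list of
   variables/temporaries that hold the node's value after "simulation" */
typedef struct Node {
    char op;                 /* '+' for additions, 0 for leaves */
    char label[32];          /* leaf name (variable or constant) */
    char holders[64];        /* who currently holds this value   */
    struct Node *left, *right;
} Node;

static struct { char name[32]; Node *node; } env[32];   /* name -> node */
static int envlen = 0;

static Node *lookup(const char *name) {      /* known name or fresh leaf */
    for (int i = 0; i < envlen; i++)
        if (strcmp(env[i].name, name) == 0) return env[i].node;
    Node *leaf = calloc(1, sizeof(Node));
    strcpy(leaf->label, name);
    return leaf;
}

static void bind(const char *name, Node *n) {            /* name -> node */
    strcpy(env[envlen].name, name);
    env[envlen++].node = n;
    if (n->holders[0]) strcat(n->holders, ",");
    strcat(n->holders, name);
}

static void simAdd(const char *res, const char *a, const char *b) {
    Node *n = calloc(1, sizeof(Node));       /* one 3AIC line "res = a + b" */
    n->op = '+'; n->left = lookup(a); n->right = lookup(b);
    bind(res, n);
}

static void simCopy(const char *res, const char *a) {    /* "res = a": no new node */
    bind(res, lookup(a));
}

static void print(Node *n, int indent) {
    printf("%*s%s", indent, "", n->op ? "+" : n->label);
    if (n->holders[0]) printf("   [%s]", n->holders);
    printf("\n");
    if (n->op) { print(n->left, indent + 2); print(n->right, indent + 2); }
}

int main(void) {
    simAdd("t1", "x", "3");     /* t1 = x + 3  */
    simCopy("x", "t1");         /* x  = t1     */
    simAdd("t2", "t1", "4");    /* t2 = t1 + 4 */
    print(lookup("t2"), 0);
    return 0;
}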

P-code generation from the generated tree

Tree from 3AIC

[Same tree as above: root + (t2), left child + (x, t1) with leaves x and 3, right leaf 4.]

Direct code = indirect code

lda x
lod x
ldc 3
adi
stn
ldc 4
adi    ; +

• with the thusly (re-)constructed tree ⇒ p-code generation

  – as before done for the AST
  – remember: code as synthesized attributes

• the “trick”: reconstruct essential syntactic tree structure (via “static simulation”) from the 3AI-code
• cf. the macro-expanded code: additional “memory traffic” (e.g. temp. t1)

Compare: AST (with direct p-code attributes)

[AST figure for (x:=x+3)+4 with p-code as synthesized attributes: leaf x carries “lod x”, leaf 3 carries “ldc 3”, the inner + carries “lod x; ldc 3; adi”, the assignment node x:= carries “lda x; lod x; ldc 3; adi; stn”, leaf 4 carries “ldc 4”, and the root + (the result) carries the full direct p-code.]


9.8 More complex data types

Next we drop one of the simplifications we have done so far, concerning the involved data. We have a look at how to lift the other simplification, the lack of control-flow commands, later. As far as the data is concerned, we have treated only variables (and temporaries) of simple data types, but not compound ones (arrays, records etc.). Also, we have not looked at referenced data (pointers). To deal with that adequately, intermediate languages support additional ways to access data, i.e., additional addressing modes. A taste of that we have seen in the p-code: a variable can be loaded in two different ways, depending on whether the variable is used as l-value or r-value. The two commands are lod and lda, loading the variable’s value resp. the variable’s address.

Status update: code generation

• so far: a number of simplifications
• data types:

  – integer constants only
  – no complex types (arrays, records, references, etc.)

• control flow:

  – only expressions and
  – sequential composition

  ⇒ straight-line code

Address modes and address calculations

• so far:

  – just standard “variables” (l-variables and r-variables) and temporaries, as in x = x + 1
  – variables referred to by their names (symbols)

• but in the end: variables are represented by addresses
• more complex address calculations needed

addressing modes in 3AIC:

• &x: address of x (not for temporaries!)
• *t: indirectly via t

addressing modes in P-code

• ind i: indirect load
• ixa a: indexed address

The concepts underlying the commands here are typically also supported by standard hardware. There may be special registers for indexed access, to make that form of access fast. Indexed access (here in p-code) is an access which has two arguments: the address of some place (in memory) and an offset. That should remind us of the way arrays are laid out in memory (we discussed that earlier). Indeed, HW-supported indexed access is one important reason why arrays are a very efficient data structure. We will illustrate the new constructions on arrays (but also records) in the following.

In the 3AIC, we don’t have indexed addressing; one has C-like addressing, with access to the addresses of variables. The &x operation corresponds to the lda instruction in p-code.

Loading indirectly (in 3AIC and 1AIC) means: do not just load the content of the variable (nor its address), but load the content of the variable (or here the temporary), interpret the loaded value as an address, and then load from there. Similarly when using *t on the left-hand side of a 3AIC assignment.
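For readers more at home in C than in the addressing-mode notation, the same distinction can be seen with ordinary pointers (a minimal standard-C example; the variable names are arbitrary):

#include <stdio.h>

int main(void) {
    int x = 42;
    int *t = &x;     /* t holds an address (cf. t1 := &x, or lda) */
    int v = *t;      /* indirect load: use the content of t as an address (cf. *t, or ind) */
    *t = 7;          /* indirect store: *t as l-value */
    printf("%d %d\n", v, x);    /* prints 42 7 */
    return 0;
}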


Address calculations in 3AIC: x[10] = 2

• notationally represented as in C
• “pointer arithmetic” and address calculation with the available numerical ops

t1 := &x + 10
*t1 := 2

• 3-address-code data structure (e.g., quadruple): extended (adding address mode)

The compilation is straightforward. The code also shows that (at least in our 3AIC) there is no indexed access. The offset, in the example 10, is calculated by 3AIC instructions. It’s a form of “pointer arithmetic”. We will revisit the example in p-code; there, the translation will make use of the indexed-access command ixa.

Address calculations in P-code: x[10] = 2

• tailor-made commands for address calculation

• ixa i: integer scale factor (here factor 1)

lda x
ldc 10
ixa 1
ldc 2
sto


The two introduced commands ixa and ind are “explained” by showing their corresponding representation on the right-hand side of the slides. The two commands correspond to the situations where an array element is read from (ind) resp. where its address is calculated, for reading or writing (ixa). The difference corresponds to the notions of l-values and r-values we have seen before (but not in the context of array accesses). Also on the next slide, we see the difference between the two flavors of array accesses (l- vs. r-value usage).

In the two pictures, the a is mnemonic for a value representing an address. In the code example: the ixa command expects two arguments on the stack (and has, as third argument, the scale factor as part of the command). To make use of the command, we first load the address of x and afterwards the constant 10. Executing the ixa 1 command then does the calculation in the box, which is intended as an address calculation. So the result of that calculation is (intended as) an address again. To that address, the constant 2 is stored (and the values are discarded from the stack: sto is the “destructive” write).

Array references and address calculations

int a[SIZE]; int i, j;
a[i+1] = a[j*2] + 3;

• difference between left-hand use and right-hand use
• arrays: stored sequentially, starting at base address
• offset, calculated with a scale factor (dep. on size/type of elements)
• for example: for a[i+1] (with C-style array implementation)7

a + (i+1) * sizeof(int)

• a here directly stands for the base address
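The offset formula can be checked in C itself; the following minimal example (array size and index chosen arbitrarily) compares the compiler’s own address computation with the explicit calculation:

#include <stdio.h>

int main(void) {
    int a[10];
    int i = 3;
    /* both expressions denote the same address:
       base address plus offset (i+1), scaled by the element size */
    int *via_index = &a[i + 1];
    int *via_calc  = (int *)((char *)a + (i + 1) * sizeof(int));
    printf("%s\n", via_index == via_calc ? "same address" : "different");
    return 0;
}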

Array accesses in 3AI code

• one possible way: assume 2 additional 3AIC instructions
• remember: 3AIC can be seen as intermediate code, not as instruction set of a particular HW!
• 2 new instructions8

t2 = a[t1]     ; fetch value of array element
a[t2] = t1     ; assign to the address of an array element

Source code

a[i+1] = a[j*2] + 3;

7 In C, arrays start at a 0-offset as the first array index is 0. Details may differ in other languages.
8 Still in 3AIC format. Apart from the “readable” notation, it’s just two op-codes, say =[] and []=.


TAC

t1 = j * 2
t2 = a[t1]
t3 = t2 + 3
t4 = i + 1
a[t4] = t3

We have mentioned that IC is an intermediate representation that may be more or less close to actual machine code. It’s a design decision, and there are trade-offs either way. Like in this case: obviously it’s (slightly) easier to translate array accesses to a 3AIC which offers such array accesses itself (like on this slide). It’s, however, not too big a step to do the translation without this extra luxury. In the following we see how to do exactly that, without those array accesses at the IC level (both for 3AIC as well as for P-code). That’s done by macro expansion, something that we touched upon earlier. The fact that one can “expand away” the extra commands shows there are no real complications either way (with or without that extra expressivity).

One interesting aspect, though, is the use of the helper function elem_size. Note that this depends on the type of the data structure (the elements of the array). It may also depend on the platform, which means the function elem_size is (at the point of intermediate code generation) conceptually not yet available, but must be provided and used when generating platform-dependent code. A similar “trick” we will see soon when compiling record accesses (in the form of a function field_offset).

As a side remark: syntactic constructs that can be expressed in that easy way, by forms of macro-expansion,are sometimes also called “syntactic sugar”.

Or “expanded”: array accesses in 3AI code (2)

Expanding t2=a[t1]

t3 = t1 * elem_size(a)
t4 = &a + t3
t2 = *t4

Expanding a[t2]=t1

t3 = t2 * elem_size(a)
t4 = &a + t3
*t4 = t1

• “expanded” result for a[i+1] = a[j*2] + 3

t1 = j * 2
t2 = t1 * elem_size(a)
t3 = &a + t2
t4 = *t3
t5 = t4 + 3
t6 = i + 1
t7 = t6 * elem_size(a)
t8 = &a + t7
*t8 = t5


Array accesses in P-code

Expanding t2=a[t1]

lda t2
lda a
lod t1
ixa elem_size(a)
ind 0
sto

Expanding a[t2]=t1

lda a
lod t2
ixa elem_size(a)
lod t1
sto

• “expanded” result for a[i+1] = a[j*2] + 3

lda a
lod i
ldc 1
adi
ixa elem_size(a)
lda a
lod j
ldc 2
mpi
ixa elem_size(a)
ind 0
ldc 3
adi
sto

Extending grammar & data structures

• extending the previous grammar

exp    → subs = exp2 | aexp
aexp   → aexp + factor | factor
factor → ( exp ) | num | subs
subs   → id | id [ exp ]


Syntax tree for (a[i+1]:=2)+a[j]

[Syntax tree: root +, with left child = (assignment) whose children are a[] (with index expression + over i and 1) and 2, and right child a[] with index j.]

Code generation for P-code

The next slides show (as C code) how one could generate code for the “array access” grammar from before. Compared to the procedures for code generation before, the procedure has one additional argument, a boolean flag. That has to do with the distinction we want to make (here) whether the argument is to be interpreted as an address or not. And that in turn is related to so-called L-values and R-values and the fact that the grammar allows “assignments” (written x = exp2) to be expressions themselves. In the code generation, that is also reflected by the fact that we use stn (non-destructive writing).

Otherwise: compare the code snippet from the earlier slides about “Array accesses in P-code”.

Code generation for P-code (op)

void genCode(SyntaxTree t, int isAddr) {
  char codestr[CODESIZE];
  /* CODESIZE = max length of 1 line of P-code */
  if (t != NULL) {
    switch (t->kind) {
    case OpKind:
      { switch (t->op) {
        case Plus:
          if (isAddr) emitCode("Error");   // new check
          else {                           // unchanged
            genCode(t->lchild, FALSE);
            genCode(t->rchild, FALSE);
            emitCode("adi");               // addition
          }
          break;
        case Assign:
          genCode(t->lchild, TRUE);        // "l-value"
          genCode(t->rchild, FALSE);       // "r-value"
          emitCode("stn");

Code generation for P-code (“subs”)

• new code, of course

        case Subs:
          sprintf(codestr, "%s %s", "lda", t->strval);
          emitCode(codestr);
          genCode(t->lchild, FALSE);
          sprintf(codestr, "%s %s %s",
                  "ixa elem_size(", t->strval, ")");
          emitCode(codestr);
          if (!isAddr) emitCode("ind 0");  // indirect load
          break;

        default:
          emitCode("Error");
          break;


Code generation for P-code (constants and identifiers)

    case ConstKind:
      if (isAddr) emitCode("Error");
      else {
        sprintf(codestr, "%s %s", "ldc", t->strval);
        emitCode(codestr);
      }
      break;
    case IdKind:
      if (isAddr)
        sprintf(codestr, "%s %s", "lda", t->strval);
      else
        sprintf(codestr, "%s %s", "lod", t->strval);
      emitCode(codestr);
      break;
    default:
      emitCode("Error");
      break;
    }
  }

}

Access to records

Let’s also have a short look at records. One may also consult the remarks made when discussing types resp. the memory layout for different data types (in connection with the run-time environment). But the layout is repeated here on the slides. Records are not much more complex than arrays; it’s only that the different slots are not “uniformly” sized. Thus one cannot simply access “slot number 10” (using indexed access or pointer arithmetic). Luckily, however, the offsets are all statically known (by the compiler), and with that, one can access the corresponding slot.

One complication is: the offset may be statically known (before running the program), but actually not yet right now, in the intermediate code phase. Typically it may be known only when having decided on the platform. That’s still at compile time, but lies “in the future” in the phased design of the compiler. It’s not hard to solve that. Instead of generating a concrete offset right now, one injects some “function” (say field_offset) whose implementation (resp. expansion) will be done later, as part of fixing platform-dependent details. It’s similar to what we already used in the context of the array accesses, which made use of a function elem_size.

C-Code

typedef struct rec {
  int i;
  char c;
  int j;
} Rec;
...
Rec x;


Layout

• fields with (statically known) offsets from base address
• note:

  – goal: intermediate code generation platform independent
  – another way of seeing it: it’s still IR, not final machine code yet

• thus: introduce function field_offset(x,j)
• calculates the offset
• can be looked up (by the code-generator) in the symbol table
  ⇒ call replaced by actual offset (see also the small C illustration below)
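As announced above, here is a small C illustration of the point that field offsets are statically known once the layout is fixed: the standard offsetof macro plays essentially the role that the field_offset helper plays in the text (a minimal, self-contained example):

#include <stdio.h>
#include <stddef.h>

typedef struct rec {
  int i;
  char c;
  int j;
} Rec;

int main(void) {
  /* the offsets are fixed by the platform's layout rules; a code generator
     could substitute these numbers for field_offset(x,i) etc. */
  printf("field_offset(x,i) = %zu\n", offsetof(Rec, i));
  printf("field_offset(x,c) = %zu\n", offsetof(Rec, c));
  printf("field_offset(x,j) = %zu\n", offsetof(Rec, j));
  return 0;
}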

Records/structs in 3AIC

• note: typically, records are implicitly references (as for objects)
• in (our version of a) 3AIC: we can just use &x and *x

simple record access x.j

t1 = &x + field_offset(x,j)

left and right: x.j := x.i

t1 = &x + field_offset(x,j)
t2 = &x + field_offset(x,i)
*t1 = *t2

The second example shows record access as l-value and as r-value.

Field selection and pointer indirection in 3AIC

Intro

Next we cover pointer indirection, actually in connection with records. In C-like languages, that’s the way one can implement recursive data structures (which makes it an important programming pattern). Of course, in languages without pointers, which may support inductive data types for instance, those structures need to be translated similarly. The C code shows a typical example, a tree-like data structure. The following snippets then show two typical examples making use of such trees, one on the left-hand side, one on the right-hand side of an assignment. The notation -> is C-specific, here used to “move” up or down the tree. The same example (the tree) will also be used to show the p-code translation afterwards.


C code

typedef struct treeNode {
  int val;
  struct treeNode *lchild, *rchild;
} treeNode;
...
treeNode *p;

Assignment involving fields

p->lchild = p;
p = p->rchild;

3AIC

t1 = p + field_offset(*p, lchild)
*t1 = p
t2 = p + field_offset(*p, rchild)
p = *t2

Structs and pointers in P-code

• basically the same basic “trick”
• make use of field_offset(x,j)

3AIC

p->lchild = p;
p = p->rchild;

lod p
ldc field_offset(*p, lchild)
ixa 1
lod p
sto
lda p
lod p
ind field_offset(*p, rchild)
sto

9.9 Control statements and logical expressions

So far, we have dealt with straight-line code only. The main “complication” was compound expressions, which do not exist in the intermediate code, neither in 3AIC nor in the p-code. That required the introduction of temporaries resp. the use of the stack to store those intermediate results. The core addition to deal with control statements is the use of labels. Labels can be seen as “symbolic” representations of “program lines” or “control points”. Ultimately, in the final binary, the platform will support jumps and conditional jumps which will “transfer” control (= program pointer) from one address to another, “jumping to an address”. Since we are still at an intermediate code level, we do jumps not to real addresses but to labels (referring to the starting points of sequences of intermediate code). As a side remark: also assembly languages will in general support labels to make the program at least a bit more human-readable (and relocatable) for an assembly programmer. Labels and goto statements are also known in


(not-so-)high-level languages such as classic Basic (and even Java has goto as a reserved word, even if it makes no use of it).

Besides the treatment of control constructs, we discuss a related issue, namely a particular use of boolean expressions. It’s discussed here as well, as (in some languages) boolean expressions can behave as control constructs, too. Consequently, the translation of that form of booleans requires similar mechanisms (labels) as the translation of standard control statements. In C-like languages, that’s known as short-circuiting.

As a not-so-important side remark: concretely in C, “booleans” and conditions operate also on more than just a two-valued boolean domain (containing true and false or 0 and 1). In C, “everything” that’s not 0 is treated as true (i.e., like 1). That may sound not too “logical” but reflects how some hardware instructions and conditional jumps work. Doing some operations sets “hardware flags” which are then used for conditional jumps: jump-on-zero checks whether the corresponding flag is set accordingly. Furthermore, in functional languages the phenomenon also occurs (but is typically not called short-circuiting), and in general there, the dividing line between control and data is blurred anyway.
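A small, self-contained C illustration of that remark:

#include <stdio.h>

int main(void) {
  int x = 42;
  if (x) printf("anything non-zero counts as true\n");
  printf("%d %d\n", x != 0, x && 1);   /* comparisons and &&/|| yield 0 or 1 */
  return 0;
}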

Control statements

• so far: basically straight-line code
• general (intra-procedural) control more complex thanks to control statements:

  – conditionals, switch/case
  – loops (while, repeat, for . . . )
  – breaks, gotos, exceptions . . .

important “technical” device: labels

• symbolic representation of addresses in static memory

• specifically named (= labelled) control flow points
• nodes in the control flow graph

• generation of labels (cf. also temporaries)

Intra-procedural means “inside” a procedure. Inter-procedural control flow refers to calls and returns, which is handled by calling sequences (which also maintain, in standard C-like languages, the call stack of the RTE).

Concerning gotos: gotos (if the language supports them) are almost trivial in code generation, as theyare basically available at machine code level. Nonetheless, they are “considered harmful”, as they messup/break abstractions and other things in a compiler/language.

Loops and conditionals: linear code arrangement

if-stmt    → if ( exp ) stmt else stmt
while-stmt → while ( exp ) stmt

• challenge:

  – high-level syntax (AST): well-structured (= tree), which implicitly (via its structure) determines complex control-flow beyond SLC
  – low-level syntax (3AIC/P-code): rather flat, linear structure, ultimately just a sequence of commands


Arrangement of code blocks and cond. jumps

The two pictures show the “control-flow graph” of two structured commands (conditional and loop). They should be clear enough. However, the pictures can also be read as containing more information than the CFG: the graphical arrangement hints at the fact that ultimately the code is linear. A crucial command will be the conditional jump, but those are one-armed commands. That means, one jumps on some condition; if the condition is not met, one does not jump. That is called “fall-through”. In the pictures, it’s “hinted at” insofar as the boxes are aligned strictly from top to bottom (a graphical illustration of a (control-flow) graph structure would not need to do that; a graph is a graph consisting of nodes and edges, no matter how one arranges them for illustrative purposes). Secondly, the two graphs always use the true-case as fall-through. Of course, the underlying intermediate code can support different forms of conditional jumps (like jump-on-zero and jump-on-non-zero) which may swap the situation. Our code will work with jump-on-false, which explains the true-as-fall-through depiction.

Anyway, the pictures are intended to remind us that we are generating code in a linear intermediate code language; in particular, the graph (with its true and false edges) should not be misunderstood to mean we still have two-armed jumps.

Conditional

While


The “graphical” representation can also be understood as a control flow graph. The nodes contain sequences of “basic statements” of the form we covered before (like one-line 3AIC assignments), but no conditionals and similar and no procedure calls (we don’t cover them in this chapter anyhow). So the nodes (also known as basic blocks) contain straight-line code.

In the following we show how to translate conditionals and while statements into intermediate code, bothfor 3AIC and p-code. The translation is rather straightforward (and actually very similar for both cases,both making use of labels).

To do the translation, we need to enhance the set of available “op-codes” (= available commands). We need a mechanism for labelling and a mechanism for conditional jumps. Both kinds of statements need to be added to 3AIC and p-code, and it basically works the same, except that the actual syntax of the commands is different. But those are details.

Jumps and labels: conditionals

if (E) then S1 else S2

3AIC for conditional

<code to evaluate E to t1>
if_false t1 goto L1
<code for S1>
goto L2
label L1
<code for S2>
label L2

P-code for conditional

<code to evaluate E>
fjp L1
<code for S1>
ujp L2
lab L1
<code for S2>
lab L2

3 new op-codes:

• ujp: unconditional jump (“goto”)
• fjp: jump on false
• lab: label (pseudo instruction)

Jumps and labels: while

while (E) S

3AIC for while

label L1
<code to evaluate E to t1>
if_false t1 goto L2
<code for S>
goto L1
label L2


P-code for while

lab L1
<code to evaluate E>
fjp L2
<code for S>
ujp L1
lab L2

Boolean expressions

• two alternatives for treatment:

  1. as ordinary expressions
  2. via short-circuiting

• ultimate representation in HW:

  – no built-in booleans (HW is generally untyped)
  – but “arithmetic” 0, 1 work equivalently & fast
  – bitwise ops which correspond to logical ∧ and ∨ etc.

• comparison on “booleans”: 0 < 1?
• boolean values vs. jump conditions

Short circuiting boolean expressions

The notation is C-specific, and a popular idiom for nifty C hackers. For non-C users it may look a bit cryptic. A “popular” error in C-like languages are nil-pointer exceptions, and programmers are well-advised to check for pointer accesses whether the pointer is nil or not. In the example, the access p->val would derail the program if p were nil. However, the “conjunction” checks for nil-ness first, and the nifty programmer knows that the first part is checked first. And not only that: if it evaluates to false (or 0 in C), the second conjunct is not executed (to find out if it’s true or false), it’s jumped over. That’s known as “short-circuit evaluation”.

Short circuit illustration

if ((p != NULL) && (p->val == 0)) ...

• done in C, for example
• semantics must fix evaluation order
• note: logically equivalent a ∧ b = b ∧ a
• cf. conditional expressions/statements (also left-to-right)

a and b  ≜  if a then b else false
a or b   ≜  if a then true else b
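The left-to-right evaluation order and the “jumping over” can be made visible with side effects; in the following minimal C example the helper check is invented purely for illustration:

#include <stdio.h>

/* helper with a side effect, to make evaluation order observable */
static int check(const char *name, int v) {
  printf("evaluating %s\n", name);
  return v;
}

int main(void) {
  if (check("a", 0) && check("b", 1))
    printf("both true\n");
  /* prints only "evaluating a": b is jumped over, as a is false */

  if (check("c", 1) || check("d", 0))
    printf("disjunction true\n");
  /* prints "evaluating c" and "disjunction true": d is not evaluated */
  return 0;
}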

Pcode

lod x
ldc 0
neq        ; x != 0 ?
fjp L1     ; jump, if x = 0
lod y
lod x
equ        ; x =? y
ujp L2     ; hop over
lab L1
ldc FALSE
lab L2


• new op-codes:

  – equ
  – neq

The code is a bit cryptic (one should ponder what it computes . . . ). It might also not be the best representation; for instance, one may come up with a different solution that does not load x twice.

A side remark: we are still at intermediate code. Optimizations and the use of registers have not yet entered the picture. That is to say that the above remark that x is loaded twice might not be of so much concern ultimately, as an optimizer and register allocator should be able to do something about it. On the other hand: why generate inefficient code in the hope that the optimizer will clean it up?

Grammar for loops and conditionals

stmt       → if-stmt | while-stmt | break | other
if-stmt    → if ( exp ) stmt else stmt
while-stmt → while ( exp ) stmt
exp        → true | false

• note: simplistic expressions, only true and false

typedef enum {ExpKind, IfKind, WhileKind,
              BreakKind, OtherKind} NodeKind;

typedef struct streenode {
  NodeKind kind;
  struct streenode *child[3];
  int val;                 /* used with ExpKind       */
                           /* used for true vs. false */
} STreeNode;

typedef STreeNode *SyntaxTree;

Translation to P-code

if (true) while (true) if (false) break else other

Syntax tree


P-code

ldc true
fjp L1
lab L2
ldc true
fjp L3
ldc false
fjp L4
ujp L3
ujp L5
lab L4
Other
lab L5
ujp L2
lab L3
lab L1

Code generation

• extend/adapt genCode
• break statement:

  – absolute jump to place afterwards
  – new argument: label to jump to when hitting a break

• assume: label generator genLabel()
• case for if-then-else:

  – has to deal with one-armed if-then as well: test for NULL-ness

• side remark: control-flow graph (see also later)

  – labels can (also) be seen as nodes in the control-flow graph
  – genCode generates labels while traversing the AST ⇒ implicit generation of the CFG
  – also possible:

    ∗ separately generate a CFG first
    ∗ as (just another) IR
    ∗ generate code from there

Code generation procedure for P-code


Code generation (p-code)

The code is best studied by oneself. It is a C-style representation. The code generated is p-code, though actually the important message of that procedure is not that. The code also resembles the earlier C-code implementation of p-code generation, basically a recursive procedure with a post-fix generation of code for expression evaluation. We have seen that before.

Of course, now we have to make jumps and use labels. The most important or most high-level change in the procedure has to do with handling labels. In principle, we have seen what labels are and how to use them. Now, however, we have a concrete recursive procedure traversing the tree. The (small) challenge is: sometimes one has to inject a jump command to some label which, at that point in the traversal, is not yet available, as it has not yet been generated. This is needed (for instance) when doing a break-statement in a loop. The way the code deals with it is that it takes a label as an additional argument, which is used as jump target when processing a break. This argument is handed down the recursive calls.

There are alternative ways to deal with this (mini-)challenge. Later we also have a look at an alternative way, making use of two labels as arguments.
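The concrete procedure is shown on the following slides. As a rough, hypothetical sketch of the idea just described (the helper names, the label numbering and all details are inventions of this sketch and will differ from the slides), one could structure it as follows; the node type is repeated to keep the sketch self-contained:

#include <stdio.h>

typedef enum {ExpKind, IfKind, WhileKind, BreakKind, OtherKind} NodeKind;
typedef struct streenode {
  NodeKind kind;
  struct streenode *child[3];
  int val;                            /* with ExpKind: 1 = true, 0 = false */
} STreeNode;
typedef STreeNode *SyntaxTree;

static int labno = 0;
static int genLabel(void) { return ++labno; }        /* fresh label number */

/* breakLabel: label to jump to when a break statement is encountered */
static void genCode(SyntaxTree t, int breakLabel) {
  if (t == NULL) return;
  switch (t->kind) {
  case ExpKind:
    printf("ldc %s\n", t->val ? "true" : "false");
    break;
  case IfKind: {
    int lElse = genLabel(), lEnd = genLabel();
    genCode(t->child[0], breakLabel);                /* condition */
    printf("fjp L%d\n", lElse);
    genCode(t->child[1], breakLabel);                /* then branch */
    if (t->child[2] != NULL) {                       /* two-armed if */
      printf("ujp L%d\n", lEnd);
      printf("lab L%d\n", lElse);
      genCode(t->child[2], breakLabel);              /* else branch */
      printf("lab L%d\n", lEnd);
    } else {                                         /* one-armed if */
      printf("lab L%d\n", lElse);
    }
    break;
  }
  case WhileKind: {
    int lTop = genLabel(), lExit = genLabel();
    printf("lab L%d\n", lTop);
    genCode(t->child[0], breakLabel);                /* condition */
    printf("fjp L%d\n", lExit);
    genCode(t->child[1], lExit);                     /* body: break jumps to lExit */
    printf("ujp L%d\n", lTop);
    printf("lab L%d\n", lExit);
    break;
  }
  case BreakKind:
    printf("ujp L%d\n", breakLabel);                 /* jump out of enclosing loop */
    break;
  case OtherKind:
    printf("Other\n");
    break;
  }
}

int main(void) {   /* if (true) while (true) if (false) break else other */
  STreeNode e1 = {ExpKind, {NULL}, 1}, e2 = {ExpKind, {NULL}, 1}, e3 = {ExpKind, {NULL}, 0};
  STreeNode brk = {BreakKind, {NULL}, 0}, oth = {OtherKind, {NULL}, 0};
  STreeNode inner = {IfKind, {&e3, &brk, &oth}, 0};
  STreeNode loop  = {WhileKind, {&e2, &inner, NULL}, 0};
  STreeNode outer = {IfKind, {&e1, &loop, NULL}, 0};
  genCode(&outer, 0);                                /* 0: no enclosing loop */
  return 0;
}

Called on the syntax tree for if (true) while (true) if (false) break else other, the sketch produces code of the same shape as the p-code shown earlier, only with different label numbers.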

Code generation (1)


Code generation (2)

More on short-circuiting (now in 3AIC)

• boolean expressions contain only two (official) values: true and false
• as stated: boolean expressions are often treated specially: via short-circuiting
• short-circuiting especially for boolean expressions in conditionals and while-loops and similar:

  – treat boolean expressions differently from ordinary expressions
  – avoid (if possible) calculating the boolean value “till the end”

• short-circuiting: specified in the language definition (or not)

Example for short-circuiting

Source

if a < b || (c > d && e >= f)
then
  x = 8
else
  y = 5
endif

3AIC

t1 = a < b
if_true t1 goto 1     // short circuit
t2 = c > d
if_false t2 goto 2    // short circuit
t3 = e >= f
if_false t3 goto 2
label 1
x = 8
goto 3
label 2
y = 5
label 3


Code generation: conditionals (as seen)

Alternative P/3A-Code generation for conditionals

• Assume: no break in the language, for simplicity
• focus here: conditionals
• not covered in [9]


Alternative 3A-Code generation for boolean expressions


10 Code generation

What is it about? Learning targets of this chapter:

1. 2AC
2. cost model
3. register allocation
4. control-flow graph
5. local liveness analysis (data flow analysis)
6. “global” liveness analysis

Contents: 10.1 Intro, 10.2 2AC and costs of instructions, 10.3 Basic blocks and control-flow graphs, 10.4 Code generation algo, 10.5 Global analysis.

10.1 Intro

Overview

This chapter does the last step, the “real” code generation. Much of the material is based on the (old) dragon book [2]. The book is a classic in compiler construction, and the principles on which the code generation is discussed are still ok. Technically, the code generation is done for two-address machine code, i.e., the code generation will go from 3AIC to 2AC, i.e., to an architecture with a 2A instruction set, instructions in a 2-address format. For intermediate code, the two-address format (which we did not cover) is typically not used. If one does not use a “stack-oriented” virtual machine architecture, 3AIC is more convenient, especially when it comes to analysis (on the intermediate code level).

For hardware architectures, 2AC and 3AC have different strengths and weaknesses; it’s also a question of the technological state of the art. There are both RISC- and CISC-style designs based on 2AC as well as 3AC. Also whether the processor uses 32-bit or 64-bit instructions plays a role: 32-bit instructions may simply be too small to accommodate 3 addresses. These questions, how to design an instruction set that fits the current state or generation of chip or processor technology for some specific application domain, belong to the field of computer architecture. We assume an instruction set as given, and base the code generation on a 2AC instruction set, following Aho et al. [2]. There is also a new edition of the dragon book [1], where the corresponding chapter has been “ported” to cover code generation for 3AC, vs. the 2AC generation of the older book. The principles don’t change much. One core problem is register allocation, and the general issues discussed in that chapter would not change if one did it for a 2A instruction set.

Register allocation

Of course, details would change. The register allocation we will do is, on the one hand, actually pretty simple, simple in the sense that one does not make a huge effort at optimization. One focus will be on code generation for “straight-line intermediate code”, i.e., code inside one node of a control-flow graph. Those code blocks are also known as basic blocks. Anyway, the register allocation method walks through


one basic block, keeping track of which variable and which temporary currently contains which value, resp., for values, in which variables and/or registers they reside. This book-keeping is done via so-called register descriptors and address descriptors. As said, the allocation is conceptually simple (focusing on a not-very-aggressive allocation inside one basic block, ignoring the more complex addressing modes we discussed in the previous chapter). Still, the details look, well, already detailed and thus complicated. Those details would obviously change if we used a 3AC instruction set, but the notions of address and register descriptors would remain. Also the way the code is generated, walking through the instructions of the basic block, could remain. The way it’s done is “analogous”, on a very high level, to what had been called static simulation in the previous chapter. “Mentally” the code generator goes line by line through the 3AIC, and keeps track of where is what (using address and register descriptors). That information is useful for making good use of registers, i.e., generating instructions that, when executed, reuse registers, etc.

That also includes making “decisions” about which registers to reuse. We don’t go much into that one (like asking: if a register is “full”, i.e., contains a variable, is it profitable to swap out the value?). By swapping, I mean saving the value back to main memory, and loading another value into the register. If the new value is more “popular” in the future, being needed more often etc., and the old value maybe less, then it is a good idea to swap them, in case all registers are filled already. If there are still free registers, the simple strategy will not bother to store anything back (inside one basic block); it will simply load variables to registers as long as there is still space for them.

Optimization (and “super-optimization”), local and global aspects

Focusing on straight-line code, we are dealing with a finite problem (similar to the setting when translating p-code to 3AIC in the previous chapter), so there is no issue with non-termination and undecidability. One could therefore try to make an “absolutely optimal” translation of the 3AIC. The chapter will discuss some measures of how to estimate the quality of the code; it’s a simple cost model. One could use that cost model (or other, more refined ones) to define what optimal means, and then produce optimal code for that. Optimizations that are ambitious in that way are sometimes called “super-optimization”, and compiler phases that do that are super-optimizers. Super-optimization may not only target register usage or cost models like the one used here; it’s a general (but slightly weird) terminology for transforming code into code which is genuinely and demonstrably optimal (according to a given criterion). In general, that’s of course fundamentally impossible, but for straight-line code it can be done.

The code generation here does not do that. Actually, it’s not often attempted outside this lecture either. One reason should be clear: it’s costly. For long pieces of straight-line code (i.e., big basic blocks) it may take too much time. There is also the effect of diminishing marginal utility. A relatively modest and simple “optimization” may lead to an initially drastic improvement, compared to not doing anything at all. However, getting the last 10% of speed-up or improvement pushes up the required effort disproportionately.

Another (but related) reason is: super-optimization can be achieved at all only for parts of the code (like straight-line code and basic blocks). One can push the boundaries there, as long as it remains a finite problem, for instance allowing branching (but leaving out loops). As a side remark: symbolic execution is an established terminology and technique which can be seen as some form of “static simulation”, but addressing also conditionals. At any rate, that makes the problem more complicated and targets larger chunks of code, which drives up the effort as well.

So, even if we target larger chunks of code or are more aggressive in the goals of optimization, there are boundaries to what can be done. If we stick to our setting, where we currently generate code per basic block, super-optimization may be costly but doable. But it’s locally optimal, per one block. Especially for code where local blocks are small, that would have the positive effect that locally super-optimized code may be produced without too much effort. But what for, if the non-local quality is bad? Focusing all optimization effort onto the local block and ignoring the global situation may be an unbalanced use of resources. It may be better to do a decent (but not super-optimal) local optimization that, with a low-effort approach, already achieves drastic improvements, and also to invest in simple global analysis and optimization (perhaps approximative), to also reap low-effort but good initial gains there.

That’s also the route the lecture takes: now we are doing a simple register allocation, without much optimization or strategy to find the best register usage (and we also discuss one global aspect of the program, across the boundaries of one elementary block). That global aspect will be live variable analysis, which


will come later; first, let’s discuss local live variable analysis, which is used for the local code generation. We can already remark here that live variable analysis can be done locally or globally; the generation just uses live variable information for its task, whether that information is local or global. So the code generation is, in that way, independent from whether one invests in local or in global live variable analysis. It just produces better code, i.e., makes better use of registers, when being based on better information (like live variable information coming from a global live variable analysis). Indeed, the code generation would produce semantically correct code without any live variable analysis! In that way, the analysis and the code generation are separate problems (but not independent, as the register allocation in the code generation makes use of the information from live variable analysis).

Concerning the “degree of locality” of the code generation: the algorithm works super-locally, insofar as it generates 2AC line by line: every line of 3AIC is translated into two (or sometimes one) lines of 2AC. There is no attempt afterwards to go through the 2AC again, getting some more global perspective and then optimizing it, for instance rearranging the lines, or getting a better use of the registers than the one that had been arranged by the line-by-line code generation. Still, the code generation is not completely local. In the previous chapter, the macro expansion was really line-by-line local, where 3AIC was translated to 1AIC (i.e., p-code): each 3AIC line was expanded into some lines of p-code in a completely “context-free” manner, focusing on each individual line independent of the context in which it is used. That simplistic expansion ignored the past, i.e., what happened before, and the future, i.e., what will happen afterwards. The code generation here accounts, in a simple manner, for both aspects. What has happened in the past is kept track of by the register and address descriptors. Aspects of the future are taken care of by the liveness analysis. Whether one does a block-local liveness analysis or a global analysis just changes how “far into the future” the analysis looks. As far as the past is concerned: that one is (in our presentation) just block-local. The book-keeping with the register and address descriptors starts fresh with each block; there is no memory of what potentially had happened in some earlier block.

Live variable analysis

Now, what is live variable analysis anyway, and what role does it play here? Actually, being live means a simple thing for a variable: it means the variable “will” be used in the future. One could dually also say a variable is dead if that is not the case (only that one normally talks about variables being live, not so much about their death; “death analysis” or similar would not sound attractive. . . ). That’s important information, especially when talking about register allocation: if it so happens that the value of a variable is stored in a register and if one additionally figures out that the variable is dead (i.e., not used in the future), the register may be used otherwise. What that involves, we elaborate on further below; as a first approximation we can think that the register is simply “free” and can just be used when needed otherwise.

Now, the “definition” of a variable being live is a bit imprecise, and we wrote that the variable “will be used in the future” using quotation marks. What’s the problem? The problem is that the future may be unknown; it may be impossible to know the exact future. There can be different reasons for that. One is, depending on which language (fragment) one targets for the analysis, fundamental principles like undecidability may prevent the future behavior from being known exactly. There can actually be another reason, namely if one analyzes not a global program but only a fragment (maybe one basic block, one loop body, one procedure body). That means the program fragment being analyzed is “open” insofar as its behavior may depend on data coming from outside. In particular, the program fragment’s behavior depends on that outside data or “input” when conditionals or conditional jumps are used. Even if the possible input is finite, maybe just a single bit, i.e., a single input of “boolean type”, that may influence the behavior: one behavior where, at a given point, a variable will be used, and another behavior where that variable will not be used. In one future behavior the variable is live, in the other future it is dead. Not knowing whether the input is true or false, one cannot say that the variable “will” be used or not; it simply depends. This obstacle is a different one than the undecidability in principle of general programs, which applies to closed programs already. For finite possible inputs (and without loops) the problem is still finite: an analysis can just “statically simulate” all runs one by one for each input, and for each individual behavior it is exactly known at each point whether a variable will be used or not, assuming that the program is deterministic. But overall, without the input known, the program behavior is unknown.


Coming back to the “definition” of liveness: the long discussion hopefully clarified that in a general setting, when analyzing a (piece of a) program, it cannot be about whether a variable will be used. The question is whether the variable may be used. We want to use the liveness information in particular to see if one can consider a register as free again. If there exists a possible future where the variable may be used, then the code generator cannot risk reusing the register. That means the notion of (static) liveness is a question of a condition that “may-in-the-future” apply. There are other interesting conditions of that sort; some would be characterized by “must” instead of “may”, and some may refer to the past, not the future. That would lead to the area of data-flow analysis (or, more ambitiously, abstract interpretation). We won’t go deep there; we stick to live-variable analysis (for the purpose of code generation). However, if one understands live variable analysis, especially the global live variable analysis covered later, one has understood core principles of many other flavors of data flow analysis (may or must, forward or backward).

Talking about conditions applying to the “past”, perhaps we should defuse a possible misconception. Liveness of a variable refers to the future, and we said there are reasons why one cannot know the future. Everyone knows it’s hard to do predictions, in particular concerning the future. So one may come to believe that analyzing the past would not face the same problems. When running a (closed) program, that may be true: we cannot know the future, but we may record the past (“logging”), so the past is known. But here we are still inside the compiler, doing static analysis, and we may deal with open program fragments. For concreteness’ sake, let’s use some particular question for illustration: “undefined variables” (or nil-pointer analysis). That refers to some condition in the past, namely: there exists a run where there is no initialization of a variable. Or dually, a variable is properly initialized at some point when for all pasts that lead to that point the variable has been initialized. But for open programs (and/or when working with abstractions), there may statically be more than one possible past, and we cannot be sure which one will concretely be taken. Maybe indeed all or some of them will be taken at run time, when the code fragment under scrutiny is executed more than once. That is the case when the analyzed code is part of a loop, or corresponds to a function body called variously with different arguments. In summary, the distinction between “may” and “must” applies also to analyzing properties concerning the past.

Reusing and “freeing” a register

We said that the liveness status of a variable is very important for register usage. That’s understandable: a variable being dead does not need to occupy precious register space, and the register can be “freed”. We promised in the previous paragraph that we would elaborate on that a bit, as it involves some fine points that we will see in the algo later, which may not be immediately obvious. First of all, as far as the hardware platform is concerned, there is no such thing as a full or non-free resp. empty or free register. A register is just some fast and small piece of specific memory in hardware in some physical state, which corresponds to a bit pattern or binary representation. The latter is a simplification or abstraction, insofar as registers may be in some “intermediate, unstable” state during (very short) periods of time between “ticks” of the hardware clock. So the binary illusion is an abstraction maintained typically with the help of a clock, and compilers rely on that: registers contain bit strings or words consisting of bits. But it’s not the case that 0000 “means” empty, of course. But when is a register empty then? As said, as far as the hardware is concerned, which executes the 2AC that we are now about to generate, fullness and emptiness of registers simply do not exist. They only exist conceptually inside the compiler and code generator, which has to keep track of the status, “picturing” registers as full and empty. If the code generator wants to reuse a register (in that it generates a command that loads the relevant piece of data into a register), it prefers to use an “empty” one, for instance one that so far has not been used at all. Initially, it will rate all registers as empty (though certainly some bit pattern is contained in them in electric form, so to say). Now in case a register contains the value for a variable, but the variable is known to be dead, doesn’t that qualify the register as free? So isn’t it as easy as the following?

a register is free if it contains dead data (or “no data” insofar as the register has not been used before)?

In some way, sure enough; that’s indeed why liveness analysis is so crucial for register allocation. However, one has to keep in mind another aspect. The problem is the following: just because the value of a register is connected to a variable that is dead does not mean one can “forget” about it and, by reusing the register, overwrite it. So, why not, isn’t that the definition of being dead? In a way, yes. But there are two aspects of why that’s not enough. One is that the variable may keep its data in two copies, one in main memory


and one in the register. And it may well be the case that the one in main memory is “out of sync”. After all, the code generator loaded the variable into a register to manipulate the “variable” faster, therefore it’s a good sign that it’s out of sync. Keeping main memory and registers “always” in sync is meaningless; then we would be better off without registers at all. Still, if the variable is really dead, what does this inconsistency matter? That’s the second point we need to consider: the concrete code generator later will effectively make a “local” liveness analysis only. So it only knows what’s going on in the current block, whether there the variable is live or dead (respectively, all variables are “assumed” to be live at the end of a block; that’s different from temporaries, which are assumed to be dead). That means “one” has to store the value back to main memory. Actually, “one” needs to store that value back if “one” suspects the values disagree, i.e., if there is an inconsistency between them. Who is the “one” that needs to store the value back? Of course that’s the code generator, which has to generate, in case of need, a corresponding store command, and it has to consult the register and address descriptors to make the right decision. After “synchronizing” the register with main memory, the register can be considered as “free”.
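To make that book-keeping a bit more tangible, here is a deliberately tiny, hypothetical sketch (the RegDesc type, the dirty flag and the MOV pseudo-syntax are inventions for this illustration, not the descriptors or instruction set used later in the chapter): the register descriptor remembers which variable a register holds and whether the register copy is newer than main memory, and “freeing” the register first writes the value back if needed, exactly as described above.

#include <stdio.h>
#include <string.h>

#define NREGS 2
typedef struct { char var[16]; int inUse; int dirty; } RegDesc;   /* register descriptor */
static RegDesc regs[NREGS];

static void freeReg(int r) {
  if (regs[r].inUse && regs[r].dirty)                 /* register copy newer than memory */
    printf("MOV %s, R%d    ; write back before reuse\n", regs[r].var, r);
  regs[r].inUse = 0;
  regs[r].dirty = 0;
}

static void loadInto(int r, const char *v) {
  freeReg(r);                                         /* spill old content if necessary */
  printf("MOV R%d, %s\n", r, v);
  strcpy(regs[r].var, v);
  regs[r].inUse = 1;
  regs[r].dirty = 0;
}

int main(void) {
  loadInto(0, "x");       /* R0 := x */
  regs[0].dirty = 1;      /* pretend x was updated in R0, memory copy stale */
  loadInto(0, "y");       /* reusing R0 forces a write-back of x first */
  return 0;
}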

Local liveness analysis here

That was a slightly panoramic view of topics we will touch upon in this chapter. But the chapter will be more focused and concrete: code generation from 3AIC to 2AC, making use of liveness analysis which is mainly done locally, per basic block. We have so far discussed live variable analysis and problems broader than we actually need for what is called local analysis here (local in the sense of per-basic-block local). For basic blocks, which are straight-line code, there is neither looping (via jumps) nor is there branching (which would lead to don’t-know non-determinism in the way described). That’s the reason why techniques similar to what has been called “static simulation” earlier will be used. The live variable analyzer steps through the code line by line, and that may be called simulation (the terms simulation or static simulation are, however, not too widely used).

There are two aspects worth noting in that context. One is: when talking about “simulation”, it’s not that the analysis procedure does exactly what the program will do. Since we are doing local analysis of only a fragment of a program (a basic block), we don’t know the concrete values, so that’s not easily done (one could do it symbolically, though). But we don’t need to do that, as we are not interested in what the program exactly does; we are interested in one particular aspect of the program, namely the question of the liveness status of variables. In other words, we can get away with working with an abstraction of the actual program behavior. In the setting here, for local liveness, even given the fact that the basic block is “open”, that allows an exact analysis; in particular we know exactly whether the variable is live or not. So the “may” aspect discussed above is irrelevant locally. The fact that we don’t know the exact values of the variables (coming potentially from “outside” the basic block under consideration) does not influence the question of liveness; it’s independent from the values. If we had conditionals, that would change. So, in that way it’s not a “static simulation” of actual behavior; it’s more a simulation stepping through the program but working with an abstract representation of the involved data. As said, the concrete values can be abstracted away, in this case, without losing precision.

The second aspect we would like to mention in connection with calling the analysis some form of “static simulation”: actually, the liveness analysis “steps” through the program in a backward manner. In that sense, the term “simulation” may be dubious (actually, the term static simulation is not widely used anyway). But in the more general setting of data flow analysis, there are many useful backward analyses (live variable analysis being one prominent example) as well as many useful forward analyses (undefined variable analysis would be one).

Therefore, in our setting of code generation: the code generation will “step” through the 3AIC in a forward manner, generating 2AC, keeping track of book-keeping information known as register descriptors and address descriptors. In that process, the code generation makes use of information on whether a variable is locally live or not (or on whether a variable may be globally live or not, when having global liveness info at hand). That means that, prior to the code generation, there is a liveness analysis phase, which works backwards.
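As a minimal, hypothetical illustration of such a backward pass over one basic block (the Quad representation, the convention that temporaries start with “t”, and the helpers are assumptions of this sketch only), the following C program computes, from the end of the block towards its start, which names are live before each line of the block t1 = x + 3; x = t1; t2 = t1 + 4, with x assumed live and temporaries assumed dead at block exit:

#include <stdio.h>
#include <string.h>

typedef struct { const char *res, *arg1, *arg2; } Quad;   /* "res = arg1 op arg2" */

static char live[16][16];                                 /* current live set */
static int nlive = 0;

static int isVar(const char *s) {                         /* crude: constants start with a digit */
  return s != NULL && (s[0] < '0' || s[0] > '9');
}

static void setLive(const char *v, int on) {
  for (int i = 0; i < nlive; i++)
    if (strcmp(live[i], v) == 0) {
      if (!on) { nlive--; if (i < nlive) strcpy(live[i], live[nlive]); }
      return;
    }
  if (on) strcpy(live[nlive++], v);
}

int main(void) {
  Quad block[] = { {"t1", "x", "3"}, {"x", "t1", NULL}, {"t2", "t1", "4"} };
  int n = 3;
  setLive("x", 1);                    /* at block exit: variables live, temporaries dead */
  for (int i = n - 1; i >= 0; i--) {  /* walk backwards through the block */
    setLive(block[i].res, 0);         /* the assigned name is dead before the line ... */
    if (isVar(block[i].arg1)) setLive(block[i].arg1, 1);   /* ... its operands are live */
    if (isVar(block[i].arg2)) setLive(block[i].arg2, 1);
    printf("live before line %d:", i + 1);
    for (int j = 0; j < nlive; j++) printf(" %s", live[j]);
    printf("\n");
  }
  return 0;
}

A real implementation would attach this information to the quadruples for later use by the register allocator; the sketch only prints it.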


Exactness of local liveness analysis (some finer points) To avoid saying something incorrect, let’s qualify the claim from above that stipulated: for straight-line 3AIC, exact liveness calculation is possible (and that’s what we will do). That’s pretty close to the truth. . .

However, we look at the code generation ignoring complicating factors, like more complex addressing modes and “pointers”. We stated above: the liveness status of a variable does not depend on the actual value in the variable, and that’s the reason why an exact calculation can be done. Unfortunately, in the presence of pointers, aliasing enters the picture, and the actual content of the pointer variable plays a role. Similar complications arise for other more complex addressing modes. We don’t cover those complications, though. We focus on the most basic 3AIC instructions, but when dealing with more advanced addressing modes (as done in realistic settings), the exact future liveness status would not be known, not even for straight-line code. [2] covers that as well, but it’s left out from the slides and the pensum.

There is another fine point. The assumption that in straight-line code we know that each line is executed exactly once is actually not true! In case our instruction set contains operations like division, there may be division-by-zero exceptions raised by the (floating point) hardware. Similarly, there may be overflows or underflows raised by other respective hardware. Whether or not such an exception occurs depends on the concrete data. So it’s not strictly true that we know whether a variable is live or not. It may be that an exception derails the control flow, and, from the point of the exception, the code execution in that block stops (something else may continue to happen, but at least not in this block). One may say: well, if such a low-level error occurs, probably trashing the program, who cares if the live variable analysis was not predicting the exact future 100%?

That’s a standpoint, but a better one is: the analysis actually did not do anything incorrect. The liveness analysis is a “may” analysis, and that even applies to straight-line code. The analysis says a variable in that block may be used in the future, but in the unlikely event of some intervening catastrophe, it actually may not be used. And that’s fine: considering a variable live when in fact it turns out not to be the case is to err on the safe side. Unacceptable would be the opposite case: an exception tricking the code generator into rating variables as dead when in fact they are not. But fortunately that’s not the case, so all is fine.

Code generation

• note: code generation so far: AST+ to intermediate code

  – three-address intermediate code (3AIC)
  – P-code

• ⇒ intermediate code generation
• i.e., we are still not there . . .
• material here: based on the (old) dragon book [2] (but principles still ok)
• there is also a new edition [1]

In this section we work with 2AC as machine code (as in the older, classical “dragon book”). An alternative would be 3AC also at the machine-code level (not just for intermediate code); details would change, but the principles would be comparable. Note: the message of the chapter is not that, in the last translation and code generation step, one has to find a way to translate 3-address code to 2-address code. If one assumed machine code in a 3-address format, the principles would be similar. The core of the code generation is the (here rather simple) treatment of registers. The code generation and register allocation presented here are rather straightforward; they will look “detailed” and “complicated”, but they are not very complex in the sense that not very much computational effort is put into optimization during code generation. One optimization done is based on liveness analysis. An occurrence of a variable is “dead” if the variable will not be read in the future (unless it’s first overwritten). The opposite concept is that the occurrence of a variable is live. It should be obvious that this kind of information is essential for making good decisions for register allocation. The general problem there is: we typically have fewer registers than variables and temps. So the compiler must make a selection: who should be in a register and who not? A static scheme like “the first variables in, say, alphabetical order should be in registers, the others not” is not worth being called optimization. . . First-come-first-serve like “if I need a variable, I load it to a register, if there is still one free, otherwise not” is not much better. Basically, what is missing is taking into account information about when a variable is no longer used (when it is no longer live), thereby figuring out at which point a register can be considered free again. Note that we are not talking about run time; we are talking about code generation,

Page 376: CourseScript - uio.no

10 Code generation10.1 Intro 373

i.e., compile time. The code generator must generate instructions that loads variables to registers it hasfigured out to be free (again). The code generator therefore needs to keep track over the free and occupiedregisters; more precisely, it needs to keep track of which variable is contained in which register, resp. whichregister contains which variable. Actually, in the code generation later, it can even happen that one registercontains the values of more than one variable. Based on such a book-keeping the code generation mustalso make decisions like the following: if a value needs to be read from main memory and is intended tobe in a register but all of them are full, which register should be “purged”. As far as the last question isconcerned, the lecture will not drill deep. We will concentrate on liveness analysis and we will do thatin two stages: a block-local one and a global one. the local one concentrates on one basic block, i.e.,one block of straight-line code. That makes the code generation kind of like what had been called “staticsimulation” before. In particular, the liveness information is precise (inside the block): the code generatorknows at each point which variables are live (i.e., will be used in the rest of the block) and which not (butremember the remarks at the beginning of the chapter, spelling out in which way that this may not be a100% true statement). When going to a global liveness analysis, that precision is no longer doable, and onegoes for an approximative approach. The treatment there is typical for data flow analysis. There are manydata flow analyses, for different purposes, but we only have a look at liveness analysis with the purpose ofoptimizing register allocation.

Intro: code generation

• goal: translate intermediate code (= 3AI-code) to machine language• machine language/assembler:

– even more restricted– here: 2 address code

• limited number of registers• different address modes with different costs (registers vs. main memory)

Goals

• efficient code• small code size also desirable• but first of all: correct code

When not said otherwise: efficiency refers in the following to efficiency (or quality) of the generated code.Fastness of compilation, or with a limited memory print) may be important, as well (likewise may thesize of the compiler itself be an issue, as opposed to the size of the generated code). Obviously, there aretrade-offs to be made.

But note: even if we compile for a memory-restricted platform, it does not mean that we have to compileon that platform and therefore need a “small” compiler. One can, of course, do cross-compilation.

Code “optimization”

• often conflicting goals• code generation: prime arena for achieving efficiency• optimal code: undecidable anyhow (and: don’t forget there’s trade-offs).• even for many more clearly defined subproblems: untractable

“optimization”

interpreted as: heuristics to achieve “good code” (without hope for optimal code)

• due to importance of optimization at code generation– time to bring out the “heavy artillery”

Page 377: CourseScript - uio.no

374 10 Code generation10.2 2AC and costs of instructions

– so far: all techniques (parsing, lexing, even sometimes type checking) are computationally“easy”

– at code generation/optimization: perhaps invest in aggressive, computationally complex andrather advanced techniques

– many different techniques used

The above statement on the slides that everything so far was computationally simple is perhaps an over-simplificcation. For example, type inference, aka type reconstruction, is typically computationally heavy,at least in the worst case and in languages not too simple. There are indeed technically advanced typesystems around. Nonetheless, it’s often a valuable goal not to spend too much time in type checking andfurthermore, as far as later optimization is concerned one could give the user the option how much timehe is willing to invest and consequently, how agressive the optimization is done. For our coverage of typesystems in the lecture and the oblig: that one is rather simple and elementary, and poses no problems wrt.efficiency.

The word “untractable” on the slides refers to computational complexity; untractable are those for whichthere is no efficient algorithm to solve them. Tractable refers conventionally to polynomial time efficiency.Note that it does not say how “bad” the polynomial is, so being tractable in that sense still might notmean practically useful. For non-tractable problems, it’s often guaranteed that they don’t scale.

10.2 2AC and costs of instructions

Here we look at the instruction set of the 2AC. Well, actually only a small subset of it. In particular, welook at it from the perspective of a “cost model”. Later, we want to at least get a feeling that the code weare generating is “good” but then we need a feeling what the “cost” is of the generated code, i.e., the costof instructions.

When talking about 2AC, it’s actually not a concrete instruction set of a concrete platform. Concretechips have complicated inststruction sets, so it’s more that we focus on a (very small) subset of what couldbe an instruction set of a 2A platform. Now, isn’t that another “intermediate code”? We will see that thecode now (independent from the fact that its 2AC) is more low-level than before. In that way, it couldbe a real instruction set of some hardware. The intermediate code from before could not. There will bea slide, that tries to rub that in. One could tell the same story we are doing here, translating from 3AICto 2AC also by doing a translation from 3AIC to 3AC. Still that would pose equivalent problems (registerallocation, cost model etc), but the presentation here happens to make use of a 2AC.

2-address machine code used here

• “typical” op-codes, but not a instruction set of a concrete machine• two address instructions• Note: cf. 3-address-code intermediate representation vs. 2-address machine code

– machine code is not lower-level/closer to HW because it has one argument less than 3AC– it’s just one illustrative choice– the new Dragon book: uses 3-address-machine code

• translation task from IR to 3AC or 2AC: comparable challenge

2-address instructions format

Format

OP source dest

• note: order of arguments here (esp. for minus)• restrictions on source and target

– register or memory cell

Page 378: CourseScript - uio.no

10 Code generation10.2 2AC and costs of instructions 375

– source: can additionally be a constant

ADD a b // b := b + aSUB a b // b := b − aMUL a b // b := b ∗ aGOTO i // u n c o n d i t i o n a l jump

• further opcodes for conditional jumps, procedure calls . . . .

Also the book Louden [9] uses 2AC. In the 2A machine code there for instance on page 12 or the introductoryslides, the order of the arguments is the opposite!

Side remarks: 3A machine code

Possible format

OP s o u r c e 1 s o u r c e 2 d e s t

• but: what’s the difference to 3A intermediate code?• apart from a more restricted instruction set:• restriction on the operands, for example:

– only one of the arguments allowed to be a memory access– no fancy addressing modes (indirect, indexed . . . see later) for memory cells, only for registers

• not “too much” memory-register traffic back and forth per machine instruction• example:

&x = &y + *z

may be 3A-intermediate code, but not 3A-machine code

As we said, the code generation could analogously be done for 3AC instead of 2AC. But what’s thedifference then between 3AIC and 3AC, would the translation not be trivial? Not quite, there is a gapbetween intermediate code and code using the instruction set. The most important difference is the useof registers. Related to that: depending to the exact instruction set, 3AC instructions typically imposerestrictions on the operands of the instructions. In the purest form, one may allow instructions only of theform r1 := r2 + r3 (here addition as an example), where all arguments, sources and target, must allbe in registers. That would result in a pure load-store architecture: before doing any operation at all, thecode generator must issue appropriate load-commands, and the result needs to be stored back explicitly.That obviously leads at least to longer machine code, measured in number of instruction (but perhaps theinstructions themselvelse may be represented shorter). Analogous restrictions may concern the indirectaddressing modes. Instruction sets with a load-store design are often used in RISC architectures.

Cost model

• “optimization”: need some well-defined “measure” of the “quality” of the produced code• interested here in execution time• not all instructions take the same time• estimation of execution• factors outside our control/not part of the cost model: effect of caching

Page 379: CourseScript - uio.no

376 10 Code generation10.2 2AC and costs of instructions

cost factors:

• size of instruction– it’s here not about code size, but– instructions need to be loaded– longer instructions ⇒ perhaps longer load

• address modes (as additional costs: see later)– registers vs. main memory vs. constants– direct vs. indirect, or indexed access

The cost model (like the one here) is intended to model relevant aspects of the code, that influence theefficiency, in a proper and useful manner. The goal is not a 100% realistic representation of the timingsof the processor. It will be based on assigning rule-of-thumb numerical costs to different instructions.Actually, it’s very simple. The main observation is: accessing a register is “very much” faster thanaccessing main memory. But the model does not use realistic figures (maybe by consulting the specs ofthe machine or doing measurements). Indeed, “main memory” access may not have a uniform access cost(in terms of access time). There are factors outside the control of the code generation, which have todo with the memory hierarchy. The code is generated as if there are only two levels: registers and mainmemory. But, of course, that’s not realistic: there is caching (actually a whole hierarchy of caches maybe used). Furthermore, data may even be stored in the background memory, being swapped in and outunder the control of an operating system. Being not under the control of the code generator, those arestochastic influences. The compiler is not completely helpless facing caches and other memory hierarchyeffects. Based on assumptions how chashing and paging typically works, the code generator could try togenerate code that has good characterisics concerning “locality” of data. Locality means that in general it’sa good idea to store data items “than belong together” in close vicinity, and not sprinkle them randomlyacross the address space (whatever “belonging together” means). That’s because the designer of the codegenerator knows that this suites chaching or swapping algorithms, that perhaps swap out cache lines,banks of adjacent addresses, whole memory pages etc. As far as caches is concerned, that’s simply arational hardware design. But one can also turn the argument around: hardware designers know, thatit’s “natural” that data structures coming from a high-level data structure of a structured programminglanguage (and which contain conceptually data “that belongs together) will be generated in a way being“localized”. Even if the compiler writer has never thought of efficiency and memory hierarchies, it’s simplynatural to place different fields of a record side by side. Also for more complex, dynamic data structures,such principles are often observed: the nodes of a tree are all placed into the same area and not randomly.More tricky maybe the the presence of a garbage collector, that could mess that up, if done mindlessless.But also the garbage collector can maken an effort to preserve locality. So, in a way, it all hangs together:well-designed memory placement will be rewared by standard ways managing memory hierarchy, and well-designed memory management will run standard memory layout by compilers faster. It’s almost a situationof co-evolution.

But all that is more a topic for how the compiler arranges memory (beyond the general principles wediscussed in connection with memory layout and the run-time environments). Here we are looking morefocused on the code generation and trying to attribute costs on individual instruction (so questions oflocality cannot be considered, as they are about the global arrangement, neither can questions of cashingetc, as one individual instruction and the instruction set is not aware of caching, let alone the influenceof the operating system. So, how can we express the very rough observation “registers are very muchfast than memory accesses”? That’s easy, register access costs “nothing”, it will have a zero costs. Mainmemory accesses will have cost of 1. Mathematically it means, memory access is infinitely most costly thanregisters, but as said, it’s a model that may be use to generate efficient code, not as a realistic prediction ofactual running time in the physical world. Even if we had realistic figures from some where (via profilingand measuring average execution times under typical conditions), the use would be limited: as stresseda few times, genuine and absolute optimal performance is (and cannot be) the goal (super-optimizationaside). The goal is getting good or excellent performance with decent amount of effort. Precision we mayadd to the cost model maybe for nothing, as we will be happy to use the cost model as a rough guidelineon decisions like

when translating one line of 3AIC, shall I use a register right now or rather not?

Page 380: CourseScript - uio.no

10 Code generation10.2 2AC and costs of instructions 377

We will see that this is the way the code generator will work. One might not even call it “optimization”, atleast not in the sense the first some code is generated which afterwards is improved (optimized). The codegenerator takes the cost model into account on-the-fly, while spitting out the code. Actually, it does noteven consults the cost model (by invoking a function, comparing different alternatives for the next lines,and then choosing the best). It simply compiles line after line, and the decisions are plausible, and oneconvince oneself of the plausibility by looking at the cost model. Actually, one can convince oneself of theplausibility even without looking at the cost model, just knowing that registers should be preferred whenpossible. But actually that’s one of two important pieces of common knowledge the cost model captures.

What’s the second piece then? The other piece is that executing one command costs also something.So, each “line” costs 1. In that sense, the 0-costs of register access is realistic, insofar registers accessis typically done in one processor cycle, i.e., in the same time slice than the loading and executing theinstruction as a whole. So, in that sense, register accesses really don’t cost anything additional. Otheraccesses incur additional costs, and since we don’t aim at absolute realism, all the non-register accessescosts 1.

Instruction modes and additional costs

Mode Form Address Added costabsolute M M 1register R R 0indexed c(R) c+ cont(R) 1

indirect register *R cont(R) 0indirect indexed *c(R) cont(c+ cont(R)) 1

literal #M the value M 1 only for source

• indirect: useful for elements in “records” with known off-set• indexed: useful for slots in arrays

We see that there are no real restictions when and when not memory access are allowed and when registers.Earlier we mentioned something like “load-store” architectures, which does

Concerning the format, the code is split into 3 parts (following the 2AC format), each 4 byte (or 4 octets)long. That corresponds to a 32-bit architecture. That’s a popular format (actually, it’s pretty old, therehad been 32-bit machines early on (not micro-processors at that time). There are 16-bit microprocessors(in the past), and there are 64-bit processors as well. Of course, having 4 bytes for the op-code does notmean all codes are actually used for actual instructions (that would be way too many). But we have tokeep in mind (or at least in the head of our mind, as that’s no longer the concern of a compiler writer): theinstructions need to be handled by the given hardware with a given size of the “bus”, there is no longerthe freedom and flexibility of software. In particular, it’s not “byte code” (more like 4-bytes code. . . )And actually, it’s nice to think of a binary code as to represent “addition” or “jump”, but the 0 and 1’sin the code actually are connected to hardware, the slots in the 32-bit word are “wired up” connectingthem to logical gate that open and close and trigger other bits/electrons to flow from there to there thatultimatly result in another bit pattern that can interpret as that an addition has happened (on our levelof abstraction). So the actual bit-codes for the logical machine instructions are are “sparcely” distributed,and some bit-pattern are not simply unused (“undefined”) but would open and close the “logic gates” ofthe chip in a weird, meaningless manner. As said, all that is not the concern of a compiler writer, who cansee an add-code as addition, but it’s interesteding that the story does not end there, there are complexlayers of abstraction below that and also, that we are leaving the world of “anything goes” of software: thecompiler writer can design any form of intermediate representations in intermediate codes and translatebetween them etc. But below that, things get more restricted by the physics and the laws of nature.

Page 381: CourseScript - uio.no

378 10 Code generation10.2 2AC and costs of instructions

Examples a := b + c

The examples are not breathtakingly interesting. The show different possible translations and their costs.The first pair of examples shows to equivalent ways of translating them, one operating directly on themain memory, one partly loading the arguments to a register and then using that. Both version (in ourcost model) have the same cost (despite the fact that the first program has to execute 3 commands andthe second only 2).

The other two examples calculate the same command, but under a different assumption, namely: thearguments are already loaded in some registers. That drives down the costs. But that should be prettyclear, that’s why one has registers, after all.

We also see that it to profit from the use of registers, the code generator needs to know which vari-ables are stored in the registers already. That will be done by so-called address descriptors and registerdestriptors..

Also, especially the second example shows, that sometime the generated code is a bit strange: Since wehave only 2AC, one argument is source, the other one is source and destination. That means, 2AC likeaddition “destroy” one argument. That means, in general we need to temporarily copy that argumentsomewhere else, otherwise it would be destroyed. In the second example, since a is updated, the first stepuses a for that temporary copy of b.

Using registers

MOV b , R0 // R0 = bADD c , R0 // R0 = c + R0MOV R0 , a // a = R0

c o s t = 6

Memory-memory ops

MOV b , a // a = bADD c , a // a = c + a

c o s t = 6

Data already in registers

MOV ∗R1 , ∗R0 // ∗R0 = ∗R1ADD ∗R2 , ∗R1 // ∗R1 = ∗R2 + ∗R1

c o s t = 2

Assume R0, R1, and R2 contain addresses for a, b, and c

Storing back to memory

ADD R2 , R1 // R1 = R2 + R1MOV R1 , a // a = R1

c o s t = 3

Assume R1 and R2 contain values for b, and c

Page 382: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 379

10.3 Basic blocks and control-flow graphs

We have mentioned (in the introductory overview of this chapter and elsewhere) the concepts of basicblocks and control-flow graphs already. Before we continue we introduce those concepts more robustly.The notion of control flow graph is in this lecture is used at the level of IC (maybe 3AIC). The notion ofCFG makes also sense on highler levels of abstractions and lower level of abstractions, i.e., one can do acontrol-flow graph also for abstract syntax and also on machine code. At compiler desinger can also makethe decision to more than one use of CFGs as intermediate representation.

Here, we have generated 3AIC, with conditional jumps etc. And then we “reconstruct” a more high-levelrepresentation of the code by figuring out the CFG (at that level). It is not uncommon to do a CFG first,and uses the CFG assisting in the (intermediate) code generation.

Anyway, the general concept of CFG works analogously at all levels, same for basic blocks.

Basic blocks

• machine code level equivalent of straight-line code• (a largest possible) sequence of instructions without

– jump out– jump in

• elementary unit of code analysis/optimization1

• amenable to analysis techniques like– static simulation/symbolic evaluation– abstract interpretation

• basic unit of code generation

Control-flow graphs

CFG

basically: graph with

• nodes = basic blocks• edges = (potential) jumps (and “fall-throughs”)

• here (as often): CFG on 3AIC (linear intermediate code)• also possible CFG on low-level code,• or also:

– CFG extracted from AST2

– here: the opposite: synthesizing a CFG from the linear code• explicit data structure (as another intermediate representation) or implicit only.

When saying on the slides, a CFG is “basically” a graph, we mean that, apart from some fundamentalswhich makes them graphs, details may vary. In particular, it may well be the case in a compiler, thatcfg’s are some accessible intermediate representation, i.e., a specific concrete data structure, with concretechoices for representation. For example, we present here control-flow graphs as directed graphs: nodes areconnected to other nodes via edges (depicted as arrows), which represent potential successors in terms ofthe control flow of the program. Concretely, the data structure may additionally (for reasons of efficiency)also represent arrows from successor nodes to predecessor nodes, similar to the way, that linked lists maybe implemented in a doubly-linked fashion. Such a representation would be useful when dealing with dataflow analyses that work “backwards”. As a matter of fact: the one data flow analysis we cover in thislecture (live variable analysis) is of that “backward” kind. Other bells and whistles may be part of theconcrete representation, like dedicated start and end nodes. For the purpose of the lecture, when don’t go

1Those techniques can also be used across basic blocks, but then they become more costly and challenging.2See also the exam 2016.

Page 383: CourseScript - uio.no

380 10 Code generation10.3 Basic blocks and control-flow graphs

into much concrete details, for us, cfg’s are: nodes (corresponding to basic blocks) and edges. This generalsetting is the most conventional view of cfg’s.

From 3AC to CFG: “partitioning algo”

• remember: 3AIC contains labels and (conditional) jumps⇒ algo rather straightforward• the only complication: some labels can be ignored• we ignore procedure/method calls here• concept: “leader” representing the nodes/basic blocks

Leader

• first line is a leader• GOTO i: line labelled i is a leader• instruction after a GOTO is a leader

Basic block

instruction sequence from (and including) one leader to (but excluding) the next leader or to the end ofcode

The CFG is determined by something that is called here “partitioning algorithm”. That’s a big name forsomething rather simple. We have learned in the context ofminimization of DFAs the so-called partitioningrefinement approach, which is a clever thing. The partitioning here is really not fancy at all, it hardlydeserves being called an algorithm. The task is to find in the linear IC largest stretches of straight-linecode, which will be the nodes of the CFG. Those blockes are demarkated by labels and gotos (and of coursethe overall beginning and end of the code.) There is only one small refinement of that: a label which is notused, i.e., not being the target of some jump, does not demarkate the border between to blocks, obviously.An unused label might as well be not there, anyway.

The partitioning algo is best illustrated by example, and since it’s easy enough, understanding the examplemeans understanding the algorithm.

Partitioning algo

• note: no line jumps to L2

Page 384: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 381

3AIC for faculty (from previous chapter)read xt1 = x > 0i f _ f a l s e t1 g o t o L1f a c t = 1l a b e l L2t2 = f a c t ∗ xf a c t = t2t3 = x − 1x = t3t4 = x == 0i f _ f a l s e t4 g o t o L2write f a c tl a b e l L1halt

Faculty: CFG

• goto/conditional goto: never inside block• not every block

– ends in a goto– starts with a label

• ignored here: function/method calls, i.e., focus on• intra-procedural cfg

Intra-procedural refers to “inside” one procedure. The opposite is inter-procedural. Inter-procedural anal-yses and the corresponding optimizations are quite harder than intra-procedural. In this lecture, we don’tcover inter-procedural considerations. Except that call sequences and parameter passing has to do ofcourse with relating different procedures and in that case deal with inter-procedural aspects. But thatwas in connection with the run-time environments, not what to do about in connection with analysis,register allocation, or optimization. So, in this lecture resp. this chapter, “local” refers to inside one basicblock, “global” refers to across many blocks (but inside one procedure). Later, we have a short look at“global” liveness analysis. As mentioned, we dont’ cover analyses across procedures, in the terminogyused here, they would be even “more global” than what we call “global”. Actually, in the more generalliterature, global program analysis would typically refer to analysis spanning more than one procedure.Indeed, one should avoid talking about local analysis without further qualifications; it’s better to speak ofblock-local analysis, procedure-local, method-local, or thread-local, to make clear which level of locality isaddressed.

Levels of analysis

• here: three levels where to apply code analysis / optimizations

Page 385: CourseScript - uio.no

382 10 Code generation10.3 Basic blocks and control-flow graphs

1. local: per basic block (block-level)2. global: per function body/intra-procedural CFG3. inter-procedural: really global, whole-program analysis

• the “more global”, the more costly the analysis and, especially the optimization (if done at all)

Loops in CFGs

• loop optimization: “loops” are thankful places for optimizations• important for analysis to detect loops (in the cfg)• importance of loop discovery: not too important any longer in modern languages.

Loops in a CFG vs. graph cycles

• concept of loops in CFGs not identical with cycles in a graph• all loops are graph cycles but not vice versa

• intuitively: loops are cycles originating from source-level looping constructs (“while”)• goto’s may lead to non-loop cycles in the CFG• importance of loops: loops are “well-behaved” when considering certain optimizations/code trans-

formations (goto’s can destroy that. . . )

Cycles in a graph are well-known. The definition of loops here, while closely related, is not identical withthat. So, loop-detection is not the same as cycle-detection. Otherwise there’d be no much point discussingit, since cycle detection in graphs is well known, for instance covered in standard algorithms and datastructures courses like INF2220/IN2010.

Loops are considered for specific graphs, namely CFGs. They are those kinds of cycles which come fromhigh-level looping constructs (while, for, repeat-until).

Loops in CFGs: definition

• remember: strongly connected components

Outermost loop

A outermost loop L in a CFG is a collection of nodes s.t.:

• strongly connected component (with edges completely in L)• 1 (unique) entry node of L, i.e. no node in L has an incoming edge3 from outside the loop except

the entry

• often additional assumption/condition: “root” node of a CFG (there’s only one) is not itself an entryof a loop

Loop

The definition is best understood in a small example. We have not bothered to define a nested loop, i.e.,we focused on outermost ones. The next example contains a nested loop (which is not a SCC).

3alternatively: general reachability.

Page 386: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 383

CFG

B0

B1

B2 B3

B4

B5

• Loops:– {B3, B4} (nested)– {B4, B3, B1, B5, B2}

• Non-loop:– {B1, B2, B5}

• unique entry marked red

The additional assumption mentioned on the slide about the special role of the root node of a control-flow graph is reminiscent, for example, of the condition we assumed for the start-symbol of context-freegrammars in the LR(0)-DFA construction: the start symbol must not be mentioned on the right-hand sideof any production (and if so, one simply added another start symbol S′). The reasons for the assumptionhere are similar: assuming that the root node is not itself part of a loop is not a fundamental thing, it justavoids (in some degenerate cases) a special case treatment. The assumption about the form of the control-flow graph is sometime called “isolated entry”. A corresponding restriction for the “end” of a control-flowgraph is “isolated exit”.

Loop non-examples

We did not very deep into the notion of loops. In particular we did not exactly specify the definition of anested loop (like {B3, B4} in one earlier example), but just defined the notion of top-level loop (with thehelp of SCC). We don’t need exactly the notion of loop in the way we do global analysis later (in the formof global liveness analysis). It works for non-loop cycles (“unstructured” programs) as well as for loop-onlygraphs, at least in the version we present it. If one knows that there are loops-only, one could improve the

Page 387: CourseScript - uio.no

384 10 Code generation10.3 Basic blocks and control-flow graphs

analysis (and others). Not in making the result of the analysis better, i.e., more precise, but making theanalysis algorithmis more efficient. That could be done by exploiting the structure of the graph better,for instance exploiting that loops are nested, for instance targeting inner-loops first. In the examples here,such “trick’s” would not work. They violate that each loop is supposed to have a well-define, uniqueentrance node. Since we don’t exploit the presence of loops, we don’t dig deeper here. It should be notedthat the definition of loops (with unique entry points) is classical in CFG and program analysis, one mayfind material where the notion of “loop” is used more loosely (ignoring the traditional definition) whereloop and cycle is basically used interchangably.

One is interested in loops not necessarily as a concept in itself, but in the larger context of optimization.We called loops a fertile ground of optimizations, which is of course also true for general cycles: bothinvolve (potential) repetition of code snippets, and shaving off execution time there, that’s a good idea.Often, the optimization is about moving things outside of the loop, typically “in front” of the loop. That’swhen a unique entrance of a loop comes in handy (sometimes called a loop-header). The non-loop examplesdon’t have a single loop-header.

Loops as fertile ground for optimizations

while ( i < n ) { i ++; A[ i ] = 3∗k }

• possible optimizations– move 3*k “out” of the loop– put frequently used variables into registers while in the loop (like i)

• when moving out computation from the loop:• put it “right in front of the loop”⇒ add extra node/basic block in front of the entry of the loop4

Data flow analysis in general

• general analysis technique working on CFGs• many concrete forms of analyses• such analyses: basis for (many) optimizations• data: info stored in memory/temporaries/registers etc.• control:

– movement of the instruction pointer– abstractly represented by the CFG

∗ inside elementary blocks: increment of the instruction pointer∗ edges of the CFG: (conditional) jumps∗ jumps together with RTE and calling convention

Data flowing from (a) to (b)

Given the control flow (normally as CFG): is it possible or is it guaranteed (“may” vs. “must” analysis)that some “data” originating at one control-flow point (a) reaches control flow point (b).

The characterization of data flow may sound plausible: some data is “created” at some point of originand then “flows” through the graph. In case of branching, one does not know if the data “flows left” or“flows right”, so one approximates by taking both cases into account. The “origin” of data seems alsoclear, for instance, an assignment “creates” or defines some piece of data (as l-value), and one may ask ifthat piece of data is (potentially or necessarily) used someplace else (as r-value), without knowing resp.being interesting in its exact value that is being used. This is sometimes also called def-use analysis. Laterwe will discuss definitions and uses. Another illustration of that picture may be the following question:assuming one has an data-based program with user interaction. The user can interact with it but inputting

4That’s one of the motivations for unique entry.

Page 388: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 385

data (perhaps via some web-interface or similar). That information is then processed and forwarded tosome SQL-data base. Now, the inputs are points of origin, and one may ask if this data may reach theSQL database without being “sanitized” first (i.e., checked for compliance and whether the user did notinject into the input some escapes and SQL-commands).

Anyway, this picture of (user) data originating somewhere in a CFG and then flowing through it is plausibleand not wrong per se, but is too narrow in some way. It sounds as data flow analysis that the data flowanalysis traces (in an abstract, approximative manner) through the graph.

Not all data flow analyses are like that. Actually, the live variable analysis will be an example for that.So more generally, it’s more like that “information pieces of interest” are traced through the graph. Forliveness analysis, the piece of information being traced is future usage. Since the information of interestsmay not be an abstract version of real data, it may also not necessarily be traced in a forward manner.For liveness analysis, one is interested in whether a variable may be used in the future. So the informationof interest is the locations of usage. That are the points of origin of that information one is interested in.And from those points on, the information is traced backwards through the graph. So, this is an exampleof a backward analysis (there are others). Of course, when the program runs, real data always “flows”forwardly, as the program runs forwardly: first data orignates and later is may be consumed. But forsome analysis (like liveness analysis), one changes perspective: instead of asking: where will informationoriginating here (potentially or necessarily) flows to, one asks:

where did information or data arriving here orignate (potentially or necessarily) from.

Data flow as abstraction

• data flow analysis DFA: fundamental and important static analysis technique• it’s impossible to decide statically if data from (a) actually “flows to” (b)⇒ approximative (= abstraction)• therefore: work on the CFG: if there are two options/outgoing edges: consider both• Data-flow answers therefore approximatively

– if it’s possible that the data flows from (a) to (b)– it’s neccessary or unavoidable that data flows from (a) to (b)

• for basic blocks: exact answers possible

Treatment of basic blocs

Basic blocks are “maximal” sequences of straight-line code. We encountered a treatment of straight-linecode also in the chapter about intermediate code generatation. The technique there was called static simu-lation (or simple symbolic execution). Static simulation was done for basic blocks only and for the purposeof translation. The translation of course needs to be exact, non-approximative. Symbolic evaluation alsoexist (also for other purposes) in more general forms, especially also working on conditionals.

In summary, the general message is: for SLC and basic blocks, exact analyses are possible, it’s for theglobal analysis, when one (necessarily) resorts to overapproximation and abstraction.

Data flow analysis: Liveness

• prototypical / important data flow analysis• especially important for register allocation

Page 389: CourseScript - uio.no

386 10 Code generation10.3 Basic blocks and control-flow graphs

Basic question

When (at which control-flow point) can I be sure that I don’t need a specific variable (temporary, register)any more?

• optimization: if not needed for sure in the future: register can be used otherwise

Live

A “variable” is live at a given control-flow point if there exists an execution starting from there (given thelevel of abstraction), where the variable is used in the future.

Static liveness

The notion of liveness given in the slides correspond to static liveness (the notion that static livenessanalysis deals with). That is hidden in the condition “given the level of abstraction” for example, usingthe given control-flow graph. A variable in a given concrete execution of a program is dynamically live ifin the future, it is still needed (or, for non-deterministic programs: if there exists a future, where it’s stillused.) Dynamic liveness is undecidable, obviously. We are concerned here with static liveness.

Definitions and uses of variables

• talking about “variables”: also temporary variables are meant.• basic notions underlying most data-flow analyses (including liveness analysis)• here: def’s and uses of variables (or temporaries etc.)• all data, including intermediate results, has to be stored somewhere, in variables, temporaries, etc.

Def’s and uses

• a “definition” of x = assignment to x (store to x)• a “use” of x: read content of x (load x)

• variables can occur more than once, so

• a definition/use refers to instances or occurrences of variables (“use of x in line l ” or “use of x inblock b ”)

• same for liveness: “x is live here, but not there”

Page 390: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 387

Defs, uses, and liveness

CFG

0: x = v + w

. . .

2: a = x + c

3: x =u + v4: x = w

5: d = x + y

• x is “defined” (= assigned to) in 0, 3, and 4• u is live “in” (= at the end of) block 2, as it may be used in 3• a non-live variable at some point: “dead”, which means: the corresponding memory can be reclaimed• note: here, liveness across block-boundaries = “global” (but blocks contain only one instruction

here)

Def-use or use-def analysis

• use-def: given a “use”: determine all possible “definitions”• def-use: given a “def”: determine all possible “uses”• for straight-line-code/inside one basic block

– deterministic: each line has has exactly one place where a given variable has been assigned tolast (or else not assigned to in the block). Equivalently for uses.

• for whole CFG:– approximative (“may be used in the future”)– more advanced techiques (caused by presence of loops/cycles)

• def-use analysis:– closely connected to liveness analysis (basically the same)– prototypical data-flow question (same for use-def analysis), related to many data-flow analyses

(but not all)

Side-remark: SSA

Side remark: Static single-assignment (SSA) format:

• at most one assignment per variable.

• “definition” (place of assignment) for each variable thus clear from its name

We don’t go into SSA, but we shortly mention it in the script here, as it’s a very inportant intermediaterepresentation, which is related to the issues we are discussing here (data flow analysis, def-use and use-def). As we hinted at: there are many data-flow analyses (not just liveness), many of them quite similarconcerning the underlying principles. Transforming code into SSA is an effort, i.e., involves some data-flowtechniques itself. However, once in SSA format, many data-flow analysis become more efficient. Whichmeans, investing one time in SSA may pay off multiple times, if one does more than just liveness analysis.

Page 391: CourseScript - uio.no

388 10 Code generation10.3 Basic blocks and control-flow graphs

As a final remark: temporaries in our 3AIC within one elementary block follows the “single-assignment”principle. Each one is assigned to not more than once. The user variables, though can be assigned to morethan once. For straight-line code, i.e., local per elementary block, having also the other variables follow thesingle-assignment scheme would be very easy. Instead of assigning to the same variable a multiple times,one simply renames the variables into a1, a2, a3 etc. each time the original a is updated (and keepingtrack of the new names). So, for SLC, SSA is not a big deal. It becomes more interesting and tricky tofigure out how to deal with branching and loops, but, as said, we don’t go there.

Calculation of def/uses (or liveness . . . )

• three levels of complication1. inside basic block2. branching (but no loops)3. Loops4. [even more complex: inter-procedural analysis]

For SLC/inside basic block

• deterministic result• simple “one-pass” treatment enough• similar to “static simulation”• [Remember also AG’s]

For whole CFG

• iterative algo needed• dealing with non-determinism: over-approximation• “closure” algorithms, similar to the way e.g., dealing with first and follow sets• = fix-point algorithms

We encountered a closure or saturation algorithm in other contexts, for instance when calculating thefirst and follow sets (potentially using a worklist algo). Also the calculation of the epsilon-closure is anexample, and there are others.

Inside one block: optimizing use of temporaries

• simple setting: intra-block analysis & optimization, only• temporaries:

– symbolic representations to hold intermediate results– generated on request, assuming unbounded numbers– intention: use registers

• limited about of register available (platform dependent)

Page 392: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 389

Assumption about temps (here)

• temp’s don’t transfer data across blocks (6= program var’s)⇒ temp’s dead at the beginning and at the end of a block

• but: variables have to be assumed live at the end of a block (block-local analysis, only)

At this point, one can check one’s undestanding: why is it that the variables are assumed live (as opposedto assumed dead, or perhaps assumed a status “I-don’t-know”)?

Intra-block liveness

Code

t1 := a − bt2 := t1 ∗ aa := t1 ∗ t2t1 := t1 − ca := t1 ∗ a

• neither temp’s nor vars in the example are “single assignment”,• but first occurrence of a temp in a block: a definition (but for temps it would often be the case,

anyhow)• let’s call operand: variables or temp’s• next use of an operand:• uses of operands: on the rhs’s, definitions on the lhs’s• not good enough to say “t1 is live in line 4” (why?)

Note: the 3AIC may allow also literal constants as operator arguments; they don’t play a role rightnow. In intermediate code generated the way we disucssed in the previous chapter: temporaries are alwaysgenerated new for each intermediate result, so they would not be reused in the way shown in the example.

In the following, the “next-uses” of operands and variables are arranged in a graph-like manner. As weare treating straight-line code, there are no cycles in that graph. In other words it’s an acyclic graph.That form of graph is also known as DAG: directed acyclic graph. NB: the graph on the next slides don’tuse “arrows” (as would be common in directed graphs). Being acyclic, the is only one direction here,that’s from bottom to top. The incoming edges indicate the dependencies of an intermediate result on it’soperands. Since we are dealing with 3A(I)C, there are two operands (or less), which means, nodes havetypically 2 incoming edges (from below). The nodes are labelled by the operator as well as the targetmemory location (variable or temporary).

The DAG, reading it from bottom to top, represents the “next-use” for each variable/temporary. Asmentioned, each node has at most 2 incoming edges (an in-degree of 2). Since a variable may have morethan 2 next uses, the out-degree may well arbitrarily large. In the example, t1 is used for instance, 3 timesat some point in the code.

Page 393: CourseScript - uio.no

390 10 Code generation10.3 Basic blocks and control-flow graphs

DAG of the block

DAG

∗ −

a0 b0 c0

a

a t1

t2

t1

Text

• no linear order (as in code), only partial order• the next use: meaningless• but: all “next” uses visible (if any) as “edges upwards”• node = occurrences of a variable• e.g.: the “lower node” for “defining”assigning to t1 has three uses• different “versions” (instances) of t1

DAG / SA

SA = “single assignment”

• indexing different “versions” of right-hand sides• often: temporaries generated as single-assignment already• cf. also constraints + remember AGs

∗ −

a0 b0 c0

a2

a1 t11

t02

t01

Page 394: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 391

Intra-block liveness: idea of algo

• liveness-status of an operand: different from lhs vs. rhs in a given instruction• informal definition: an operand is live at some occurrence, if it’s used some place in the future

consider statement x1 := x2 op x3

• A variable x is live at the beginning of x1 := x2 op x3, if1. if x is x2 or x3, or2. if x live at its end, if x and x1 are different variables

• A variable x is live at the end of an instruction,– if it’s live at beginning of the next instruction– if no next instruction

∗ temp’s are dead∗ user-level variables are (assumed) live

Note: the graph on the top left-hand side of the slide is not the same as the DAG shown earlier. At leastnot directly, and it contains analogous information (except that the dag has no line-numbers). But thearrows that added to the code show the next uses. In the dag, it’s directly shown that t01 is used 3 times.In the next-use arrangement, one sees only the resp. next use in terms of line numbers, but indirectly, theinformation that t1 is used 3 times is avaible by the chain of 3 next uses. The chain stops, when t1 isupdated. Since the DAG representation has no notion of “lines”, one cannot talk about “the next use”one after the other, it’s about “all future uses”. However, there is a analogue to the notion of line numberin the DAG, that is the variable used on the left-hand side of the assignment, represented as inner nodes,and disambiguated (in the SSA spirit) by super-scripts. For instance there is t01 and t11, corresponding tothe two lines with t1 on the left-hand side of the assignment. What is missing in the DAG is the lineararrangement of the lines, which assignment is supposed to be executed first, but otherwise: instead of 5lines of code, there are 5 inner nodes of the DAG.

So, the arrows indicates the next uses of a variable, if any. It also indicates if a variable is not used in thefuture (but the special “ground symbol”). However, the start-point of the edges are not all really helpfulin getting an overview. In the first line: the arrow from t1 to t1 in the second line rougly corresponds tothe edge in the DAG (as it goes from a definition (of t1) its next use. However, the edge from a in thefirst line to a in the second line is less motivated: it would correspond to an edge from a “use” to a “nextuse”, but normally one is not interested in that too much. Therefore, one should not “overinterpret” thegraph in the figure too much.

A better representation would be, for each line, pointers from all variables to next uses, not just fromvariables that happen to be mentioned in a line.

Page 395: CourseScript - uio.no

392 10 Code generation10.3 Basic blocks and control-flow graphs

Liveness

Previous “inductive” definition

expresses liveness status of variables before a statement dependent on the liveness status of variables aftera statement (and the variables used in the statement)

• core of a straightforward iterative algo• simple backward scan• the algo we sketch:

– not just boolean info (live = yes/no), instead:– operand live?

∗ yes, and with next use inside is block (and indicate instruction where)∗ yes, but with no use inside this block∗ not live

– even more info: not just that but indicate, where’s the next use

Backward scan and SLC

Remember in connection with the given algo for intra-block analysis, i.e. analysis for straight-line code.In the presence of loops/analysing a complete CFG, a simple 1-pass does not suffice. More advancedtechniques (“multiple-scans”) are needed then, which may amount to fixpoint calculations. Doing fixpointcalculations increases the complexity of the problem (And the needed theoretical background). As a furtherside remark: earlier in this chapter we elaborated on the fine line that separates cycles in a graph from thenotion of loops, where loops are a particular well-structured from of cycles. Without going into details: ifone is dealing with cfg’s which are guaranteed to contain only loops (but not proper more general cycles),one can apply special techniques or strategies to deal with the cycles. In particular, one can attack theloops “inside out”. That strategy is possible, as loops (as opposed to cycles) appear “nested”. Attackingthe loops in that manner is more efficient than iterating though the graph without taking the nestingstructure as compass.

Algo: dead or alive (binary info only)

// −−−−− i n i t i a l i s e T −−−−−−−−−−−−−−−−−−−−−−−−−−−−for a l l e n t r i e s : T[ i , x ] := Dexcept : for a l l v a r i a b l e s a // but not temps

T[ n , a ] := L ,//−−−−−−− backward pass −−−−−−−−−−−−−−−−−−−−−−−−−−−−for i n s t r u c t i o n i = n−1 down to 0

l e t c u r r e n t i n s t r u c t i o n at i +1: $x := y \ op\ z$ ;T[ i , o ] := T[ i +1,o ] ( for a l l o )T[ i , x ] := D // note o r d e r ; x can `` equal ' ' y or zT[ i , y ] := LT[ i , z ] := L

end

• Data structure T : table, mapping for each line/instruction i and variable: boolean status of“live”/“dead”

• represents liveness status per variable at the end (i.e. rhs) of that line• basic block: n instructions, from 1 until n, where “line 0” represents the “sentry” imaginary line

“before” the first line (no instruction in line 0)• backward scan through instructions/lines from n to 0

Page 396: CourseScript - uio.no

10 Code generation10.3 Basic blocks and control-flow graphs 393

Algo′: dead or else: alive with next use

• More refined information• not just binary “dead-or-alive” but next-use info⇒ three kinds of information

1. Dead: D2. Live:

– with local line number of next use: L(n)– potential use of outside local basic block L(⊥)

• otherwise: basically the same algo

// −−−−− i n i t i a l i s e T −−−−−−−−−−−−−−−−−−−−−−−−−−−−for a l l e n t r i e s : T[ i , x ] := $\ l i v e n e x t d e a d n o n l o c a l $except : for a l l v a r i a b l e s a // but not temps

T[ n , a ] := $\ l i v e n e x t n o n l o c a l $ ,//−−−−−−− backward pass −−−−−−−−−−−−−−−−−−−−−−−−−−−−for i n s t r u c t i o n i = n−1 down to 0

l e t c u r r e n t i n s t r u c t i o n at i +1: $x := y \ op\ z$ ;T[ i , o ] := T[ i +1,o ] ( for a l l o )T[ i , x ] := $\ l i v e n e x t d e a d l o c a l $ // note o r d e r ; x can `` equal ' ' y or zT[ i , y ] := $\ l i v e n e x t l o c a l { i +1}$T[ i , z ] := $\ l i v e n e x t l o c a l { i +1}$

end

Run of the algo′

Run/result of the algo

line a b c t1 t2[0] L(1) L(1) L(4) D D1 L(2) L(⊥) L(4) L(2) D2 D L(⊥) L(4) L(3) L(3)3 L(5) L(⊥) L(4) L(4) D4 L(5) L(⊥) L(⊥) L(5) D5 L(⊥) L(⊥) L(⊥) D D

Picture

t1 := a − bt2 := t1 ∗ aa := t1 ∗ t2t1 := t1 − ca := t1 ∗ a

Page 397: CourseScript - uio.no

394 10 Code generation10.4 Code generation algo

In the table, the entries marked read indicate where “changes” occur; remember that the table is filledfrom bottom to top, we are doing a backward scan.

10.4 Code generation algo

Simple code generation algo

• simple algo: intra-block code generation• core problem: register use• register allocation & assignment• hold calculated values in registers longest possible• intra-block only ⇒ at exit:

– all variables stored back to main memory– all temps assumed “lost”

• remember: assumptions in the intra-block liveness analysis

Some make a distinction between register allocation: “should the data be held in register (and how long)”vs. register assignment: “which of the available registers to use for that”.

Limitations of the code generation

• local intra block:– no analysis across blocks– no procedure calls, etc.

• no complex data structures– arrays– pointers– . . .

some limitations on how the algo itself works for one block

• for read-only variables: never put in registers, even if variable is repeatedly read– algo works only with the temps/variables given and does not come up with new ones– for instance: DAGs could help

• no semantics considered– like commutativity: a+ b equals b+ a

The limitation that read-only variables are not put into registers is not a “design-goal”: it’s a not sosmart side-effect of the way the algorithm works. The algo is a quite straightforward way of making use ofregisters which works block-local. Due to its simplicity, the treatment of read-only variables leaves roomfor improvement. The code generation makes use of liveness information, if available. In case one hasinvested in some global liveness analysis (as opposed to a local one discussed so far), the code generationcould profit from that by getting more efficient. But its correctness does not rely on that. Even withoutliveness information at all, it is correct, by assuming conservatively or defensively, that all variables arealways live (which is the worst-case assumption).

We decompose the code generation into two parts, discussed separately: the code generation itself and,afterwards getreg, as auxiliary procedure where to store the result. One may even say, there is a thirdingredient to the code generation, namely the liveness information, which is however, calculated separatelyin advance (and we have discussed that part already). The code generation, though, goes through thestraight-line 3AIC line-by-line and in a forward manner, calling repeatedly getreg as helper function todetermine which register or memory address to use. We start by mentioning the general purpose of thegetreg function, but postpone the realization for afterwards.

Page 398: CourseScript - uio.no

10 Code generation10.4 Code generation algo 395

As far as the code generation may is concerned: finally there’s no way around the fact that we need totranslate 3-address lines of code to 2-address instructions. Since the two-address instructions have onesource and the second source is, at the same time, also the destination of the instruction, one operand is“lost”. So, in many cases, the code generation need to save one of its 3 arguments in a first step somewhere,to avoid that one operand is really overwritten. We have gotten a taste of that in the simple examplesearlier used to illustrate the cost model. The “saving place” for the otherwise lost argument is, at the sametime the place where the end result is supposed to be and it’s the place determined by getreg.

Of course, there are situations, when the operand does not need to be moved to the “saving place”. Oneis, obviously, when it’s already there. The register and address descriptors help in determining a situationlike that.

We explain the code generation algo in different levels of details, first without updating the book-keeping,afterwards keeping the books in sync, and finally, also keeping liveness information into account. Still,even the most detailed version hide some details, for instance, if there is more than one location to choosefrom, which one is actually taken. The same will be the case for the getreg function later: some choice-points are left unresolved. It’s not a big deal, it’s not a question of correctness, it’s more a question of howefficient the code (on average) is going to be.

Purpose and “signature” of the getreg function

• one core of the code generation algo• simple code-generation here ⇒ simple getreg

getreg function

available: liveness/next-use info

Input: TAIC-instruction x := y op z

Output: return location where x is to be stored

• location: register (if possible) or memory location

In the 3AIC lines, x, y, and z can also stand for temporaries. Resp. there’s no difference anyhow, so it doesnot matter. Temporaries and variables are different, concerning their treatment for (local) liveness, butthat information is available via the liveness information. For locations (in the 2AC level), we sometimesuse l representing registers or memory addresses.

Code generation invariant

it should go without saying . . . :

Basic safety invariant

At each point, "live" variables (with or without next use in the current block) must exist in at least one location.

• another invariant: the location returned by getreg is the one where the result of a 3AIC assignment ends up


Register and address descriptors

• code generation/getreg: keep track of
  1. register contents
  2. addresses for names

Register descriptor

• tracking the current "content" of registers (if any)
• consulted when a new register is needed
• as said: at block entry, assume all registers unused

Address descriptor

• tracking the location(s) where the current value of a name can be found
• possible locations: register, stack location, main memory
• more than one location possible (not due to overapproximation, but exact tracking)

By saying that the register descriptor is needed to track the content of a register, we don't mean tracking the actual value (which will only be known at run-time). Rather, it keeps track of the following information: the content of the register corresponds to the (current content of the) following variable(s). Note: there might be situations where a register corresponds to more than one variable in that sense.
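As an illustration only, the two descriptors could be represented as in the following minimal Python sketch; the names (RegisterDescriptor, AddressDescriptor, contents, locations) are invented for this sketch and are not part of the script's pseudo-code.

class RegisterDescriptor:
    """Maps each register to the set of variables whose current value it holds."""
    def __init__(self, registers):
        # at block entry all registers are assumed unused
        self.contents = {r: set() for r in registers}

class AddressDescriptor:
    """Maps each variable to the set of locations (registers or the variable's
    home memory address) where its current value can be found."""
    def __init__(self, variables):
        # initially each variable resides only in its home position
        self.locations = {v: {v} for v in variables}

    def locations_of(self, var):
        return self.locations[var]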

Code generation algo for x := y op z

We start with a "textual" version first, followed by one using a little more programming/math notation. One can see the general form of the generated code: one 3AIC line is translated into 2 lines of 2AC or, if lucky, into 1 line of 2AC.

1. determine location (preferably register) for result

l = getreg("x := y op z")

2. make sure that the value of y is in l:
   • consult the address descriptor for y ⇒ current locations l_y for y
   • choose the best location l_y from those (preferably a register)
   • if the value of y is not in l, generate

MOV l_y, l

3. generate

OP l_z, l        // l_z: a current location of z (prefer registers)

• update the address descriptor: x ↦ l
• if l is a register: update the register descriptor: l ↦ x

4. exploit liveness/next use info: update register descriptors


Skeleton code generation algo for x := y op z

l = getreg("x := y op z")                          // target location for x
if l ∉ locations(y, T_a) then
    let l_y ∈ locations(y, T_a) in emit("MOV l_y, l");
let l_z ∈ locations(z, T_a) in emit("OP l_z, l");

• "skeleton"
  – non-deterministic: we ignore how to choose l_z and l_y
  – we ignore the book-keeping in the name and address descriptor tables (⇒ step 4 also missing)
  – details of getreg hidden

The let l_y ∈ . . . notation is meant as pseudo-code for a non-deterministic choice of, in this case, the location l_y from some set of possible candidates. Note the invariant we mentioned: it is guaranteed that y is stored somewhere (at least while still live), so it is guaranteed that there is at least one l_y to pick.

Also note (again) the order of the arguments in 2AC. We save y at some location, on the slide called l. That one is mentioned as the second argument of the 2AC instruction. But the second argument, which at the same time is also the destination location, may better be thought of as the first input. For addition it may not matter much, but for example SUB b, a corresponds to a - b (with the result stored in a). Because of that, and the way the translation works, it is also clear why we save y and not z.
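To make the skeleton concrete, here is a small Python sketch of this translation step for one 3AIC line x := y op z. It is an illustration under assumptions: getreg, best_location and emit are hypothetical helpers (getreg picks the target location, best_location picks a preferred location from a set, emit outputs one 2AC instruction), and the descriptor sketch from above is reused.

def gen_assignment(x, y, op, z, addr_desc, getreg, best_location, emit):
    # 1. determine the target location for x (also the "saving place" for y)
    l = getreg(x, y, op, z)
    # 2. make sure y's value is in l
    if l not in addr_desc.locations_of(y):
        l_y = best_location(addr_desc.locations_of(y))
        emit(f"MOV {l_y}, {l}")            # second argument is the destination
    # 3. generate the operation; e.g. for op = SUB this computes l - l_z into l
    l_z = best_location(addr_desc.locations_of(z))
    emit(f"{op} {l_z}, {l}")
    # book-keeping of the descriptors and the liveness-based clean-up (step 4)
    # are omitted in this skeleton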

Exploit liveness/next use info: recycling registers

• register descriptors don't update themselves during code generation
• once set (e.g. as R0 ↦ t), the info stays, unless reset
• thus in step 4 for x := y op z:

Code generation algo for x := y op z

l = getreg("i: x := y op z")              // i: the instruction's line number/label
if l ∉ locations(y, T_a)
then let l_y = best(locations(y, T_a))
     in emit("MOV l_y, l")
else skip;
let l_z = best(locations(z, T_a))
in emit("OP l_z, l");
T_a := T_a \ (_ ↦ l);                     // all bindings mentioning l become stale
T_a := T_a[x ↦ l];
if l is a register
then T_r := T_r[l ↦ x];

if y not live at i and T_a(y) = r  then  T_r := T_r \ (r ↦ y)
if z not live at i and T_a(z) = r  then  T_r := T_r \ (r ↦ z)

Updating and exploiting liveness info by recycling registers

if y and/or z are currently

• not live and
• in registers,

⇒ "wipe" the info from the corresponding register descriptors

• side remark: for the address descriptor
  – no such "wipe" is needed, because it won't make a difference (y and/or z are not live anyhow)
  – their address descriptor entries won't be consulted further in the block


In the pseudo-code we make use of some math-like notation. We write T_a and T_r for the two tables. They may be implemented as arrays or look-up structures. For updating we use notation like T_a[x ↦ l]. This is meant to say: after the update, x is stored in l, the old information being overwritten. A variable can be stored in different locations, but updating x in such an assignment invalidates all other locations; they become out-of-date or stale, and the only place where x then resides is l. By T_a \ (_ ↦ l) we mean that we remove bindings, namely all that mention l.

Since there are situations where one location can contain (the content of) more than one variable, one may also have to support operations like T_r[l ↦∪ x], meaning that the old information (here for l) is not overwritten, but another "binding" is added: after the update, location l also contains (the value of) x, without forgetting the old values. This is not needed in the translation of our 3AIC instruction, but would occur when translating x := y for instance, i.e., when copying values.
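Continuing the hypothetical Python sketch of the descriptors, the update operations just described could look as follows (the helper names are invented for illustration):

def remove_location(addr_desc, l):
    # T_a without (_ ↦ l): l is about to be overwritten, so every copy of a
    # variable's value residing in l becomes stale and must be forgotten
    for locs in addr_desc.locations.values():
        locs.discard(l)

def set_exclusive_location(addr_desc, x, l):
    # T_a[x ↦ l]: after the assignment, l is the only up-to-date location of x
    addr_desc.locations[x] = {l}

def add_register_binding(reg_desc, l, x):
    # T_r[l ↦∪ x]: register l now holds x in addition to what it held before
    # (needed e.g. when translating a plain copy  x := y)
    reg_desc.contents[l].add(x)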

We could also check whether x itself is live and, if not, do the corresponding wiping for x as well. In that case the whole assignment is meaningless and, as a consequence, also the liveness status of y and z could change in turn . . .

As an invariant, a variable never resides in more than one register.

getreg algo: x := y op z

• goal: return a location for x
• basically: check the possibilities of register use
• starting with the "cheapest" option

Do the following steps, in that order

1. in place: if the register currently holding y can be reused for x (details below), return that register

2. new register: if there's an unused register: return that

3. purge a filled register: choose (more or less cleverly) a filled register, save its content if needed, and return that register

4. use main memory: if all else fails

getreg algo: x := y op z in more detail

1. if
   • y is in register R,
   • R holds no alternative names,
   • y is not live and has no next use after the 3AIC instruction,
   ⇒ return R

2. else: if there is an empty register R′: return R′
3. else: if

• x has a next use [or the operator requires a register] ⇒
   – find an occupied register R
   – store R into M if needed (MOV R, M)
   – don't forget to update M's address descriptor, if needed
   – return R

4. else: x is not used in the block, or no suitable occupied register can be found
   • return x (i.e., its memory location) as location l

• choice of the purged register: heuristics
• remember (for step 3): registers may contain the value of more than one variable ⇒ multiple MOVs may be needed (see the sketch below)
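The following Python sketch puts these steps together. It is not the script's official algorithm: live_after(i, v) stands for an assumed liveness/next-use oracle, the descriptor sketch from above is reused, and the choice in step 3 is deliberately naive.

def getreg(x, y, z, i, reg_desc, addr_desc, live_after, emit):
    # 1. "in place": reuse the register holding y if it holds nothing but y
    #    and y is dead after instruction i
    for r, held in reg_desc.contents.items():
        if held == {y} and not live_after(i, y):
            return r
    # 2. an empty register, if there is one
    for r, held in reg_desc.contents.items():
        if not held:
            return r
    # 3. purge an occupied register (only if x is needed later); a real
    #    implementation would pick the register heuristically and avoid
    #    registers still needed by this very instruction
    if live_after(i, x):
        r = next(iter(reg_desc.contents))
        for var in reg_desc.contents[r]:
            if addr_desc.locations_of(var) == {r}:   # value exists only in r
                emit(f"MOV {r}, {var}")              # save it to var's home position
                addr_desc.locations[var].add(var)
        reg_desc.contents[r] = set()
        return r
    # 4. fall back to x's home memory location
    return x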


Sample TAIC

d := (a-b) + (a-c) + (a-c)

t := a - b
u := a - c
v := t + u
d := v + u

line    a       b       c       d       t       u       v
[0]     L(1)    L(1)    L(2)    D       D       D       D
1       L(2)    L(⊥)    L(2)    D       L(3)    D       D
2       L(⊥)    L(⊥)    L(⊥)    D       L(3)    L(3)    D
3       L(⊥)    L(⊥)    L(⊥)    D       D       L(4)    L(4)
4       L(⊥)    L(⊥)    L(⊥)    L(⊥)    D       D       D

Code sequence

• address descriptors: the "home position" is not explicitly needed
• e.g. variable a is to be found "at a" (if not stale), as indicated in line "0"
• in the table: only changes (from top to bottom) are indicated
• after line 3:

  – t is dead
  – t resides in R0 (and nothing else is in R0)
  → reuse R0

• Remark: info in [brackets]: “ephemeral”
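The table is the result of the block-local backward liveness/next-use scan discussed earlier. As a reminder, here is a Python sketch of such a scan; the representation of a 3AIC line as a triple (target, op1, op2) is an assumption made for the sketch. Applied to the block above with live_at_exit = {'a', 'b', 'c', 'd'}, it reproduces the table rows [0] to 4.

def local_liveness(block, live_at_exit):
    # Returns a list `table` where table[i] is the status after line i
    # (table[0] = block entry). A variable's status is ('D',) for dead or
    # ('L', n) with n the line of the next use (None = live beyond the block).
    variables = {v for line in block for v in line}
    status = {v: ('L', None) if v in live_at_exit else ('D',) for v in variables}
    table = [None] * (len(block) + 1)
    table[len(block)] = dict(status)          # exit assumptions (last row)
    for i in range(len(block), 0, -1):        # lines numbered 1..n, scanned backwards
        target, op1, op2 = block[i - 1]
        status[target] = ('D',)               # target is overwritten in line i
        status[op1] = ('L', i)                # operands are used in line i
        status[op2] = ('L', i)
        table[i - 1] = dict(status)           # status before line i = after line i-1
    return table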

10.5 Global analysis

From “local” to “global” data flow analysis

• data is stored in variables and "flows from definitions to uses"
• liveness analysis
  – one prototypical (and important) data flow analysis
  – so far: intra-block = straight-line code
• related to
  – def-use analysis: given a "definition" of a variable at some place, where is it (potentially) used?


  – use-def analysis: the inverse question ("reaching definitions")
• other similar questions:
  – has the value of an expression been calculated before ("available expressions")
  – will an expression be used in all possible branches ("very busy expressions")

Global data flow analysis

• block-local
  – block-local analysis (here liveness): exact information possible
  – block-local liveness: one backward scan
  – important use of liveness: register allocation; temporaries typically don't survive blocks anyway

• global: working on complete CFG

2 complications

• branching: non-determinism, unclear which branch is taken
• loops in the program (loops/cycles in the graph): a simple single pass through the graph does not cut it any longer
• exact answers are no longer possible (undecidable) ⇒ work with safe approximations
• this is a general characteristic of DFA

Generalizing block-local liveness analysis

• assumptions for the block-local analysis
  – all program variables are (assumed) live at the end of each basic block
  – all temporaries are assumed dead there

• now: we do better, info across blocks

at the end of each block:

which variables may be used in subsequent block(s).

• now: re-use of temporaries (and thus of the corresponding registers) across blocks becomes possible
• remember the local liveness algo: it determined the liveness status per var/temp at the end of each "line/instruction"

We said that "now" a re-use of temporaries is possible. That is in contrast to the block-local analysis we did earlier, before the code generation. Since we had a local analysis only, we had to work with assumptions concerning the variables and temporaries at the end of each block, and the assumptions were "worst-case", to be on the safe side. Assuming variables live, even if actually they are not, is safe; the opposite may be unsafe. For temporaries, we assumed "deadness". So the code generator, under this assumption, must not reuse temporaries across blocks.

One might also make a parallel to the "local" liveness algorithm from before. The problem to be solved for liveness is to determine the status of each variable at the end of each block. In the local case, the question was analogous, but for the "end of each line". For the sake of the parallel, one could consider each line as an individual block; the global analysis would actually give identical results there as well. The fact that one "lumps together" maximal sequences of straight-line code into so-called basic blocks, and thereby distinguishes between local and global levels, is a matter of efficiency, not a principled, theoretical distinction. Remember that basic blocks can be treated in one single pass, whereas the whole control-flow graph cannot: due to the possibility of loops or cycles there, one will have to treat "members" of such a loop potentially more than once (later we will see the corresponding algorithm). So, before addressing the global level with its loops, it is a good idea to "pre-calculate" the data-flow situation per block, where


such a treatment requires one pass per individual block to get an exact solution. That avoids potential line-by-line recomputation in case a basic block needs to be treated multiple times.

Connecting blocks in the CFG: inLive and outLive

• CFG:
  – a pretty conventional graph (nodes and edges, often with designated start and end node)
  – nodes = basic blocks = contain straight-line code (here 3AIC)
  – being conventional graphs:
    ∗ conventional representations possible
    ∗ e.g. nodes with lists/sets/collections of immediate successor nodes plus immediate predecessor nodes
• remember: local liveness status
  – can be different before and after one single instruction
  – liveness status before is expressed as dependent on the status after ⇒ backward scan
• now per block: inLive and outLive

Loops vs. cycles

As a side remark: earlier we noted that loops are closely related to cycles in a graph, but not 100% the same. Some forms of analyses resp. algos assume that the only cycles in the graph are loops. However, the techniques presented here work generally, i.e., the worklist algorithm in the form presented here works just fine also in the presence of general cycles. If one had no cycles and no loops, special strategies or variations of the worklist algo could exploit that to achieve better efficiency. We don't pursue that issue here. In that connection it might also be mentioned: if one had a program without loops, the best strategy would be backwards. If one had straight-line code (no loops and no branching), the algo corresponds directly to the "local" liveness analysis explained earlier.

inLive and outLive

• tracing / approximating the set of live variables⁵ at the beginning and end of each basic block
• inLive of a block depends on
  – outLive of that block and
  – the SLC inside that block
• outLive of a block depends on inLive of the successor blocks

Approximation: To err on the safe side

Judging a variable (statically) live is always safe. Wrongly judging a variable dead (when it actually will be used) is unsafe.

• goal: smallest (but safe) possible sets for outLive (and inLive)

⁵ To stress "approximation": inLive and outLive contain sets of statically live variables. Whether those are dynamically live or not is undecidable.


Example: factorial CFG

CFG picture

Explanation

• inLive and outLive
• the picture shows arrows to successor nodes
• also needed: predecessor nodes (reverse arrows)

node/block    predecessors
B1            ∅
B2            {B1}
B3            {B2, B3}
B4            {B3}
B5            {B1, B4}

Block local info for global liveness/data flow analysis

• one CFG per procedure/function/method
• as for SLC: the algo works backwards
• for each block: an underlying block-local liveness analysis


3-valued block local status per variable

result of block-local live variable analysis

1. locally live on entry: variable used (before being overwritten, or not overwritten at all)
2. locally dead on entry: variable overwritten (before being used, or not used at all)
3. status not locally determined: variable neither assigned to nor read locally

• for efficiency: precompute this info before starting the global iteration ⇒ avoids recomputation for blocks in loops

Precomputation

We mentioned that, for efficiency, it's good to precompute the local data flow per basic block. In the smallish examples we look at in the lecture, exercises, etc., we don't pre-compute; we often do it simply on-the-fly by "looking at" the blocks' SLC.

Global DFA as iterative “completion algorithm”

• different names for the general approach
  – closure algorithm, saturation algo
  – fixpoint iteration
• basically: a big loop
  – iterating a step that approaches the intended solution by making the current approximation of the solution larger
  – until the solution stabilizes
• similar (for example): calculation of first- and follow-sets
• often realized as a worklist algo (a sketch follows below)
  – named after the central data structure containing the "work still to be done"
  – here possible: a worklist containing the nodes not yet treated wrt. liveness analysis (or DFA in general)
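One possible worklist formulation of global liveness analysis is sketched below in Python. It is an illustration under assumptions, not the script's official pseudo-code: succ and pred map a block to its successor and predecessor blocks, and gen and kill are the precomputed block-local effects discussed further below.

def global_liveness(blocks, succ, pred, gen, kill):
    in_live = {b: set() for b in blocks}      # start with the minimal (unsafe) estimate
    out_live = {b: set() for b in blocks}
    worklist = set(blocks)                    # all nodes are initially untreated
    while worklist:
        b = worklist.pop()                    # treatment order is in principle arbitrary
        out_live[b] = set().union(*(in_live[s] for s in succ[b]))
        new_in = gen[b] | (out_live[b] - kill[b])     # the block's transfer function
        if new_in != in_live[b]:              # estimate grew: revisit the predecessors
            in_live[b] = new_in
            worklist |= pred[b]
    return in_live, out_live

The loop terminates because the estimates only ever grow and are bounded by the finite set of variables; upon stabilization the result is independent of the chosen treatment order.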

Example

      a := 5
L1:   x := 8
      y := a + x
      if_true x = 0 goto L4
      z := a + x          // B3
      a := y + z
      if_false a = 0 goto L1
      a := a + 1          // B2
      y := 3 + x
L5:   a := x + y
      result := a + z
      return result       // B6
L4:   a := y + 8
      y := 3
      goto L5


CFG: initialization

Picture

• inLive and outLive: initialized to ∅ everywhere
• note: we start with the (most) unsafe estimation
• extra (return) node
• but: the analysis here is local per procedure only

Iterative algo

General schema

Initialization start with the “minimal” estimation (∅ everywhere)

Loop pick one node & update (= enlarge) liveness estimation in connection with that node

Until finish upon stabilization (= no further enlargement)

• order of treatment of nodes: in principle arbitrary⁶

• in tendency: following edges backwards
• comparison: for linear graphs (like inside a block):
  – no repeat-until-stabilization loop needed
  – one simple backward scan is enough

⁶ There may be more efficient and less efficient orders of treatment.


Liveness: run

Liveness example: remarks

• the shown traversal strategy is (cleverly) backwards
• the example resp. the example run is simplistic
• the loop (and the choice of "evaluation" order):

“harmless loop”

after having updated the outLive info for B1, following the edge from B3 to B1 backwards (propagating flow from B1 back to B3) does not increase the current solution for B3

• no need (in this particular order) to continue the iterative search for stabilization
• in other examples: loop iterations cannot be avoided
• note also: the end result (after stabilization) is independent of the evaluation order! (only some strategies may stabilize faster . . . )

In the script, the figure shows the end result of the global liveness analysis. In the slides, there is a "slide show" which shows step by step how the liveness information propagates (= "flows") through the graph. These step-by-step overlays, also for other examples, are not reproduced in the script.


Another, more interesting, example

Example remarks

• loop: this time it leads to updating the estimation more than once
• the evaluation order is not chosen ideally

Precomputing the block-local “liveness effects”

• precomputation of the relevant info: efficiency
• traditionally represented as kill and generate information
• here (for liveness; a sketch of the precomputation follows below):
  1. kill: variable instances which are overwritten
  2. generate: variables used in the block (before being overwritten)
  3. the rest: all other variables don't change their status

Constraint per basic block (transfer function)

inLive(B) = (outLive(B) \ kill(B)) ∪ generate(B)

• note:
  – the order of kill and generate in the above equation
  – a variable killed in a block may be "revived" in the same block
• simplest (one-line) example: x := x + 1
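A minimal Python sketch of this precomputation (same hypothetical triple representation of 3AIC lines as before):

def kill_and_generate(block):
    # kill: variables overwritten in the block
    # generate: variables used in the block before being overwritten there
    kill, generate = set(), set()
    for target, op1, op2 in block:            # one forward scan over the SLC
        for operand in (op1, op2):
            if operand not in kill:           # used before any local overwrite
                generate.add(operand)
        kill.add(target)                      # overwritten (possibly after being used)
    return kill, generate

For the one-line example x := x + 1, x ends up both in generate and in kill, so x counts as live on entry regardless of what outLive says.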

Order of kill and generate

As just remarked, one should keep in mind the order of kill and generate in the definition of transfer functions. In principle, one could also arrange the opposite order (interpreting kill and generate slightly differently). One can also define the so-called transfer function directly, without splitting it into kill and generate (for many, but not all, analyses such a separation into kill and generate functionality is possible and convenient). Indeed, using transfer functions (and kill and generate) works for many other data flow analyses as well, not just liveness analysis. Therefore, understanding liveness analysis basically amounts to having understood data flow analysis.
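Summarizing, for each basic block B the global analysis computes the smallest safe solution of the following constraints (the second one restates that outLive of a block depends on the inLive sets of its successor blocks):

    inLive(B)  =  generate(B) ∪ (outLive(B) \ kill(B))
    outLive(B) =  ⋃ { inLive(B′) | B′ a successor of B }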


Example once again: kill and gen


