TRANSCRIPT
1. Introduction

Programming languages are notations for describing computations to people and to machines. A program must be translated into a form in which it can be executed by a computer. The software systems that do this translation are known as compilers. This course is about how to design and implement compilers.
1.1. Language Processors

A compiler is a program that can read a program written in one language – the source language – and translate it into an equivalent program in another language – the target language (see Figure 1.1).
Figure 1.1: Compiler
A compiler also reports any errors in the source program that it detects during the translation process
If the target program is an executable machine-language program, it can then be called by the user to process input and produce output
Figure 1.2: A language processing system
[Figure 1.1 depicts: source program → Compiler → target program. Figure 1.2 depicts the pipeline: source program → Preprocessor → modified source program → Compiler → target assembly program → Assembler → relocatable machine code → Linker/Loader (combined with relocatable object files and library files) → target machine code.]
An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on input supplied by the user.
The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.
An interpreter can usually give better error diagnostics than a compiler, because it executes the source program statement by statement
Several other programs may be needed in addition to a compiler to create an executable program, as shown in Figure 1.2.
The task of a preprocessor (a separate program) is collecting modules of a program stored in separate files. It may also expand shorthands, called macros, into source language statements. The modified source program is fed to a compiler.
The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and easier to debug. The assembly-language program is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked with other relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory addresses, where the code in one file may refer to a location in another file. The loader then puts all the executable object files into memory for execution.
1.2. The Structure of a Compiler

Two parts are responsible for mapping a source program into a semantically equivalent target program: analysis and synthesis.
The analysis part breaks up the program into constituent pieces and imposes grammatical structure on them. It then uses this structure to create an intermediate representation of the source program. If the analysis part detects errors (syntactic or semantic), it provides informative messages. The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part, the back end.
The compilation process operates as a sequence of phases, each of which transforms one representation of the source program into another. A typical decomposition of a compiler into phases is shown in Figure 1.3. In practice several phases may be grouped together, and the intermediate representations between them need not be constructed explicitly.
Figure 1.3: Phases of a compiler
[Figure 1.3 shows the sequence: character stream → Lexical Analyzer → token stream → Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation → Machine-Independent Code Optimizer → intermediate representation → Code Generator → target-machine code → Machine-Dependent Code Optimizer → target-machine code, with the Symbol Table shared by all phases.]
The symbol table, which stores information about the entire source program, is used by all phases of the compiler
1.2.1. Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program and groups them into meaningful sequences called lexemes. For each lexeme the lexical analyzer produces a token of the form

<token-name, attribute-value>

that it passes on to the subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token.
For example, suppose a source program contains the following assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol table entry for position. The symbol table entry holds information about the identifier, such as its name and type.
2. = is a lexeme that is mapped into the token <=>. Since it needs no attribute value, the second component is omitted.
3. initial is a lexeme that would be mapped into a token <id, 2>, where 2 points to the symbol table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that would be mapped into a token <id, 3>, where 3 points to the symbol table entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
After lexical analysis, the sequence of tokens for the assignment (1.1) is

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively
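As a sketch of how a scanner could produce the token stream (1.2): the lexer below groups characters into lexemes and emits (token-name, attribute) pairs, entering identifiers into a symbol table. The regular expressions and helper names are illustrative assumptions, not part of the text.

```python
import re

# Illustrative token patterns; identifiers become ("id", entry) tokens,
# where entry is the 1-based index of the lexeme in the symbol table.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"\d+"),
    ("op",     r"[=+*]"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    symbol_table, tokens = [], []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace is stripped, not tokenized
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme,))      # operators need no attribute value
    return tokens, symbol_table

tokens, table = tokenize("position = initial + rate * 60")
print(tokens)
print(table)
```

Running it on the assignment (1.1) reproduces the shape of token stream (1.2): identifiers come back as <id, n> with n pointing into the symbol table.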
1.2.2. Syntax Analysis

The second phase of a compiler is syntax analysis or parsing. The parser uses the tokens produced by the lexical analyzer to create a tree-like intermediate representation – called a syntax tree – that depicts the grammatical structure of the token stream. In a syntax tree, each interior node represents an operator and the children of the node represent the arguments of the operation. The syntax tree for the token stream in (1.2) is shown in Figure 1.4.
Figure 1.4: Syntax tree

This tree shows the order in which the operations in the assignment (1.1) are to be performed: multiplication is done first, followed by addition, and finally assignment.
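The tree of Figure 1.4 can be sketched with nested tuples; a postorder walk (children before operator) yields exactly the evaluation order just described. The encoding is an illustrative assumption, not from the text.

```python
# Syntax tree for position = initial + rate * 60 as (operator, left, right)
# tuples; leaves are identifiers or constants.
tree = ("=", "position", ("+", "initial", ("*", "rate", 60)))

def postorder(node, out):
    """Visit children before the operator: the order operations are performed."""
    if isinstance(node, tuple):
        op, left, right = node
        postorder(left, out)
        postorder(right, out)
        out.append(op)
    return out

print(postorder(tree, []))   # ['*', '+', '=']: multiply, add, then assign
```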
1.2.3. Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used instead.
A language specification may permit type coercion: if a binary arithmetic operator is applied to an integer and a floating-point operand, the compiler may convert, or coerce, the integer into a floating-point number.
Suppose that position, initial, and rate have been declared to be floating-point numbers, and the lexeme 60 by itself forms an integer. The semantic analyzer first converts the integer 60 to a floating-point number before applying *.

1.2.4. Intermediate Code Generation
After syntax and semantic analysis, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine.
This intermediate representation should have two properties: it should be easy to produce and it should be easy to translate into the target machine language.
One of the intermediate representations, called three-address code, consists of assembly-like instructions with a maximum of three operands per instruction (or at most one operator on the right side of an assignment). Each operand can act like a register.
The output of the intermediate code generator can consist of the three-address code sequence
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3 (1.3)
The compiler must also generate temporary names to hold the values computed by three-address instructions.
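The tree walk that emits a sequence like (1.3) can be sketched as follows; the node encoding and helper names are illustrative assumptions, not from the text. Temporaries t1, t2, … are allocated as each operator is visited.

```python
# Sketch of three-address code generation from the syntax tree of (1.1),
# after the semantic analyzer has inserted the inttofloat coercion.
def gen(node, code, counter):
    """Emit three-address code for node; return the name holding its value."""
    if not isinstance(node, tuple):
        return str(node)                      # leaf: identifier or constant
    if node[0] == "inttofloat":               # unary coercion node
        arg = gen(node[1], code, counter)
        counter[0] += 1
        t = f"t{counter[0]}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen(left, code, counter)
    r = gen(right, code, counter)
    if op == "=":
        code.append(f"{l} = {r}")             # assignment needs no temporary
        return l
    counter[0] += 1
    t = f"t{counter[0]}"
    code.append(f"{t} = {l} {op} {r}")
    return t

tree = ("=", "id1", ("+", "id2", ("*", "id3", ("inttofloat", 60))))
code = []
gen(tree, code, [0])
print("\n".join(code))
```

On this tree the sketch reproduces sequence (1.3) instruction for instruction.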
1.2.5. Code Optimization

The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code or target code that consumes less power.
For example, a straightforward algorithm generates the intermediate code (1.3), using an instruction for each operator in the tree representation that comes from the semantic analyzer.
The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform (1.3) into the shorter sequence

t1 = id3 * 60.0
id1 = id2 + t1 (1.4)
1.2.6. Code Generation

The code generator takes as input the intermediate representation of the source program and maps it into the target language. If the target language is machine language, registers or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.
A crucial aspect of code generation is the judicious assignment of registers to hold variables. For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into the machine code

LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1 (1.5)
The first operand of each instruction specifies a destination, and the F in each instruction tells us that it deals with floating-point numbers. The code in (1.5) loads the contents of address id3 into register R2, then multiplies it by the floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address id1, so the code correctly implements the assignment statement (1.1).
1.2.7. Symbol-Table Management

An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name. These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value can be used), and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (e.g., by value or by reference), and the type returned.
The symbol table is a data structure containing a record for each variable name, with fields for attributes of the name
The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly
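A minimal sketch of such a symbol table, assuming a hash-table implementation; the record field names here are illustrative, not from the text.

```python
# One record per name; the dict gives expected constant-time lookup,
# and records accumulate attributes as phases learn more about a name.
class SymbolTable:
    def __init__(self):
        self.records = {}                  # name -> record of attributes

    def enter(self, name, **attrs):
        """Create the record for name if needed, then store the attributes."""
        record = self.records.setdefault(name, {"name": name})
        record.update(attrs)
        return record

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self.records.get(name)

table = SymbolTable()
table.enter("rate", type="float", scope="global")
table.enter("rate", line_first_seen=1)     # later phases add attributes
print(table.lookup("rate"))
```

Real compilers layer scoping on top of this (one table per scope, chained), but the record-per-name idea is the same.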
1.2.8. The Grouping of Phases into Passes

The discussion of phases deals with the logical organization of a compiler. In an implementation, activities from several phases may be grouped together into a pass that reads an input file and writes an output file. For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation may be grouped into one pass. Code optimization may be an optional pass. Then there could be a back-end pass consisting of code generation for a particular target machine.
1.2.9. Compiler Construction Tools

There are tools created to help implement the various phases of a compiler. These tools use specialized languages for specifying and implementing specific components, and many use quite sophisticated algorithms. Some commonly used compiler construction tools include:
1. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming language
2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of the language
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating intermediate code
4. Code generator generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine
5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of the program to each other part. Data-flow analysis is a key part of code optimization
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler
2. Lexical Analysis

2.1. The Role of the Lexical Analyzer

The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis.
The lexical analyzer also interacts with the symbol table, e.g., when the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. The interactions are suggested in Figure 2.1.
Figure 2.1: Interactions between the lexical analyzer and the parser
The following are additional tasks performed by the lexical analyzer beyond identifying lexemes:
- Stripping out comments and whitespace (blanks, newlines, and tabs)
- Correlating error messages generated by the compiler with the source program by keeping track of line numbers (using newline characters)
- Expanding macros, in some lexical analyzers

2.2. Tokens, Patterns, and Lexemes
When discussing lexical analysis, we use three related but distinct terms:
- A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a keyword, an identifier, etc. The token names are the input symbols that the parser processes.
- A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is the sequence of characters that forms the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
In many programming languages, the following classes cover most or all of the tokens:
- One token for each keyword. The pattern for a keyword is the same as the keyword itself
- Tokens for operators, either individually or in classes such as comparison (for the lexemes <, >, <=, >=, ==, !=)
- One token representing all identifiers
[Figure 2.1 shows the lexical analyzer reading the source program and exchanging token/getNextToken requests with the parser (which passes its result on to semantic analysis), with both components consulting the Symbol Table.]
- One or more tokens representing constants, such as numbers and literal strings
- Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon

2.3. Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched
This is done by using an attribute value that describes the lexeme represented by the token. The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
Practically, a token has one attribute: a pointer to the symbol table entry in which the information about the token is kept. The symbol table entry contains various information about the token, such as the lexeme, its type, the line number in which it was first seen, etc.
are written as: <id, pointer to symbol-table entry for E> <assign_op> <id, pointer to symbol-table entry for M> <mult_op> <id, pointer to symbol-table entry for C> <exp_op> <number, integer value 2>2.4. Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error
For example, if the string fi is encountered for the first time in a C program in the context

fi (a == f(x)) . . .

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier, since fi is a valid lexeme for the token identifier.
When an error occurs, the lexical analyzer recovers by:
- Skipping (deleting) successive characters from the remaining input until the lexical analyzer can find a well-formed token (known as panic-mode recovery)
- Deleting an extraneous character from the remaining input
- Inserting a missing character into the remaining input
- Replacing an incorrect character by a correct character
- Transposing two adjacent characters

2.5. Input Buffering
Here we look at some ways that the task of reading the source program can be sped up. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
We shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded, as suggested in Figure 2.2.
E = M * C * * 2 eof
Figure 2.2: Using a pair of input buffers

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
Two pointers into the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine
2. Pointer forward scans ahead until a pattern match is found
Once the next lexeme is determined, forward is set to the character at its right end (this may involve retracting). Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

Sentinels (guards)
If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine which character was read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
E = M * eof C * * 2 eof eof

Figure 2.3: Sentinels at the end of each buffer

The sentinel (guard) is a special character that cannot be part of the source program, and a natural choice is the character eof. Note that eof retains its use as a marker for the end of the entire input: an eof that appears anywhere other than at the end of a buffer means that the input is at an end. Figure 2.3 shows the same arrangement as Figure 2.2, but with the sentinels added.
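The sentinel test can be sketched as follows; the buffer size, reload logic, and names are simplified assumptions. The point is that the scanner makes one test per character – only when the sentinel is seen does it decide between "reload the other buffer" and "end of input".

```python
# Sketch of the two-buffer sentinel scheme. EOF stands for a value that
# cannot occur in the source text; N is a tiny buffer size for illustration.
EOF = "\0"
N = 4

def make_buffers(text):
    """Split text into N-sized chunks, each terminated by a sentinel."""
    chunks = [text[i:i + N] for i in range(0, len(text), N)] or [""]
    chunks[-1] += EOF                      # the real end of input
    return [c if c.endswith(EOF) else c + EOF for c in chunks]

def scan(text):
    buffers = make_buffers(text)
    buf, pos, out = 0, 0, []
    while True:
        ch = buffers[buf][pos]
        pos += 1
        if ch == EOF:                      # single test covers both cases
            if pos == len(buffers[buf]) and buf + 1 < len(buffers):
                buf, pos = buf + 1, 0      # sentinel at buffer end: reload
                continue
            break                          # eof inside the input: we are done
        out.append(ch)
    return "".join(out)

print(scan("E = M * C ** 2"))
```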
2.6. Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are effective in specifying those types of patterns that we need for tokens. Refer to the previous course (Formal Language Theory) to recap regular expressions.
Regular Definitions

For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1→r1
d2→r2
…
dn→rn

where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, . . . , di-1}
By restricting ri to ∑ and the previously defined d's, we avoid recursive definitions, and we can construct a regular expression over ∑ for each ri. We do so by first replacing uses of d1 in r2 (which cannot use any of the d's except for d1), then replacing uses of d1 and d2 in r3 by r1 and (the substituted) r2, and so on. Finally, in rn we replace each di, for i = 1, 2, . . ., n – 1, by the substituted version of ri, each of which has only symbols of ∑.
Example: C identifiers are strings of letters, digits, and underscores. Here is a regular definition:
letter_ → A | B | . . . | Z | a | b | . . . | z | _
digit → 0 | 1 | . . . | 9
id → letter_ ( letter_ | digit )*
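This regular definition transliterates directly into, for example, a Python regular expression; the sketch below is illustrative and not part of the text, with the two character classes playing the roles of letter_ and digit.

```python
import re

# letter_ and digit as character classes; \A and \Z anchor the whole string.
letter_ = r"[A-Za-z_]"
digit = r"[0-9]"
identifier = re.compile(rf"\A{letter_}(?:{letter_}|{digit})*\Z")

for candidate in ["rate_2", "_tmp", "2rate"]:
    print(candidate, bool(identifier.match(candidate)))
```

Only the first symbol is restricted to letter_; every later position may be a letter, digit, or underscore, exactly as in the definition of id.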
Extensions of Regular Expressions

Many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers:
1. One or more instances. The unary postfix operator + represents the positive closure of a regular expression and its language. The operator + has the same precedence and associativity as the operator *. Note that r* = r+ | λ and r+ = rr* = r*r
2. Zero or one instance. The unary postfix operator ?, means “zero or one occurrence”. That is r? is equivalent to r | λ. The ? operator has the same precedence and associativity as * and +
3. Character classes. A regular expression a1 | a2 | . . . | an, where ai’s are each symbols of the alphabet, can be replaced by the shorthand [a1a2 . . . an]. More importantly, when a1, a2, . . ., an form a logical sequence, e.g., consecutive uppercase letter, lowercase letters, or digits, we can replace them by a1-an, that is, just the first and the last separated by hyphen. Thus [abc] is shorthand for a | b | c, and [a-z] is shorthand for a | b | . . . | z
The following is a regular definition for unsigned numbers in Pascal (note that it uses the extensions of regular expressions):

digit → 0 | 1 | . . . | 9
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?

2.7. Recognition of Tokens
In the previous section we learned how to express patterns using regular expressions. Now, we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called “transition diagrams”
Transition diagrams have a collection of nodes called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another, and each edge is labeled by a symbol or a set of symbols. If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a (and perhaps by other symbols as well). If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Here we assume that our transition diagram is deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. Accepting states are indicated by a double circle, and if there is an action to be taken – typically returning a token and an attribute value to the parser – we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. Any number of *s can be attached, depending on the number of positions to retract.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start”, entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read
Figure 2.4 is a transition diagram that recognizes the lexemes matching the token relop in Pascal.
Being in state 0 (the initial state), if we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1 and look at the next character. If it is =, then we recognize the lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular operator. If in state 1 the next character is >, then instead we have the lexeme <>, and we enter state 3 to return an indication that the not-equals operator has been found. On any other character, the lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to indicate that we must retract the input one position.
On the other hand, if in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5.
The remaining possibility is that the first character is >. Then, we must enter state 6 and decide, on the basis of the next character, whether the lexeme is >= (if we next see the sign =) or just > (on any other character).
Note that if, in state 0, we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
Figure 2.4 Transition diagram for relop
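The diagram of Figure 2.4 can be sketched directly as code, one branch per state. The token name relop and the attributes LE, NE, LT, EQ, GE, GT follow the text; the function shape and remaining state numbers are illustrative assumptions.

```python
# Sketch of the relop transition diagram. States that would carry a *
# (retract) simply do not consume the extra lookahead character.
def relop(text, pos):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands in for end of input
    ch = peek(pos)
    if ch == "<":                                  # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2        # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2        # state 3
        return ("relop", "LT"), pos + 1            # state 4: retract one
    if ch == "=":
        return ("relop", "EQ"), pos + 1            # state 5
    if ch == ">":                                  # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2        # state 7
        return ("relop", "GT"), pos + 1            # state 8: retract one
    return None                                    # diagram does not apply

print(relop("a<=b", 1))
```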
Similarly, Figure 2.5 shows a transition diagram for unsigned numbers in Pascal.
Figure 2.5: Transition diagram for unsigned numbers in Pascal
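The pattern that Figure 2.5's diagram recognizes is the unsigned-number definition from Section 2.6; as an illustrative sketch (not part of the text), it can also be checked with a Python regular expression.

```python
import re

# digits(.digits)?(E[+-]?digits)? with non-capturing groups; the dot
# must be escaped so it matches only a literal decimal point.
number = re.compile(r"\A[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?\Z")

for lexeme in ["60", "3.14", "6.02E23", "1E-4", ".5", "1."]:
    print(lexeme, bool(number.match(lexeme)))
```

Note that, exactly as in the transition diagram, a fraction must have digits on both sides of the point, so .5 and 1. are rejected.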
3. Syntax Analysis

By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs. The syntax of programming language constructs can be specified by context-free grammars or BNF notation (both are discussed in the previous course).
The use of CFGs has several advantages:
- a grammar helps in identifying ambiguities
- a grammar gives a precise yet easy to understand syntactic specification of a programming language
- it is possible to have a tool which automatically produces a parser from the grammar
- a properly designed grammar helps in modifying the parser easily when the language changes

3.1. The Role of a Parser
In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Figure 3.1, and verifies that the string of token names can be generated by the grammar for the source language. It is expected that the parser reports any syntax errors in an intelligible fashion and recovers from commonly occurring errors to continue processing the remainder of the program.
Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing
The methods commonly used in compilers can be classified as being either top-down or bottom-up. Top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time.
[Figure 3.1 shows the parser's position in a compiler model: source program → Lexical Analyzer → (token / getNextToken exchange) → Parser → parse tree → Rest of Front End → intermediate representation, with the Symbol Table available to all components.]

Figure 3.1: Position of parser in a compiler model

3.2. Syntax Error Handling
Many errors that occur during compilation appear to be syntactic. The error handler in a parser has goals that are simple to state but challenging to realize:
- Report the presence of errors clearly and accurately
- Recover from each error quickly enough to detect subsequent errors
- Add minimal overhead to the processing of correct programs
Fortunately, common errors are simple ones, and a relatively straightforward error-handling mechanism often suffices.
The error handler should report:
- the place in the source program where the error is detected
- the type of error (if possible)
Although no error-recovery strategy has proven itself universally acceptable, a few methods have broad applicability.
There are four main strategies in error handling:
- Panic-Mode Recovery: On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found (usually delimiters, such as }, whose role in the source program is clear and unambiguous). While panic-mode recovery often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity and is guaranteed not to go into an infinite loop.
- Phrase-Level Recovery: On discovering an error, a parser may perform local correction on the remaining input such that the parser is able to continue. A typical local correction is to replace a comma by a semicolon, insert a missing semicolon, etc. We should be careful to avoid going into an infinite loop. Phrase-level replacement has been used in several error-repairing compilers. Its major drawback is the difficulty it has in coping with situations in which the actual error occurred before the point of detection.
- Error Productions: By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.
- Global Correction: Ideally, we would like to make as few changes as possible in processing an incorrect input string. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.
3.3. FIRST and FOLLOW

The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G.
During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol
During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens
FIRST(α), for any string of grammar symbols α, is the set of terminals that begin strings derived from α. If α => λ in zero or more steps, then λ is also in FIRST(α)
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or λ can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) = {X}
2. If X is a nonterminal and X→Y1Y2 . . . Yk is a production for some k ≥ 1, then place a terminal a in FIRST(X) if, for some i, a is in FIRST(Yi) and λ is in all of FIRST(Y1), . . ., FIRST(Yi-1); that is, Y1 . . . Yi-1 => λ in zero or more steps. If λ is in FIRST(Yj) for all j = 1, 2, . . ., k, then add λ to FIRST(X)
3. If X→λ is a production, then add λ to FIRST(X)
Now, we can compute FIRST for any string X1X2 . . . Xn as follows:
Add to FIRST(X1X2 . . . Xn) all non-λ symbols of FIRST(X1). Also add the non-λ symbols of FIRST(X2) if λ is in FIRST(X1); the non-λ symbols of FIRST(X3) if λ is in both FIRST(X1) and FIRST(X2); and so on. Finally, add λ to FIRST(X1X2 . . . Xn) if, for all i, λ is in FIRST(Xi)
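As an illustration, the fixed-point computation above can be sketched in Python for the expression grammar used throughout this section (the dictionary encoding and function names are our own):

```python
# Fixed-point computation of FIRST for the Section 3.3 expression
# grammar.  LAMBDA stands for λ; a body [LAMBDA] is a λ-production.
LAMBDA = 'λ'
grammar = {            # nonterminal -> list of production bodies
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], [LAMBDA]],
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], [LAMBDA]],
    'F':  [['(', 'E', ')'], ['id']],
}
nonterminals = set(grammar)

def first_of_symbol(X, first):
    # A terminal is its own FIRST set (rule 1).
    return first[X] if X in nonterminals else {X}

def first_of_string(symbols, first):
    """FIRST of a string X1 X2 ... Xn, per the string rule above."""
    result = set()
    for X in symbols:
        fx = first_of_symbol(X, first)
        result |= fx - {LAMBDA}
        if LAMBDA not in fx:           # Xi cannot derive λ: stop here
            return result
    result.add(LAMBDA)                 # every Xi can derive λ
    return result

first = {A: set() for A in nonterminals}
changed = True
while changed:                         # iterate until no FIRST set grows
    changed = False
    for A, bodies in grammar.items():
        for body in bodies:
            symbols = [] if body == [LAMBDA] else body
            new = first_of_string(symbols, first)
            if not new <= first[A]:
                first[A] |= new
                changed = True
```

Running this yields FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E') = {+, λ}, and FIRST(T') = {*, λ}.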
FOLLOW(A), for a nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker
2. If there is a production A→αBβ, then everything in FIRST(β) except λ is in FOLLOW(B)
3. If there is a production A→αB, or a production A→αBβ where FIRST(β) contains λ, then everything in FOLLOW(A) is in FOLLOW(B)
Compute FIRST and FOLLOW of the nonterminals in the following grammar:
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → (E) | id

3.4. Top-Down Parsing
Top-down parsing can be viewed as the problem of constructing a parse tree for an input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first)
Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string
At each step of a top-down parse, the key problem is that of determining the production to be applied for a nonterminal, say A. Once an A-production is chosen, the rest of parsing consists of "matching" the terminal symbols in the production body with the input string
3.4.1. Recursive-Descent Parsing
This method of top-down parsing can be considered as an attempt to find the leftmost derivation for an input string
It may involve backtracking. To construct the parse tree using recursive-descent parsing:
1. We create a one-node tree consisting of S
2. Two pointers, one for the tree and one for the input, will be used to indicate where the parsing process is. Initially, they will be on S and the first input symbol, respectively
3. Then we use the first S-production to expand the tree. The tree pointer will be positioned on the leftmost symbol of the newly created sub-tree
4. When the symbol pointed to by the tree pointer matches the symbol pointed to by the input pointer, both pointers are moved to the right
5. Whenever the tree pointer points to a non-terminal, we expand it using the first production of that non-terminal
6. Whenever the pointers point to different terminals, the production that was used is not correct, so another production should be used: we go back to the step just before we replaced the non-terminal and use another production
7. If we reach the end of the input and the tree pointer passes the last symbol of the tree, we have finished parsing
Example: Draw a parse tree for the input string cad using the following grammar:
S → cAd
A → ab | a
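One possible, deliberately simplified sketch of this backtracking parse: each function tries the productions in order and falls back to the next one on failure. (A general backtracking parser must also re-try earlier choices when a later match fails; this grammar does not need that, so the sketch keeps the fallback local to A.)

```python
# Backtracking recursive-descent recognition of the grammar
#   S -> cAd      A -> ab | a
# Each function returns the input position reached, or None on failure;
# trying A -> ab first and then falling back to A -> a is the
# "go back and use another production" step for this grammar.
def parse_A(s, pos):
    if s[pos:pos + 2] == 'ab':         # first A-production: A -> ab
        return pos + 2
    if s[pos:pos + 1] == 'a':          # backtrack: try A -> a instead
        return pos + 1
    return None

def parse_S(s, pos=0):
    if s[pos:pos + 1] != 'c':          # match the leading c
        return None
    after_A = parse_A(s, pos + 1)      # expand A
    if after_A is None or s[after_A:after_A + 1] != 'd':
        return None                    # match the trailing d
    return after_A + 1

def parses(s):
    return parse_S(s) == len(s)        # all input must be consumed
```

On the input cad, the attempt A → ab fails (the input has "ad"), so the parser backs up and succeeds with A → a, exactly as in the worked example.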
3.4.2. Nonrecursive Predictive Parsing
A non-recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls
This method, shown in Figure 3.2, uses a parsing table that determines the next production to be applied
The input buffer contains the string to be parsed followed by $ (the right end marker)
The stack contains a sequence of grammar symbols with $ at the bottom. Initially, the stack contains the start symbol of the grammar on top of $
The parsing table is a two-dimensional array M[X, a] where X is a nonterminal of the grammar and a is a terminal or $
Figure 3.2: Model of a table-driven predictive parser
The parser program behaves as follows. The program always considers X, the symbol on top of the stack, and a, the current input symbol. There are three possibilities:
1. X = a = $: the parser halts and announces a successful completion of parsing
2. X = a ≠ $: the parser pops X off the stack and advances the input pointer to the next symbol
3. X is a nonterminal: the program consults entry M[X, a], which can be an X-production or an error entry. If M[X, a] = {X→Y1Y2 . . . Yk}, X on top of the stack will be replaced by Y1Y2 . . . Yk (Y1 at the top of the stack). As an output, any code associated with the X-production can be executed. If M[X, a] = error, the parser calls the error recovery method
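The three cases above can be sketched as a small table-driven parser for the Section 3.3 grammar (a minimal sketch; Table 3.1 is hard-coded here, and an empty body encodes a λ-production):

```python
# Table-driven predictive parser for the Section 3.3 grammar.
# M is Table 3.1; missing (X, a) pairs are the error entries.
M = {
    ('E', 'id'): ['T', "E'"],      ('E', '('): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
    ('T', 'id'): ['F', "T'"],      ('T', '('): ['F', "T'"],
    ("T'", '+'): [], ("T'", '*'): ['*', 'F', "T'"],
    ("T'", ')'): [], ("T'", '$'): [],
    ('F', 'id'): ['id'],           ('F', '('): ['(', 'E', ')'],
}
nonterminals = {'E', "E'", 'T', "T'", 'F'}

def parse(tokens):
    stack = ['$', 'E']                 # start symbol on top of $
    tokens = tokens + ['$']
    i, output = 0, []
    while True:
        X, a = stack[-1], tokens[i]
        if X == a == '$':
            return output              # case 1: successful completion
        if X == a:                     # case 2: pop and advance input
            stack.pop(); i += 1
        elif X in nonterminals and (X, a) in M:
            stack.pop()                # case 3: replace X by the body
            body = M[(X, a)]
            stack.extend(reversed(body))   # leftmost body symbol on top
            output.append((X, body))   # "output" the production used
        else:
            raise SyntaxError(f'error at token {a!r}')
```

On id+id*id this returns the eleven productions of the leftmost derivation, matching the trace shown in Figure 3.3 below.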
Construction of a predictive parsing table
Follow these steps to construct the parsing table. For each production A→α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A→α to M[A, a]
2. If λ is in FIRST(α), then for each terminal b in FOLLOW(A), add A→α to M[A, b]. If λ is in FIRST(α) and $ is in FOLLOW(A), add A→α to M[A, $] as well
If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to error (which we normally represent by an empty entry in the table)
Table 3.1 is a parsing table for the grammar shown in Section 3.3 constructed using the above algorithm
Table 3.1: Parsing table M for the grammar in Section 3.3
             Input Symbol
Nonterminal  id      +         *         (      )      $
E            E→TE'                       E→TE'
E'                   E'→+TE'                   E'→λ   E'→λ
T            T→FT'                       T→FT'
T'                   T'→λ      T'→*FT'         T'→λ   T'→λ
F            F→id                        F→(E)
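The two construction rules above can be sketched as follows, with the FIRST and FOLLOW sets of the Section 3.3 grammar hard-coded (a minimal sketch; names are illustrative):

```python
# Building Table 3.1 from rules 1 and 2 above.  FIRST/FOLLOW are the
# sets computed for the Section 3.3 grammar; λ marks the empty string.
LAMBDA = 'λ'
productions = [
    ('E', ['T', "E'"]),
    ("E'", ['+', 'T', "E'"]), ("E'", [LAMBDA]),
    ('T', ['F', "T'"]),
    ("T'", ['*', 'F', "T'"]), ("T'", [LAMBDA]),
    ('F', ['(', 'E', ')']), ('F', ['id']),
]
FIRST = {'E': {'(', 'id'}, "E'": {'+', LAMBDA}, 'T': {'(', 'id'},
         "T'": {'*', LAMBDA}, 'F': {'(', 'id'},
         '+': {'+'}, '*': {'*'}, '(': {'('}, ')': {')'},
         'id': {'id'}, LAMBDA: {LAMBDA}}
FOLLOW = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
          "T'": {'+', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def first_of_string(symbols):
    result = set()
    for X in symbols:
        result |= FIRST[X] - {LAMBDA}
        if LAMBDA not in FIRST[X]:
            return result
    result.add(LAMBDA)
    return result

M = {}
for A, alpha in productions:
    fa = first_of_string(alpha)
    lookaheads = fa - {LAMBDA}             # rule 1: terminals in FIRST(α)
    if LAMBDA in fa:                       # rule 2: also FOLLOW(A)
        lookaheads |= FOLLOW[A]            # FOLLOW already contains $
    for a in lookaheads:
        assert (A, a) not in M, 'grammar would not be LL(1)'
        M[(A, a)] = alpha
```

The 13 entries produced are exactly the non-blank entries of Table 3.1; every remaining (nonterminal, terminal) pair is an error entry.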
Figure 3.3 shows moves made by a predictive parser on input id+id*id
Matched      Stack      Input        Action
             E$         id+id*id $
             TE'$       id+id*id $   output E→TE'
             FT'E'$     id+id*id $   output T→FT'
             idT'E'$    id+id*id $   output F→id
id           T'E'$      +id*id $     match id
id           E'$        +id*id $     output T'→λ
id           +TE'$      +id*id $     output E'→+TE'
id+          TE'$       id*id $      match +
id+          FT'E'$     id*id $      output T→FT'
id+          idT'E'$    id*id $      output F→id
id+id        T'E'$      *id $        match id
id+id        *FT'E'$    *id $        output T'→*FT'
id+id*       FT'E'$     id $         match *
id+id*       idT'E'$    id $         output F→id
id+id*id     T'E'$      $            match id
id+id*id     E'$        $            output T'→λ
id+id*id     $          $            output E'→λ
Figure 3.3: Moves made by a predictive parser on input id+id*id

3.4.3. LL(1) Grammars
Predictive parsers, that is, recursive descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1)
The first “L” in LL(1) stands for scanning the input from left to right, the second “L” for producing the leftmost derivation, and “1” for using one input symbol of lookahead at each step to make parsing action decisions
The class of LL(1) grammars is rich enough to cover most programming constructs, although care is needed in writing a suitable grammar for the source language
For example, no left-recursive or ambiguous grammar can be LL(1)
A grammar G is LL(1) if and only if whenever A→α | β are two distinct productions of G, the following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a
2. At most one of α and β can derive the empty string
3. If β => λ in zero or more steps, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α => λ in zero or more steps, then β does not derive any string beginning with a terminal in FOLLOW(A)
The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are disjoint sets
The third condition is equivalent to stating that if λ is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if λ is in FIRST(α)
Note: an LL(1) grammar is a grammar for which the parsing table has no multiply-defined entries
Example: Let a grammar G be given by:
S → iEtSS' | a
S' → eS | λ
E → b
Is G an LL(1) grammar?
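One way to answer is to apply the table-construction rules and look for a multiply-defined entry. A sketch (FIRST and FOLLOW are computed by hand; since every body of this grammar starts with a terminal or is λ, a one-symbol FIRST lookup suffices):

```python
# LL(1) check for  S -> iEtSS' | a,  S' -> eS | λ,  E -> b.
# A conflict (multiply-defined entry) means the grammar is not LL(1).
LAMBDA = 'λ'
productions = [
    ('S', ['i', 'E', 't', 'S', "S'"]), ('S', ['a']),
    ("S'", ['e', 'S']), ("S'", [LAMBDA]),
    ('E', ['b']),
]
# Hand-computed sets for this grammar:
FIRST = {'S': {'i', 'a'}, "S'": {'e', LAMBDA}, 'E': {'b'}}
FOLLOW = {'S': {'e', '$'}, "S'": {'e', '$'}, 'E': {'t'}}
terminals_of = lambda X: FIRST.get(X, {X})   # a terminal is its own FIRST

conflicts, table = [], {}
for A, alpha in productions:
    fa = set(terminals_of(alpha[0]))         # bodies start with a terminal or λ
    lookaheads = (fa - {LAMBDA}) | (FOLLOW[A] if LAMBDA in fa else set())
    for a in lookaheads:
        if (A, a) in table:
            conflicts.append((A, a))         # multiply-defined entry
        else:
            table[(A, a)] = alpha
```

The single conflict, at M[S', e], is the dangling-else problem: on lookahead e both S'→eS and S'→λ apply (e is in FIRST(eS) and in FOLLOW(S')), so condition 3 fails and G is not LL(1).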
3.4.4. Error Recovery in Predictive Parsing
An error is detected in predictive parsing when the terminal on top of the stack does not match the next input symbol, or when there is a nonterminal A on top of the stack, a is the next input symbol, and M[A, a] = error
Panic Mode
Panic-mode recovery is based on the idea of skipping over symbols on the input until a token in a selected set of synchronizing tokens appears
Its effectiveness depends on the choice of synchronizing set
The sets should be chosen so that the parser recovers quickly from the errors that are likely to occur in practice
Some heuristics are as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue
2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if semicolons terminate statements, then keywords that begin statements may not appear in the FOLLOW set of the nonterminal generating expressions. A missing semicolon after an assignment may therefore result in the keyword beginning the next statement being skipped. Often, there is a hierarchical structure on constructs in a language; e.g., expressions appear within statements, which appear within blocks, and so on. We can add to the synchronizing set of a lower-level construct the symbols that begin higher level constructs. For example, we might add keywords that begin statements to synchronizing sets for the non-terminals generating expressions
3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input
4. If nonterminals can generate the empty string, then the production deriving λ can be used as a default. Doing so may postpone some error detection, but cannot cause an error to be missed. This approach reduces the number of nonterminals that have to be considered during error recovery
5. If the terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue a message that the terminal was inserted, and continue parsing. In effect, this approach takes the synchronizing set of a token to consist of all other tokens
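Heuristics 1 and 5 can be grafted onto the table-driven parser as a sketch (the table and the FOLLOW-based synchronizing sets are those of the Section 3.3 grammar; the error-message wording is our own):

```python
# Predictive parsing with panic-mode recovery, Section 3.3 grammar.
# SYNC is FOLLOW(A) for each nonterminal A (heuristic 1).
M = {('E', 'id'): ['T', "E'"], ('E', '('): ['T', "E'"],
     ("E'", '+'): ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
     ('T', 'id'): ['F', "T'"], ('T', '('): ['F', "T'"],
     ("T'", '+'): [], ("T'", '*'): ['*', 'F', "T'"],
     ("T'", ')'): [], ("T'", '$'): [],
     ('F', 'id'): ['id'], ('F', '('): ['(', 'E', ')']}
SYNC = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
        "T'": {'+', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def parse_with_recovery(tokens):
    stack, tokens, i, errors = ['$', 'E'], tokens + ['$'], 0, []
    while stack[-1] != '$':
        X, a = stack[-1], tokens[i]
        if X not in SYNC:                     # X is a terminal
            if X == a:
                stack.pop(); i += 1
            else:                             # heuristic 5: pop it and
                errors.append(f'inserted missing {X!r}')
                stack.pop()                   # report it as "inserted"
        elif (X, a) in M:
            stack.pop()
            stack.extend(reversed(M[(X, a)]))
        else:                                 # heuristic 1: skip input
            errors.append(f'skipping to a token that can follow {X}')
            while tokens[i] not in SYNC[X]:   # $ is in every SYNC set,
                i += 1                        # so this always terminates
            stack.pop()
    return errors                             # a full version would also
                                              # check all input was used
```

On a correct input the recovery branches are never taken; on id + + id the parser reports one error, pops the offending T, and finishes the rest of the input normally.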
Phrase-Level Recovery
Phrase-level error recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines
These routines may change, insert, or delete symbols on the input and issue appropriate error messages
They may also pop from the stack
3.5. Bottom-Up Parsing
Bottom-up parsing corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top)
We will see a general style of bottom-up parsing called shift-reduce parsing
3.5.1. Reductions
We can think of the shift-reduce parsing process as one of "reducing" a string w to the start symbol of the grammar
At each reduction step, a particular substring matching the right side of a production (the body) is replaced by the symbol on the left of that production (the head), and if the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse
The key decisions in bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds
Example: Consider the grammar
S → aABe
A → Abc | b
B → d
The sentence abbcde can be reduced to S by the following steps: abbcde, aAbcde, aAde, aABe, S
3.5.2. Handle Pruning
Bottom-up parsing during a left-to-right scan of the input constructs a rightmost derivation in reverse
Informally, a "handle" is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation
Example: Consider the following grammar
E → E+T | T
T → T*F | F
F → (E) | id
The following steps show the reductions of id*id to E: id*id, F*id, T*id, T*F, T, E
Adding subscripts to the tokens id for clarity, the handles during the parse of id1*id2 are shown in Figure 3.4
Although T is the body of the production E→T, the symbol T is not a handle in the sentential form T*id2
If T were indeed replaced by E, we would get E*id2, which cannot be derived from the start symbol E. Thus, the leftmost substring that matches the body of some production need not be a handle
Right Sentential Form   Handle   Reducing Production
id1*id2                 id1      F→id
F*id2                   F        T→F
T*id2                   id2      F→id
T*F                     T*F      T→T*F
T                       T        E→T
Figure 3.4: Handles during a parse of id1*id2
Formally, if S =>*rm αAw =>rm αβw, then the production A→β in the position following α is a handle of αβw
Alternatively, a handle of a right sentential form γ is a production A→β and a position of γ where the string β may be found, such that replacing β at that position by A produces the previous right sentential form in the rightmost derivation of γ
Notice that the string w to the right of the handle must contain only terminal symbols
For convenience, we refer to the body β rather than A→β as a handle
We say "a handle" rather than "the handle" because the grammar can be ambiguous
A rightmost derivation in reverse can be obtained by "handle pruning"
3.5.3. Shift-Reduce Parsing
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed
As we shall see, the handle always appears at the top of the stack just before it is identified as the handle
$ is used to mark the bottom of the stack and also the right end of the input
Initially, the stack is empty and the string w is on the input, as follows:
Stack    Input
$        w$
During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack
It then reduces β to the head of the appropriate production
The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty:
Stack    Input
$E       $
Upon entering this configuration, the parser halts and announces successful completion of parsing
Figure 3.5 steps through the actions a shift-reduce parser might take in parsing the input string id1*id2
Stack      Input        Action
$          id1*id2 $    shift
$ id1      *id2 $       reduce by F→id
$ F        *id2 $       reduce by T→F
$ T        *id2 $       shift
$ T*       id2 $        shift
$ T*id2    $            reduce by F→id
$ T*F      $            reduce by T→T*F
$ T        $            reduce by E→T
$ E        $            accept
Figure 3.5: Configuration of a shift-reduce parser on input id1*id2
There are four possible actions a shift-reduce parser can make: shift, reduce, accept, and error
1. Shift. Shift the next input symbol onto the top of the stack
2. Reduce. The parser knows the right end of the handle is at the top of the stack. It should then decide what non-terminal should replace that substring
3. Accept. Announce successful completion of parsing
4. Error. Discover a syntax error and call an error recovery routine
3.5.4. Conflicts during Shift-Reduce Parsing
There are CFGs for which shift-reduce parsing cannot be used
Every shift-reduce parser for such a grammar can reach a configuration in which the parser, knowing the entire stack and also the next k input symbols, cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make (a reduce/reduce conflict)
Technically, these grammars are not in the LR(k) class of grammars and are referred to as non-LR
The k in LR(k) refers to the number of symbols of lookahead on the input
3.5.5. Introduction to LR Parsing
The most prevalent type of bottom-up parsers today is based on a concept called LR(k) parsing; the “L” is for left-to-right scanning of the input, the “R” for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead that are used in making parsing decisions
The cases k = 0 or k = 1 are of practical interest, and we shall only consider LR parsers with k ≤ 1 here. When (k) is omitted, k is assumed to be 1
This section introduces the basic concepts of LR parsing and the easiest method for constructing shift-reduce parsers, called "simple LR" (or SLR for short)
Later, two more complex methods – canonical LR and LALR – that are used in the majority of LR parsers are introduced
The LR-Parsing Algorithm
A schematic representation of an LR parser is shown in Figure 3.6. It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto)
The driver program is the same for all LR parsers; only the parsing table changes from one parser to another
The parsing program reads characters from an input buffer one at a time
The program uses a stack to store a string of the form s0X1s1X2 . . . Xmsm, where sm is on top
Each Xi is a grammar symbol and each si is a symbol called a state
Each state symbol summarizes the information contained in the stack below it, and the combination of the state symbol on top of the stack and the current input symbol are used to index the parsing table and determine the shift-reduce parsing decision
Figure 3.6: Model of an LR parser
The parsing table consists of two parts: a parsing-action function action and a goto function goto
The program driving the LR parser behaves as follows: it determines sm, the state currently on top of the stack, and ai, the current input symbol
It then consults action[sm, ai], the parsing action table entry for state sm and input ai, which can have one of four values:
1. shift s, where s is a state
2. reduce by a grammar production A→β
3. accept, and
4. error
The function goto takes a state and grammar symbol as arguments and produces a state
A configuration of an LR parser is a pair whose first component is the stack contents and whose second component is the unexpended input:
(s0X1s1X2s2 . . . Xmsm, aiai+1 . . . an$)
This configuration represents the right-sentential form X1X2 . . . Xm aiai+1 . . . an in essentially the same way as a shift-reduce parser would; only the presence of states on the stack is new
The next move of the parser is determined by reading ai, the current input symbol, and sm, the state on top of the stack, and then consulting the parsing action table entry action[sm, ai]
The configurations resulting after each of the four types of move are as follows:
1. If action[sm, ai] = shift s, the parser executes a shift move, entering the configuration
(s0X1s1X2s2 . . . Xmsmais, ai+1 . . . an$)
Here the parser has shifted both the current input symbol ai and the next state s, which is given in action[sm, ai], onto the stack; ai+1 becomes the current input symbol
2. If action[sm, ai] = reduce A→β, then the parser executes a reduce move, entering the configuration
(s0X1s1X2s2 . . . Xm-rsm-rAs, aiai+1 . . . an$)
where s = goto[sm-r, A] and r is the length of β, the right side of the production
Here the parser first popped 2r symbols off the stack (r state symbols and r grammar symbols), exposing state sm-r
The parser then pushed both A, the left side of the production, and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is not changed
3. If action[sm, ai] = accept, parsing is completed
4. If action[sm, ai] = error, the parser has discovered an error and calls an error recovery routine
        action                              goto
State   id    +    *    (    )    $         E    T    F
0       s5              s4                  1    2    3
1             s6                  acc
2             r2   s7        r2   r2
3             r4   r4        r4   r4
4       s5              s4                  8    2    3
5             r6   r6        r6   r6
6       s5              s4                       9    3
7       s5              s4                            10
8             s6             s11
9             r1   s7        r1   r1
10            r3   r3        r3   r3
11            r5   r5        r5   r5
Figure 3.7: Parsing table for the expression grammar
1. si means shift and stack state i
2. rj means reduce by the production numbered j
3. acc means accept
4. blank means error
Figure 3.7 shows the parsing action and goto functions of an LR parsing table for arithmetic expressions with binary operators + and *:
1. E→E+T
2. E→T
3. T→T*F
4. T→F
5. F→(E)
6. F→id
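The driver loop can be sketched with the Figure 3.7 table hard-coded (a minimal sketch; tracking only the states on the stack is enough, since the table is never indexed by the grammar symbols Xi):

```python
# LR driver for the expression grammar, table of Figure 3.7.
# 's5' = shift state 5, 'r3' = reduce by production 3, 'acc' = accept.
PRODS = {1: ('E', 3), 2: ('E', 1), 3: ('T', 3),   # head, body length r
         4: ('T', 1), 5: ('F', 3), 6: ('F', 1)}
ACTION = {
    (0, 'id'): 's5', (0, '('): 's4',
    (1, '+'): 's6', (1, '$'): 'acc',
    (2, '+'): 'r2', (2, '*'): 's7', (2, ')'): 'r2', (2, '$'): 'r2',
    (3, '+'): 'r4', (3, '*'): 'r4', (3, ')'): 'r4', (3, '$'): 'r4',
    (4, 'id'): 's5', (4, '('): 's4',
    (5, '+'): 'r6', (5, '*'): 'r6', (5, ')'): 'r6', (5, '$'): 'r6',
    (6, 'id'): 's5', (6, '('): 's4',
    (7, 'id'): 's5', (7, '('): 's4',
    (8, '+'): 's6', (8, ')'): 's11',
    (9, '+'): 'r1', (9, '*'): 's7', (9, ')'): 'r1', (9, '$'): 'r1',
    (10, '+'): 'r3', (10, '*'): 'r3', (10, ')'): 'r3', (10, '$'): 'r3',
    (11, '+'): 'r5', (11, '*'): 'r5', (11, ')'): 'r5', (11, '$'): 'r5',
}
GOTO = {(0, 'E'): 1, (0, 'T'): 2, (0, 'F'): 3,
        (4, 'E'): 8, (4, 'T'): 2, (4, 'F'): 3,
        (6, 'T'): 9, (6, 'F'): 3, (7, 'F'): 10}

def lr_parse(tokens):
    tokens = tokens + ['$']
    stack, i, output = [0], 0, []        # stack of states only
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            raise SyntaxError(f'error at {tokens[i]!r}')
        if act == 'acc':
            return output                # productions, in reduction order
        if act[0] == 's':                # shift: push state, advance input
            stack.append(int(act[1:])); i += 1
        else:                            # reduce: pop r states, then goto
            n = int(act[1:])
            head, r = PRODS[n]
            del stack[len(stack) - r:]
            stack.append(GOTO[(stack[-1], head)])
            output.append(n)
```

On id*id+id the sketch emits the reduction sequence 6, 4, 6, 3, 2, 6, 4, 1, i.e. exactly the reduce moves of Figure 3.8 below.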
On input id*id+id, the sequence of stack and input contents is shown in Figure 3.8
     Stack               Input        Action
(1)  0                   id*id+id $   shift
(2)  0 id 5              *id+id $     reduce by F→id
(3)  0 F 3               *id+id $     reduce by T→F
(4)  0 T 2               *id+id $     shift
(5)  0 T 2 * 7           id+id $      shift
(6)  0 T 2 * 7 id 5      +id $        reduce by F→id
(7)  0 T 2 * 7 F 10      +id $        reduce by T→T*F
(8)  0 T 2               +id $        reduce by E→T
(9)  0 E 1               +id $        shift
(10) 0 E 1 + 6           id $         shift
(11) 0 E 1 + 6 id 5      $            reduce by F→id
(12) 0 E 1 + 6 F 3       $            reduce by T→F
(13) 0 E 1 + 6 T 9       $            reduce by E→E+T
(14) 0 E 1               $            accept
Figure 3.8: Moves of an LR parser on id*id+id
Constructing SLR Parsing Tables
This method is the simplest of the three methods used to construct an LR parsing table
It is called SLR (simple LR) because it is the easiest to implement
However, it is also the weakest in terms of the number of grammars for which it succeeds
A parsing table constructed by this method is called an SLR table
A grammar for which an SLR table can be constructed is said to be an SLR grammar
LR(0) Item
An LR(0) item (item for short) is a production of a grammar G with a dot at some position of the right side
For example, the production A→XYZ yields the four items:
A→.XYZ
A→X.YZ
A→XY.Z
A→XYZ.
For the production A→λ we have only one item:
A→.
An item indicates what part of a production we have seen and what we hope to see
The central idea in the SLR method is to construct, from the grammar, a deterministic finite automaton to recognize viable prefixes
A viable prefix is a prefix of a right sentential form that can appear on the stack of a shift-reduce parser
If you have a viable prefix on the stack, it is possible that further input will reduce to the start symbol
If you don't have a viable prefix on top of the stack, you can never reach the start symbol; therefore you have to call the error recovery procedure
One collection of sets of LR(0) items, which we call the canonical LR(0) collection, provides the basis for constructing SLR parsers
To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions: closure and goto
If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and production S'→S
The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input; i.e., acceptance occurs when and only when the parser is about to reduce by S'→S
The Closure Operation
If I is a set of items of G, then closure(I) is the set of items constructed by two rules:
1. Initially, every item in I is added to closure(I)
2. If A→α.Bβ is in closure(I) and B→γ is a production, then add B→.γ to closure(I). This rule is applied until no more new items can be added to closure(I)
Intuitively, A→α.Bβ in closure(I) indicates that, at some point in the parsing process, we think we might next see a substring derivable from Bβ as input
If B→γ is a production, we also expect we might see a substring derivable from γ at this point
For this reason we also include B→.γ in closure(I)
Example: Consider the augmented expression grammar (G1'):
E'→E
E→E+T
E→T
T→T*F
T→F
F→(E)
F→id
If I is the set of one item {[E'→.E]}, then closure(I) contains the items:
E'→.E
E→.E+T
E→.T
T→.T*F
T→.F
F→.(E)
F→.id
The Goto Operation
The second useful function is goto(I, X), where I is a set of items and X is a grammar symbol
goto(I, X) is defined as the closure of the set of all items [A→αX.β] such that [A→α.Xβ] is in I
Intuitively, if I is the set of items that are valid for some viable prefix γ, then goto(I, X) is the set of items that are valid for the viable prefix γX
Example: If I is the set of two items {[E'→E.], [E→E.+T]}, then goto(I, +) consists of:
E→E+.T
T→.T*F
T→.F
F→.(E)
F→.id
The Sets-of-Items Construction
Below is the algorithm to construct C, the canonical collection of sets of LR(0) items for an augmented grammar G'

procedure items(G');
begin
  C := {closure({[S'→.S]})};
  repeat
    for each set of items I in C and each grammar symbol X
        such that goto(I, X) is not empty and not in C do
      add goto(I, X) to C
  until no more sets of items can be added to C
end

Example: Construction of the sets of items for the augmented grammar G1' above
I0 = {[E'→.E], [E→.E+T], [E→.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]}
I1 = goto(I0, E) = {[E'→E.], [E→E.+T]}
I2 = goto(I0, T) = {[E→T.], [T→T.*F]}
I3 = goto(I0, F) = {[T→F.]}
I4 = goto(I0, () = {[F→(.E)], [E→.E+T], [E→.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]}
I5 = goto(I0, id) = {[F→id.]}
I6 = goto(I1, +) = {[E→E+.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]}
I7 = goto(I2, *) = {[T→T*.F], [F→.(E)], [F→.id]}
I8 = goto(I4, E) = {[F→(E.)], [E→E.+T]}
goto(I4, () = I4; goto(I4, id) = I5
I9 = goto(I6, T) = {[E→E+T.], [T→T.*F]}
goto(I6, F) = I3; goto(I6, () = I4; goto(I6, id) = I5
I10 = goto(I7, F) = {[T→T*F.]}
goto(I7, () = I4; goto(I7, id) = I5
I11 = goto(I8, )) = {[F→(E).]}
goto(I8, +) = I6; goto(I9, *) = I7
Valid Items
We say an item A→β1.β2 is valid for a viable prefix αβ1 if there is a derivation S' =>*rm αAw =>rm αβ1β2w
If A→β1.β2 is valid, it tells us whether to shift or reduce when αβ1 is on top of the stack
Theorem: The set of valid items embodies all the useful information that can be gleaned from the stack
SLR Parsing Tables
Below is the method used to construct the SLR table
This method is the weakest, in terms of the number of grammars for which it succeeds, of the three methods we are going to see, but it is also the simplest to implement
SLR Table Construction Method
1. Construct C = {I0, I1, . . ., In}, the collection of the sets of LR(0) items for G'
2. State i is constructed from Ii and
   a. If [A→α.aβ] is in Ii and goto(Ii, a) = Ij (a is a terminal), then action[i, a] = shift j
   b. If [A→α.] is in Ii, then action[i, a] = reduce A→α for all a in FOLLOW(A), for A≠S'
   c. If [S'→S.] is in Ii, then action[i, $] = accept
If no conflicting action is created by step 2, the grammar is SLR(1); otherwise it is not
3. For all non-terminals A, if goto(Ii, A) = Ij, then goto[i, A] = j
4. All entries of the parsing table not defined by 2 and 3 are made error
5. The initial state is the one constructed from the set of items containing [S'→.S]
Example: Construct the SLR parsing table for the grammar G1'
FOLLOW(E) = {+, ), $}
FOLLOW(T) = {+, *, ), $}
FOLLOW(F) = {+, *, ), $}
By following the aforementioned method we find the parsing table used earlier
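As a cross-check, the whole construction — closure, goto, the canonical collection, and the table rules — can be sketched end-to-end for G1'. This is a minimal sketch with our own encoding; state numbering comes out arbitrary, so it only asserts structural properties (12 sets of items, no conflicting actions):

```python
# End-to-end SLR construction for G1'.  An item is (head, body, dot);
# frozensets make sets of items hashable.  FOLLOW is as listed above.
AUG = [("E'", ('E',)), ('E', ('E', '+', 'T')), ('E', ('T',)),
       ('T', ('T', '*', 'F')), ('T', ('F',)),
       ('F', ('(', 'E', ')')), ('F', ('id',))]
NONTERMS = {"E'", 'E', 'T', 'F'}
FOLLOW = {'E': {'+', ')', '$'}, 'T': {'+', '*', ')', '$'},
          'F': {'+', '*', ')', '$'}}

def closure(items):
    items = set(items)
    while True:                           # rule 2 of the closure operation
        new = {(h, b, 0) for (A, body, d) in items
               if d < len(body) and body[d] in NONTERMS
               for (h, b) in AUG if h == body[d]}
        if new <= items:
            return frozenset(items)
        items |= new

def goto(items, X):                       # move the dot past X, then close
    return closure({(A, b, d + 1) for (A, b, d) in items
                    if d < len(b) and b[d] == X})

symbols = NONTERMS | {'+', '*', '(', ')', 'id'}
I0 = closure({("E'", ('E',), 0)})
C = {I0}
while True:                               # the sets-of-items construction
    new = {goto(I, X) for I in C for X in symbols} - {frozenset()} - C
    if not new:
        break
    C |= new

states = {I: i for i, I in enumerate(sorted(C, key=sorted))}
action, goto_tab = {}, {}
for I, i in states.items():
    for (A, b, d) in I:
        if d < len(b):                    # rule 2a: shift on a terminal
            if b[d] not in NONTERMS:
                action.setdefault((i, b[d]), set()).add(
                    ('shift', states[goto(I, b[d])]))
        elif A == "E'":                   # rule 2c: accept on $
            action.setdefault((i, '$'), set()).add(('accept',))
        else:                             # rule 2b: reduce on FOLLOW(A)
            for a in FOLLOW[A]:
                action.setdefault((i, a), set()).add(('reduce', A, b))
    for X in NONTERMS:                    # rule 3: the goto table
        J = goto(I, X)
        if J:
            goto_tab[(i, X)] = states[J]

conflict_free = all(len(v) == 1 for v in action.values())
```

Every action cell ends up with exactly one entry, so G1' is SLR(1); up to renumbering of states, the tables agree with Figure 3.7.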
Example: Construct the SLR parsing table for the following grammar (G2')
S'→S
S→L=R
S→R
L→*R
L→id
R→L
C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9}
I0 = {[S'→.S], [S→.L=R], [S→.R], [L→.*R], [L→.id], [R→.L]}
I1 = goto(I0, S) = {[S'→S.]}
I2 = goto(I0, L) = {[S→L.=R], [R→L.]}
I3 = goto(I0, R) = {[S→R.]}
I4 = goto(I0, *) = {[L→*.R], [L→.*R], [L→.id], [R→.L]}
I5 = goto(I0, id) = {[L→id.]}
I6 = goto(I2, =) = {[S→L=.R], [R→.L], [L→.*R], [L→.id]}
I7 = goto(I4, R) = {[L→*R.]}
I8 = goto(I4, L) = {[R→L.]}
goto(I4, *) = I4; goto(I4, id) = I5
I9 = goto(I6, R) = {[S→L=R.]}
goto(I6, L) = I8; goto(I6, *) = I4; goto(I6, id) = I5
FOLLOW(S) = {$}, FOLLOW(R) = {$, =}, and FOLLOW(L) = {$, =}
We have a shift/reduce conflict since = is in FOLLOW(R), R→L. is in I2, and goto(I2, =) = I6
G2' is not an ambiguous grammar. However, it is not SLR
This is because the SLR parser is not powerful enough to remember enough left context to decide whether to shift or reduce when it sees an =
The next two methods are capable of solving this problem
But there still are some grammars that cannot be parsed by any of the methods seen here
Canonical LR Parsing
Let us analyze why we had a shift/reduce conflict in the previous example
With the SLR method, state i calls for reduction by A→α if the set of items Ii contains the item [A→α.] and a is in FOLLOW(A)
In the example, R→L. is in I2 and = is in FOLLOW(R)
However, in some situations, when state i appears on top of the stack, the viable prefix βα on the stack is such that βA cannot be followed by a in a right sentential form
Thus, the reduction by A→α would be invalid on input a
In the previous example, reducing would be an error
In fact, the states produced using LR(0) items do not hold enough information
It is possible to hold more information in the state to rule out some of these invalid reductions. By splitting states when necessary, we can indicate exactly which symbols can follow a handle
An LR(1) item has the form [A→α.β, a], where a is a terminal or $
The functions closure(I), goto(I, X) and items(G') will be slightly different compared to those used for the construction of an SLR parsing table
The Closure Operation
Let I be a set of LR(1) items. closure(I) is found using the following algorithm:
closure(I) := I
repeat
  for each item [A→α.Bβ, a] in closure(I), each production B→γ in G',
      and each terminal b in FIRST(βa) such that [B→.γ, b] is not in closure(I) do
    add [B→.γ, b] to closure(I)
until no more items can be added to closure(I)
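A sketch of this closure for grammar G2' (a minimal illustration with our own encoding; since G2' has no λ-productions, FIRST(βa) is just the FIRST of the first symbol of βa, which is hard-coded here):

```python
# LR(1) closure for G2'.  An item is (head, body, dot, lookahead).
G2 = [("S'", ('S',)), ('S', ('L', '=', 'R')), ('S', ('R',)),
      ('L', ('*', 'R')), ('L', ('id',)), ('R', ('L',))]
NONTERMS = {"S'", 'S', 'L', 'R'}
FIRST = {'S': {'*', 'id'}, 'L': {'*', 'id'}, 'R': {'*', 'id'}}

def first_of(symbols):
    # FIRST of βa; valid here because no symbol of G2' derives λ,
    # so only the first symbol matters.  A terminal is its own FIRST.
    X = symbols[0]
    return FIRST.get(X, {X})

def closure(items):
    items = set(items)
    while True:
        new = set()
        for (A, body, d, a) in items:
            if d < len(body) and body[d] in NONTERMS:
                # for each b in FIRST(βa), add [B -> .γ, b]
                for b in first_of(body[d + 1:] + (a,)):
                    for (h, rhs) in G2:
                        if h == body[d]:
                            new.add((h, rhs, 0, b))
        if new <= items:
            return frozenset(items)
        items |= new

I0 = closure({("S'", ('S',), 0, '$')})
```

The computed I0 contains, among others, [L→.*R, =] and [L→.*R, $] but not [R→.L, =], matching the closure listed in the example that follows.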
Example: This example uses grammar G2'
closure({[S'→.S, $]}) = {[S'→.S, $], [S→.L=R, $], [S→.R, $], [L→.*R, =], [L→.id, =], [R→.L, $], [L→.*R, $], [L→.id, $]}
FIRST($) = {$}
FIRST(=R$) = {=}
The Goto Operation
goto(I, X) is defined as the closure of the set of all items [A→αX.β, a] such that [A→α.Xβ, a] is in I
Example: goto(I0, S) = {[S'→S., $]}
The Sets-of-Items Construction
Below is an algorithm to construct C, the canonical collection of sets of LR(1) items for an augmented grammar G'
You can note that it is the same as the one used for the construction of the canonical collection of sets of LR(0) items; only the goto and closure operations differ

procedure items(G');
begin
  C := {closure({[S'→.S, $]})};
  repeat
    for each set of items I in C and each grammar symbol X
        such that goto(I, X) is not empty and not in C do
      add goto(I, X) to C
  until no more sets of items can be added to C
end

C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12, I13}
I0 = {[S'→.S, $], [S→.L=R, $], [S→.R, $], [L→.*R, =/$], [L→.id, =/$], [R→.L, $]}
I1 = goto(I0, S) = {[S'→S., $]}
I2 = goto(I0, L) = {[S→L.=R, $], [R→L., $]}
I3 = goto(I0, R) = {[S→R., $]}
I4 = goto(I0, *) = {[L→*.R, =/$], [L→.*R, =/$], [L→.id, =/$], [R→.L, =/$]}
I5 = goto(I0, id) = {[L→id., =/$]}
I6 = goto(I2, =) = {[S→L=.R, $], [R→.L, $], [L→.*R, $], [L→.id, $]}
I7 = goto(I4, R) = {[L→*R., =/$]}
I8 = goto(I4, L) = {[R→L., =/$]}
goto(I4, *) = I4; goto(I4, id) = I5
I9 = goto(I6, R) = {[S→L=R., $]}
I10 = goto(I6, L) = {[R→L., $]}
I11 = goto(I6, *) = {[L→*.R, $], [L→.*R, $], [L→.id, $], [R→.L, $]}
I12 = goto(I6, id) = {[L→id., $]}
goto(I11, *) = I11
goto(I11, id) = I12
goto(I11, L) = I10
I13 = goto(I11, R) = {[L→*R., $]}
Construction of the LR Parsing Table
1. Construct C = {I0, I1, . . ., In}, the collection of sets of LR(1) items for G'
2. State i of the parser is constructed from Ii. The parsing actions for state i are determined as follows:
   a. If [A→α.aβ, b] is in Ii and goto(Ii, a) = Ij (a is a terminal), then action[i, a] = shift j
   b. If [A→α., a] is in Ii and A≠S', then action[i, a] = reduce A→α
   c. If [S'→S., $] is in Ii, then action[i, $] = accept
If there is a conflict, the grammar is not LR(1)
3. If goto(Ii, A) = Ij, then goto[i, A] = j
4. All entries not defined by (2) and (3) are made error
5. The initial state is the one constructed from the set of items containing [S'→.S, $]
The table constructed using this method is called the canonical LR(1) parsing table
LALR Parsing
The third LR parsing table construction to be seen is the LALR parsing table construction
This is the most widely used method, since the table obtained using it is much smaller than that obtained by the canonical LR method
For example, for a language such as Pascal, the LALR parsing table will have some hundreds of states, while the canonical LR parsing table will have several thousands of states
Many more grammars are LALR than SLR, but there are only a few grammars that are LR but not LALR, and these grammars can be easily avoided
Easy algorithm for constructing an LALR parsing table
Definition: We call the core of a set of LR(1) items the set of the first components of its items
Example: In the canonical collection of sets of LR(1) items of the previous example, both I4 and I11 have the following identical core:
{[L→*.R], [L→.*R], [L→.id], [R→.L]}
Method
Refer to Algorithm 4.11 on page 238 of the textbook.
Exercise: Construct the LALR parsing table for the previous grammar.
Some remarks on constructing LALR parsing tables:
The above technique is not the best one: it takes more time to construct the LALR table this way than to construct the canonical LR table itself. In practice, we do not use the LR(1) items to construct the LALR table; the LALR items are constructed directly, which takes much less time. Another problem is the size of the table: an LALR table may have 20,000 entries. In this case it is important to use compression techniques so that the table occupies less space.
For example, rather than storing the same information in several entries, a pointer to the information may be stored in the table entries
We can also minimize the space needed to store the table by storing a list of actions rather than storing all the entries even when they are empty
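The core-merging idea behind LALR construction can be sketched in Python. The item representation (tuples) is my own; the two states shown are I4 and I11 of the example, which share a core.

```python
# Sketch: merge LR(1) item sets that share the same core (the items
# without their lookaheads), unioning the lookaheads, as in LALR
# construction.

def core(item_set):
    return frozenset((h, b, d) for h, b, d, _ in item_set)

def merge_states(states):
    """states: dict state_id -> list of (head, body, dot, lookahead)."""
    merged = {}   # core -> {(head, body, dot): set of lookaheads}
    groups = {}   # core -> list of original state ids merged together
    for sid, iset in states.items():
        c = core(iset)
        groups.setdefault(c, []).append(sid)
        tbl = merged.setdefault(c, {})
        for h, b, d, la in iset:
            tbl.setdefault((h, b, d), set()).add(la)
    return merged, groups

# I4 and I11 of the example grammar have the same core:
states = {
    4: [("L", ("*", "R"), 1, "="), ("L", ("*", "R"), 1, "$"),
        ("L", ("*", "R"), 0, "="), ("L", ("*", "R"), 0, "$"),
        ("L", ("id",), 0, "="), ("L", ("id",), 0, "$"),
        ("R", ("L",), 0, "="), ("R", ("L",), 0, "$")],
    11: [("L", ("*", "R"), 1, "$"), ("L", ("*", "R"), 0, "$"),
         ("L", ("id",), 0, "$"), ("R", ("L",), 0, "$")],
}
merged, groups = merge_states(states)
```

Here the two states collapse into a single LALR state whose item [L→*.R] carries the merged lookahead set {=, $}.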
4. Syntax-Directed Translation
Parsing an input without doing anything with it is useless. In practice, an input is parsed so that some code is generated, information is displayed, etc. These different actions are performed by the semantic actions associated with the different rules of the grammar. To execute these actions, the parse tree may or may not be constructed explicitly; the best solution performance-wise is to execute the actions while parsing.
4.1. Syntax Directed Definitions
A syntax directed definition is a generalization of the CFG in which each grammar symbol has an associated set of attributes (synthesized and inherited) and rules
An attribute can represent anything we choose (a string, a number, a type, a memory location, etc.).
The value of a synthesized attribute at a node is computed from the values of attributes at the children of that node in the parse tree (and at the node itself).
The value of an inherited attribute at a node is computed from the values of attributes at the siblings and/or parent of that node in the parse tree.
Semantic rules compute the values of attributes and hence set up dependencies between attributes, which are represented by a dependency graph.
Terminals can have synthesized attributes, but not inherited attributes.
The dependency graph makes it possible to find an evaluation order for the semantic rules.
A parse tree showing the values of the attributes is called an annotated or decorated parse tree.
Example of an attributed grammar:
/* Two attributes are defined for this grammar: v (value) and l (length); 2^L2.l denotes 2 to the power L2.l */
Syntax Rules      Semantic Rules
N → L1 . L2       N.v = L1.v + L2.v / 2^L2.l
L1 → L2 B         L1.v = 2 * L2.v + B.v
                  L1.l = L2.l + 1
L → B             L.v = B.v
                  L.l = 1
B → 0             B.v = 0
B → 1             B.v = 1
In the above example, everything is calculated from the leaves to the root; all attributes are synthesized, so all the attributes can be calculated in a single bottom-up pass.
Example of another attributed grammar:
/* Three attributes are defined for this grammar: v (value), l (length) and s (scale); 2^B.s denotes 2 to the power B.s */
Syntax Rules      Semantic Rules
N → L1 . L2       N.v = L1.v + L2.v
                  L1.s = 0
                  L2.s = -L2.l
L1 → L2 B         L1.l = L2.l + 1
                  L2.s = L1.s + 1
                  B.s = L1.s
                  L1.v = L2.v + B.v
L → B             L.v = B.v
                  L.l = 1
                  B.s = L.s
B → 0             B.v = 0
B → 1             B.v = 1 * 2^B.s
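The first grammar's purely synthesized evaluation can be sketched in Python. The function names are mine; the code mimics the rules N.v = L1.v + L2.v / 2^L2.l and L1.v = 2·L2.v + B.v bottom-up on a binary numeral.

```python
# Sketch: bottom-up evaluation of the synthesized attributes v (value)
# and l (length) for a binary numeral such as 1011.01.

def eval_L(bits):
    """Return (v, l) for a bit string, mimicking L -> L B | B."""
    v, l = int(bits[0]), 1              # L -> B : L.v = B.v, L.l = 1
    for b in bits[1:]:                  # L1 -> L2 B : L1.v = 2*L2.v + B.v
        v, l = 2 * v + int(b), l + 1    #             L1.l = L2.l + 1
    return v, l

def eval_N(numeral):
    """N -> L1 . L2 : N.v = L1.v + L2.v / 2^L2.l"""
    left, right = numeral.split(".")
    v1, _ = eval_L(left)
    v2, l2 = eval_L(right)
    return v1 + v2 / 2 ** l2

# eval_N("1011.01") -> 11.25
```

This is exactly the value a decorated parse tree for 1011.01 would show at the root.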
Exercise: Draw the decorated parse tree for the input 1011.01
Formal definition
In a syntax directed definition, each grammar production A→α has associated with it a set of semantic rules of the form b := f(c1, c2, ..., ck), where f is a function and b, c1, ..., ck are attributes of A and of the symbols on the right side of the production.
We say that:
b is a synthesized attribute of A if c1, c2, ..., ck are attributes belonging to the grammar symbols of the production, and
b is an inherited attribute of one of the grammar symbols on the right side of the production if c1, c2, ..., ck are attributes belonging to the grammar symbols of the production.
In either case, we say that attribute b depends on attributes c1, c2, ..., ck.
Exercise: Attributed grammar that calculates the value of an expression.
Note: id.value is the attribute of id that gives its value.
Syntax Rules      Semantic Rules
E → E1 + T        E.v := E1.v + T.v
E → T             E.v := T.v
T → T1 * F        T.v := T1.v * F.v
T → F             T.v := F.v
F → id            F.v := id.value
F → (E)           F.v := E.v
Exercise: Attributed grammar that associates to an identifier its type.
Syntax Rules      Semantic Rules
D → T L           L.in := T.type
T → real          T.type := real
L → ident         ident.type := L.in
L1 → L2 , ident   L2.in := L1.in
                  ident.type := L1.in
Note: Here, the semantic action of an ident may have the side effect of adding the type in the symbol table for that particular identifier.
Dependency Graph
The attributes should be evaluated in a given order because they depend on one another. The dependency of the attributes is represented by a dependency graph.
There is an edge from b(j) to a(i), written b(j) → a(i), if and only if there exists a semantic action such that a(i) := f(..., b(j), ...)
Algorithm for the construction of the dependency graph:
for each node n in the parse tree do
    for each attribute a of the grammar symbol at n do
        construct a node in the dependency graph for a;
for each node n in the parse tree do
    for each semantic rule b := f(c1, c2, ..., ck) associated with the production used at n do
        for i := 1 to k do
            construct an edge from the node for ci to the node for b;
Evaluation Order
A topological sort of a directed acyclic graph is any ordering m1, m2, ..., mk of the nodes of the graph such that edges go from nodes earlier in the ordering to later nodes; i.e., if mi → mj is an edge from mi to mj, then mi appears before mj in the ordering.
Note: It is not possible to find such an order for a cyclic graph.
Several methods have been proposed for evaluating semantic rules:
1. Parse rule based methods: for each input, the compiler finds an evaluation order. These methods fail only if the dependency graph for that particular parse tree has a cycle
2. Rule based methods: the order in which the attributes associated with a production are evaluated is predetermined at compiler-construction time. For this method, the dependency graph need not be constructed
3. Oblivious methods: the evaluation order is chosen without considering the semantic rules. This restricts the class of syntax directed definitions that can be used
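Finding an evaluation order from the dependency graph can be sketched as a topological sort (Kahn's algorithm). The node names below are hypothetical attribute instances; a cycle means no evaluation order exists, as noted above.

```python
# Sketch: topological sort of an attribute dependency graph, yielding
# an evaluation order for the semantic rules.
from collections import deque

def topo_order(nodes, edges):
    """edges: list of (c, b) meaning attribute b depends on attribute c."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for c, b in edges:
        succ[c].append(b)
        indeg[b] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cyclic dependency graph: no evaluation order")
    return order

# e.g. topo_order(["B.v", "L.v", "L.l", "N.v"],
#                 [("B.v", "L.v"), ("L.v", "N.v"), ("L.l", "N.v")])
# evaluates B.v before L.v, and both L.v and L.l before N.v
```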
4.2. Some Classes of Non-Circular Attributed Grammars
4.2.1. S-Attributed Grammars
We say that an attributed grammar is S-Attributed when all of its attributes are synthesized; i.e., it doesn't have inherited attributes. Synthesized attributes can be evaluated by a bottom-up parser as the input is being parsed
A new stack will be maintained to store the values of the attributes, as in the following example.
Example:
E → E1 + E2    {E.v = E1.v + E2.v}
               Val(newTop) = Val(oldTop) + Val(oldTop − 2)
               $$ = $1 + $3 (in Yacc)
We assume that the synthesized attributes are evaluated just before each reduction. Before the reduction, the attribute of E2 is in Val(Top) and that of E1 is in Val(Top − 2); the '+' token occupies Val(Top − 1).
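The value-stack discipline can be sketched in Python. The shift/reduce function names are mine; the reduce step mirrors $$ = $1 + $3 for E → E1 + E2.

```python
# Sketch: a value stack (Val) maintained in parallel with the parser's
# state stack.  On reducing E -> E1 + E2, E2's value is at the top and
# E1's value two slots below ('+' sits between them).

val = []                      # the Val stack

def shift(v):
    val.append(v)             # push the token's attribute value

def reduce_plus():
    """Reduce E -> E1 + E2 and push E.v, i.e. $$ = $1 + $3."""
    e2 = val.pop()            # $3
    val.pop()                 # the '+' token
    e1 = val.pop()            # $1
    val.append(e1 + e2)       # $$

# Bottom-up parse of "1 + 2":
shift(1); shift("+"); shift(2)
reduce_plus()
# val == [3]
```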
Val The semantic actions that reference the attributes of the grammar will in fact be translated by the
Compiler generator (such as Yacc) into codes that reference the value stack4.2.2. L-Attributed Grammars
It is difficult to execute all the tasks of the compiler with synthesized attributes alone. The L-attributed (L stands for left) class of grammars allows a limited kind of inherited attributes.
Definition: A grammar is L-attributed if and only if, for each rule X0→X1 X2 ... Xj ... Xn, all inherited attributes of Xj depend only on:
1. Attributes of X1, ..., Xj-1
2. Inherited attributes of X0
Of course, all S-attributed grammars are L-attributed.
Example:
A → L M    {L.h = f1(A.h)
            M.h = f2(L.s)
            A.s = f3(M.s)}
This production does not contradict the rules of L-attributed grammars. Therefore, the corresponding grammar may be L-attributed if all of the other productions follow the rule of L-attributed grammars
Example:
A → Q R    {R.h = f4(A.h)
            Q.h = f5(R.s)
            A.s = f6(Q.s)}
The grammar containing this production is not L-attributed, since the inherited attribute Q.h depends on R.s, an attribute of a symbol to its right, which contradicts the rule for L-attributed grammars
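The L-attributed condition can be sketched as a small check in Python. The dependency encoding (1-based symbol positions, 0 for the head X0) is my own; the two test cases correspond to the A → L M and A → Q R productions above.

```python
# Sketch: check the L-attributed condition for one production.
# Each dependency (j, k, kind) says an inherited attribute of the
# symbol at position j depends on an attribute of the symbol at
# position k (0 = head X0); kind is "inh" or "syn".

def is_l_attributed(deps):
    for j, k, kind in deps:
        if k == 0 and kind == "inh":   # inherited attribute of X0: allowed
            continue
        if 1 <= k < j:                 # attribute of X1..X(j-1): allowed
            continue
        return False                   # anything else violates the rule
    return True

# A -> L M : L.h = f1(A.h), M.h = f2(L.s)  -> L-attributed
# is_l_attributed([(1, 0, "inh"), (2, 1, "syn")]) is True
# A -> Q R : Q.h = f5(R.s) depends on a symbol to its right -> not L-attributed
# is_l_attributed([(2, 0, "inh"), (1, 2, "syn")]) is False
```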
5. Type Checking
The compiler must check whether the source program follows the semantic conventions of the source language. This is called static checking (to distinguish it from dynamic checking, executed during execution of the program). Static checking ensures that certain kinds of errors are detected and reported. The following are examples of static checks:
1. Type Checks. A compiler should report an error if an operator is applied to an incompatible operand
2. Flow-of-Control Checks. Statements that cause flow of control to leave a construct must have some place to which to transfer the flow of control. For example, a break statement in C must be enclosed in a loop or switch statement
3. Uniqueness Checks. There are situations in which an object must be defined exactly once. For example, in Pascal, an identifier must be declared uniquely, labels in a case statement must be distinct, and elements in a scalar type may not be repeated
4. Name-Related Checks. Sometimes, the same name must appear two or more times. For example, in Ada, a loop or block may have a name that appears at the beginning and end of the construct. The compiler must check that the same name is used at both places
A type checker verifies that the type of a construct matches the type expected by its context. For example, a type checker should verify that the type of the value assigned to a variable is compatible with the type of the variable
Type information produced by the type checker may be needed when the code is generated. It is especially important for overloaded and polymorphic functions. An overloaded function represents different operations in different contexts (e.g., the + operator). A polymorphic function can be executed with arguments of several types.
5.1. Type Systems
In almost all languages, types are either basic or constructed. In Pascal, basic types are boolean, character, integer and real. Enumerated types and subranges can also be considered basic types. Pascal also allows a programmer to construct types from basic and other constructed types; types produced in such a way are called constructed types. For example, arrays, records and sets are constructed types.
Type Expressions
The type of a language construct will be denoted by a type expression. A type expression is either a basic type or is formed by applying an operator called a type constructor to other type expressions. In this chapter the following are type expressions:
1. A basic type is a type expression
2. A type name is a type expression
3. A type constructor applied to type expressions is a type expression. Constructors include:
   a. Arrays. If I is an index set and T is a type expression, then array (I, T) is a type expression
b. Products. If T1 and T2 are type expressions, the Cartesian product T1 × T2 is a type expression. × is left associative
c. Records. The difference between products and records is that record fields have names. For example, in Pascal:
type row = record
         address: integer;
         lexeme: array [1 .. 15] of char;
      end;
var table : array [1 .. 101] of row;
The type expression corresponding to the above type will be:
record ((address × integer) × (lexeme × array (1..15, char)))
d. Pointers. If T is a type expression, then pointer(T) is a type expression
e. Functions. The type expression of a function has the form D→R, where D is the type expression of the parameters and R is the type expression of the returned value. For example, the type expression corresponding to the Pascal declaration
       function f (a, b : char): ^integer;
   is
       char × char → pointer (integer)
A convenient way to represent a type expression is to use a graph (tree or DAG). For example, the type expression corresponding to the above function declaration can be represented with the tree shown on page 347 of the textbook
4. Type expressions may contain variables whose values are type expressions
Type Systems
A type system is a collection of rules implemented by the type checker for assigning type expressions to the various parts of a program
Different type systems may be used by different compilers for the same language; for example, some compilers implement stricter rules than others. On UNIX systems, for instance, the lint utility examines C programs for possible bugs using a more detailed type system than the one the C compiler itself uses.
Error Recovery
At the very least, the compiler must report the nature and location of errors. It is also desirable that the type checker recovers from errors and continues checking the rest of the input. The inclusion of error handling may result in a type system that goes beyond the one needed to specify correct programs. For example, we may not know the type of incorrectly formed program fragments; in this case, techniques similar to those needed for languages that do not require identifiers to be declared will have to be used to cope with the missing information.
5.2. Specification of a simple type checker
The following example gives the type system for a Pascal grammar specified in a syntax directed manner
Expressions:
E → literal      {E.type := char}
E → num          {E.type := integer}
E → id           {E.type := lookup (id.entry)}
E → E1 mod E2    {E.type := if E1.type = integer and E2.type = integer then integer
                            else type_error}
E → E1[E2]       {E.type := if E2.type = integer and E1.type = array(s, t) then t
                            else type_error}
E → ^E1          {E.type := if E1.type = pointer(t) then t
                            else type_error}
Statements:
S → id := E          {S.type := if id.type = E.type then void
                                else type_error}
S → if E then S1     {S.type := if E.type = boolean then S1.type
                                else type_error}
S → while E do S1    {S.type := if E.type = boolean then S1.type
                                else type_error}
S → S1 ; S2          {S.type := if S1.type = void and S2.type = void then void
                                else type_error}
The following rule gives the type checking for function calls:
E → E1 (E2)          {E.type := if E2.type = s and E1.type = s→t then t
                                else type_error}
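A few of the rules above can be sketched as a recursive checker in Python. The AST encoding (nested tuples), the environment dictionary, and the type representation are assumptions of mine, not part of the course notes.

```python
# Sketch: a recursive type checker implementing some of the rules above.
# Types are strings or tuples: ("array", s, t), ("pointer", t).

TYPE_ERROR = "type_error"
env = {"i": "integer", "a": ("array", "integer", "char"), "b": "boolean"}

def check_expr(e):
    kind = e[0]
    if kind == "num":
        return "integer"
    if kind == "id":
        return env[e[1]]                      # lookup(id.entry)
    if kind == "mod":                         # E -> E1 mod E2
        t1, t2 = check_expr(e[1]), check_expr(e[2])
        return "integer" if t1 == t2 == "integer" else TYPE_ERROR
    if kind == "index":                       # E -> E1[E2]
        t1, t2 = check_expr(e[1]), check_expr(e[2])
        if t2 == "integer" and isinstance(t1, tuple) and t1[0] == "array":
            return t1[2]                      # the element type t
        return TYPE_ERROR
    raise ValueError(kind)

def check_stmt(s):
    if s[0] == "assign":                      # S -> id := E
        return "void" if env[s[1]] == check_expr(s[2]) else TYPE_ERROR
    if s[0] == "while":                       # S -> while E do S1
        return check_stmt(s[2]) if check_expr(s[1]) == "boolean" else TYPE_ERROR
    if s[0] == "seq":                         # S -> S1 ; S2
        ok = check_stmt(s[1]) == "void" == check_stmt(s[2])
        return "void" if ok else TYPE_ERROR
    raise ValueError(s[0])
```

For instance, indexing the array a with the integer i yields char, while using the integer i as a while-condition yields type_error.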
5.3. Equivalence of Type Expressions
In the above sections we compared type expressions using the equality operator. However, such an operator is not defined except perhaps between basic types; we should rather use a more appropriate notion of equivalence. A natural notion is structural equivalence: two type expressions are structurally equivalent if and only if they are the same basic type or are formed by applying the same constructor to structurally equivalent types.
For example, integer is equivalent only to integer and pointer (integer) is structurally equivalent to pointer (integer)
Some relaxing rules are very often added to this notion of equivalence For example, when arrays are passed as parameter, the array boundaries of the effective
parameters may not be the same as those of the formal parameters. In some languages, types may be given names. For example in Pascal we can define:
type
    ptr = ^integer;
var
    A : ptr;
    B : ptr;
    C : ^integer;
    D, E : ^integer;
Do the variables A, B, C, D, E have the same type? Surprisingly, the answer varies from implementation to implementation. When names are allowed in type expressions, a new notion of equivalence is introduced: name equivalence. Two type expressions are name equivalent if and only if they are identical. Under structural equivalence, names are replaced by the type expressions they define; two types are structurally equivalent if they represent structurally equivalent type expressions when all names have been substituted.
For example, ptr and pointer (integer) are not name equivalent, but they are structurally equivalent.
Note: Confusion arises from the fact that many implementations associate an implicit type name with each declared identifier. Thus, C and D of the above example may not be name equivalent.
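The two notions of equivalence can be sketched in Python. The tuple encoding of type expressions and the defs dictionary are assumptions of mine; the example mirrors the ptr = ^integer declaration above.

```python
# Sketch: structural vs. name equivalence for type expressions.
# A named type is ("name", n); defs maps names to their definitions.

defs = {"ptr": ("pointer", "integer")}         # type ptr = ^integer

def expand(t):
    """Substitute type names by the expressions they define."""
    if isinstance(t, tuple) and t[0] == "name":
        return expand(defs[t[1]])
    if isinstance(t, tuple):
        return tuple(expand(x) for x in t)
    return t

def name_equiv(t1, t2):
    return t1 == t2                            # identical expressions only

def struct_equiv(t1, t2):
    return expand(t1) == expand(t2)            # equal after substitution

A = ("name", "ptr")                            # A : ptr
C = ("pointer", "integer")                     # C : ^integer
# name_equiv(A, C) is False, but struct_equiv(A, C) is True
```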
5.4. Type Conversions
Consider an expression like x + i, where x is of type real and i is of type integer. The machine cannot execute this operation directly, as it involves different types of values. However, most languages accept such expressions; the compiler is in charge of converting one of the operands into the type of the other. The type checker can be used to insert these conversion operations into the intermediate representation of the source program. For example, an operator inttoreal may be inserted whenever an operand needs to be implicitly converted.
6. Intermediate Code Generation
In a compiler, the front end translates a source program into an intermediate representation, and the back end generates the target code from this intermediate representation. The benefits of using a machine-independent intermediate code (IC) are:
retargeting to another machine is facilitated
optimization can be done on the machine-independent code
In this chapter, we assume that error checking is done in another pass, even though in practice IC generation and type checking can be done at the same time.
6.1. Intermediate Languages
Syntax trees, postfix notation, and three-address code are intermediate representations.
Syntax Tree
A syntax tree (abstract tree) is a condensed form of parse tree useful for representing language constructs
For example, for the string a + b, the parse tree in (a) below will be represented by the syntax tree shown in (b); in a syntax tree, operators and keywords do not appear as leaves, but rather are associated with the interior node that would be the parent of those leaves in the parse tree
a. Parse tree for a + b        b. Abstract syntax tree for a + b
Postfix Notation
The postfix notation is practical for an intermediate representation as the operands are found just before the operator
In fact, the postfix notation is a linearized representation of a syntax tree; e.g., (1 + 2) * 3 is represented in postfix notation as 1 2 + 3 *
Three-Address Code
The three-address code is a sequence of statements of the form:
X := Y op Z
where X, Y, and Z are names, constants or compiler-generated temporaries, and op is an operator such as an integer or floating-point arithmetic operator, or a logical operator on boolean data.
Important notes:
No built-up arithmetic expression is permitted: only one operator on the right side of the assignment is possible, i.e., x + y + z is not possible in a single statement.
Like postfix notation, the three-address code is a linearized representation of a syntax tree. It has been given the name three-address code because each instruction usually contains three addresses (the two operands and the result).
Types of Three-Address Statements
As with assembler statements, three-address code statements can have symbolic labels, and there are statements for control flow
Common three-address code statements:
Statement                            Format                 Comments
1. Assignment (binary operation)     X := Y op Z            Arithmetic and logical operators used
2. Assignment (unary operation)      X := op Y              Unary -, not, conversion operators used
3. Copy statement                    X := Y
4. Unconditional jump                goto L
5. Conditional jump                  if X relop Y goto L
6. Function call                     param X1               The parameters are specified by param
                                     param X2               The procedure p is called by indicating
                                     ...                    the number of parameters n
                                     param Xn
                                     call p, n
7. Indexed assignments               X := Y[I]              X is assigned the value at the address Y + I
                                     Y[I] := X              The value at the address Y + I is assigned X
8. Address and pointer assignments   X := &Y                X is assigned the address of Y
                                     X := *Y                X is assigned the value at the address Y
                                     *X := Y                The value at the address X is assigned Y
The choice of allowable operators is an important issue in the design of an intermediate form. It should be rich enough to implement the operations of the source language, and yet it should not be too complicated to translate into the target language.
Syntax-Directed Translation into Three-Address Code
Syntax directed translation can be used to generate the three-address code. Generally, either the three-address code is generated as an attribute of the attributed parse tree, or the semantic actions have side effects that write the three-address code statements in a file. When the three-address code is generated, it is often necessary to use temporary variables and temporary names. To this end the following functions are given:
newtemp - each time this function is called, it gives distinct names that can be used for temporary variables
newlabel - each time this function is called, it gives distinct names that can be used for label names
In addition, for convenience, we use the notation gen to create a three-address code from a number of strings
gen will produce a three-address code after concatenating all the parameters
For example, if id1.lexeme = x, id2.lexeme =y and id3.lexeme = z: gen (id1.lexeme, ‘:=’, id2.lexeme, ‘+’, id3.lexeme) will produce the three-address code: x := y + z
Note: variables and attribute values are evaluated by gen before being concatenated with the other parameters
Example: Generation of the three-address code for an assignment statement and an expression
Production     Semantic Rules
S → id := E    S.code := E.code || gen (id.lexeme, ‘:=’, E.place)
E → E1 + E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘+’, E2.place)
E → E1 * E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘*’, E2.place)
E → - E1       E.place := newtemp;
               E.code := E1.code || gen (E.place, ‘:=’, ‘uminus’, E1.place)
E → (E1)       E.place := E1.place;
               E.code := E1.code
E → id         E.place := id.lexeme;
               E.code := ‘’ /* empty code */
The attribute place holds the name of the location where the value of the grammar symbol is found; the attribute code holds the sequence of three-address statements evaluating the grammar symbol.
The function newtemp returns a sequence of distinct names t1, t2, ... in response to successive calls.
The three-address code produced for the input a := x + y * z will be:
t1 := y * z
t2 := x + t1
a := t2
This is found by concatenating, at the semantic action of each rule, the code that has been previously calculated to the code of this particular rule
For example, E→E1 + E2 should add the variable where the result of E1 is and the variable where the result of E2 is and put the result in a temporary variable
We can determine the names of the variables that contain the results of E1 and E2 by using synthesized attributes named place
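The place/code scheme can be sketched in Python. The nested-tuple expression representation and the function names are assumptions of mine; the logic follows the semantic rules above.

```python
# Sketch: syntax-directed generation of three-address code for
# expressions, following the place/code attribute rules.

temp_count = 0
def newtemp():
    """Return distinct names t1, t2, ... on successive calls."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen_expr(e):
    """Return (place, code) for an expression.
    e is an identifier string or a tuple (op, left, right)."""
    if isinstance(e, str):                 # E -> id
        return e, []
    op, l, r = e
    lp, lc = gen_expr(l)
    rp, rc = gen_expr(r)
    place = newtemp()                      # E.place := newtemp
    return place, lc + rc + [f"{place} := {lp} {op} {rp}"]

def gen_assign(target, e):                 # S -> id := E
    place, code = gen_expr(e)
    return code + [f"{target} := {place}"]

code = gen_assign("a", ("+", "x", ("*", "y", "z")))
# code == ["t1 := y * z", "t2 := x + t1", "a := t2"]
```

Running this on a := x + y * z reproduces exactly the three statements shown above.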
Example: Semantic rules generating code for a while statement
Production           Semantic Rules
S → while E do S1    S.begin := newlabel;
                     S.after := newlabel;
                     S.code := gen (S.begin, ‘:’) || E.code || gen (‘if’, E.place, ‘= 0 goto’, S.after) || S1.code || gen (‘goto’, S.begin) || gen (S.after, ‘:’)
The above semantic rules will create three-address code of the following form:
L1:
    E.code
    if E.place = 0 goto L2
    S1.code
    goto L1
L2:
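The while rule can likewise be sketched in Python. The function names are mine; the code of E and S1 and E's place are taken as already-generated inputs.

```python
# Sketch: code generation for "while E do S1" using fresh labels,
# given the already-generated code of E and S1 and E.place.

label_count = 0
def newlabel():
    """Return distinct label names L1, L2, ... on successive calls."""
    global label_count
    label_count += 1
    return f"L{label_count}"

def gen_while(e_code, e_place, s1_code):
    begin, after = newlabel(), newlabel()      # S.begin, S.after
    return ([f"{begin}:"] + e_code
            + [f"if {e_place} = 0 goto {after}"]
            + s1_code
            + [f"goto {begin}", f"{after}:"])

code = gen_while(["t1 := i < n"], "t1", ["i := i + 1"])
# ["L1:", "t1 := i < n", "if t1 = 0 goto L2", "i := i + 1", "goto L1", "L2:"]
```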
6.2. Declarations
Declarations are used by the compiler as a source of type information that it stores in the symbol table. While processing declarations, the compiler reserves a memory area for the variables and stores the relative address of each variable in the symbol table. The relative address is an offset from the start of the static data area.
We use in this section a number of variables, attributes and procedures that help in processing declarations:
The compiler maintains a global variable offset that indicates the first address not yet allocated. Initially, offset is 0; each time an address is allocated to a variable, offset is incremented by the width of the data object denoted by the name.
The procedure enter(name, type, address) creates a symbol table entry for name, giving it the type type and the relative address address.
The synthesized attributes type and width of the non-terminal T indicate the type and the number of memory units taken by objects of that type.
Example: Semantic actions for the declaration part. We consider that an integer and a pointer occupy 4 bytes and a real number occupies 8 bytes of memory.
Production             Semantic Rules
P → {offset := 0} D S
D → D ; D
D → id : T             {enter (id.name, T.type, offset);
                        offset := offset + T.width}
T → integer            {T.type := integer;
                        T.width := 4}
T → real               {T.type := real;
                        T.width := 8}
T → array [num] of T1  {T.type := array (num.val, T1.type);
                        T.width := num.val * T1.width}
T → ^T1                {T.type := pointer (T1.type);
                        T.width := 4}
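The offset bookkeeping of these rules can be sketched in Python. The symbol-table representation (a dictionary) is my own; the widths follow the rules above (integer and pointer: 4 bytes, real: 8, array: num.val times the element width).

```python
# Sketch: processing declarations with a global offset, as in the
# semantic rules above.

symtab = {}            # name -> (type, relative address)
offset = 0             # first address not yet allocated

def width(t):
    if t == "real":
        return 8
    if isinstance(t, tuple) and t[0] == "array":
        return t[1] * width(t[2])              # num.val * T1.width
    return 4                                    # integer, pointer

def enter(name, t):
    """enter(name, type, offset), then bump offset by T.width."""
    global offset
    symtab[name] = (t, offset)
    offset += width(t)

# i : integer; x : real; v : array [10] of integer
enter("i", "integer")
enter("x", "real")
enter("v", ("array", 10, "integer"))
# i at 0, x at 4, v at 12; final offset is 52
```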
Note: In languages where nested procedures are possible, we must have several symbol tables, one for each procedure.
6.3. Assignment Statements
Using the symbol table, we will now see how to generate the three-address code statements corresponding to assignments.
In an earlier example, we have generated three-address code statements where variables are represented by their names
However, it is more common and practical for the implementation to represent the variables by their symbol table entries
The function lookup(lexeme) checks if there is an entry for this occurrence of the name in the symbol table, and if so a pointer to the entry is returned; otherwise nil is returned.
The newtemp function will generate temporary variables and reserve memory for them by updating the offset and recording the reserved addresses in the symbol table
Example: generation of the three-address code for the assignment statement and simple expressions
Production     Semantic Rules
S → id := E    p := lookup (id.name);
               if p <> nil then S.code := E.code || gen (p.lexeme, ‘:=’, E.place)
               else error
E → E1 + E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘+’, E2.place)
E → E1 * E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘*’, E2.place)
E → - E1       E.place := newtemp;
               E.code := E1.code || gen (E.place, ‘:=’, ‘uminus’, E1.place)
E → (E1)       E.place := E1.place;
               E.code := E1.code
E → id         p := lookup (id.lexeme);
               if p <> nil then E.place := p.lexeme else error;
               E.code := ‘’ /* empty code */
Addressing Array Elements
The elements of an array are stored in a block of consecutive locations. If the width of each array element is w, the relative address of the array is base, and the lower bound of the index is low, then the i-th element of the array is found at the address:
base + (i − low) * w
For example, for an array declared as A : array [5..10] of integer stored at the address 100, the address of A[7] = 100 + (7 − 5) * 4 = 108
When the index is a constant, as above, it is possible to evaluate the address of A[i] at compile time. Most of the time, however, the value of the index is not known at compile time, and the compiler can only produce the three-address statements that will calculate the address at execution time. Even in this case, part of the calculation can be done at compile time: since base + (i − low) * w = i * w + (base − low * w), the subexpression base − low * w can be calculated at compile time, saving time at execution.
Note: The function width(arrayname) returns the width of an element of the array called arrayname by looking in the symbol table. The function base(arrayname) returns the base address of the array called arrayname by looking in the symbol table. The function low(arrayname) similarly returns its lower bound.
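The split between the compile-time and run-time parts of the address computation can be sketched in Python; the function names are mine.

```python
# Sketch: splitting base + (i - low) * w into a compile-time constant
# c = base - low * w and a run-time computation i * w + c.

def compile_time_part(base, low, w):
    return base - low * w            # known once the declaration is seen

def element_address(i, w, c):
    return i * w + c                 # the only work left at run time

# A : array [5..10] of integer stored at address 100 (w = 4):
c = compile_time_part(100, 5, 4)     # c = 80
# element_address(7, 4, c) gives the same 108 as 100 + (7 - 5) * 4
```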
Example: generation of the three-address code for assignment statements and expressions (including array references)

Production     Semantic Rules
S → L := E     if L.offset = nil then /* L is a simple id */
                   S.code := L.code || E.code || gen (L.place, ‘:=’, E.place)
               else
                   S.code := L.code || E.code || gen (L.place, ‘[’, L.offset, ‘] :=’, E.place)
E → E1 + E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘+’, E2.place)
E → E1 * E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘*’, E2.place)
E → - E1       E.place := newtemp;
               E.code := E1.code || gen (E.place, ‘:=’, ‘uminus’, E1.place)
E → (E1)       E.place := E1.place;
               E.code := E1.code
E → L          if L.offset = nil then /* L is a simple id */
               begin
                   E.place := L.place;
                   E.code := L.code
               end
               else
               begin
                   E.place := newtemp;
                   E.code := L.code || gen (E.place, ‘:=’, L.place, ‘[’, L.offset, ‘]’)
               end
L → id [E]     L.place := newtemp;
               L.offset := newtemp;
               L.code := E.code || gen (L.place, ‘:=’, base (id.lexeme) − width (id.lexeme) * low (id.lexeme)) || gen (L.offset, ‘:=’, E.place, ‘*’, width (id.lexeme))
L → id         p := lookup (id.lexeme);
               if p <> nil then L.place := p.lexeme else error;
               L.offset := nil; /* for a simple identifier */
               L.code := ‘’ /* empty code */
Example: three-address code generation for the input x := A[y]
A is stored at the address 100, its values are integers (width = 4) and low = 1, so base − low * width = 96. The semantic actions will generate the following three-address code:
t1 := 96
t2 := y * 4
t3 := t1[t2]
x := t3
Exercise: produce the attributed (decorated) parse tree.
Example: three-address code generation for the input tab1[i + k] := x + tab2[j]
tab1 is stored at the address 100 and its values are integers (width = 4, low = 1)
tab2 is stored at the address 200 and its values are integers (width = 4, low = 1)
The semantic actions will generate the following three-address code:
t4 := i + k
t5 := 96
t6 := t4 * 4
t1 := 196
t2 := j * 4
t3 := t1[t2]
t7 := x + t3
t5[t6] := t7
Exercise: produce the attributed (decorated) parse tree.
6.4. Backpatching
The main problem for generating code for control statements in a single pass is that, during one single pass, we may not know the labels where the control must go at the time the jump statements are generated
We can solve this problem by generating jump statements where the targets are temporarily left unspecified
Each such statement will be put on a list of goto statements whose labels will be filled when determined
We call this backpatching and it is widely used in three-address code generation
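The backpatching idea can be sketched in Python. The helper names follow the usual makelist/merge/backpatch vocabulary; the instruction encoding (strings with a "_" placeholder for the unknown target) is an assumption of mine.

```python
# Sketch: backpatching.  Jumps are emitted with an unspecified target
# ("_"), remembered on a list, and patched once the label is known.

code = []

def emit(instr):
    code.append(instr)
    return len(code) - 1                  # index of the new instruction

def makelist(i):
    return [i]                            # a list holding one jump to patch

def merge(l1, l2):
    return l1 + l2

def backpatch(lst, target):
    for i in lst:
        code[i] = code[i].replace("_", str(target))

# Emit two jumps with unknown targets, then fill them in later:
truelist = makelist(emit("if x goto _"))
falselist = makelist(emit("goto _"))
backpatch(truelist, 100)
backpatch(falselist, 104)
# code == ["if x goto 100", "goto 104"]
```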
7. Code Generation
The final phase in our compiler model is the code generator. It takes as input an intermediate representation of the source program and produces as output an equivalent target program, as indicated in Figure 7.1.
Figure 7.1: Position of code generator
8. Code Optimization