TRANSCRIPT
1. Introduction

Programming languages are notations for describing computations to people and to machines. A program must be translated into a form in which it can be executed by a computer. The software systems that do this translation are known as compilers. This course is about how to design and implement compilers.
1.1. Language Processors

A compiler is a program that can read a program written in one language – the source language – and translate it into an equivalent program in another language – the target language (see Figure 1.1).
Figure 1.1: Compiler
A compiler also reports any errors in the source program that it detects during the translation process
If the target program is an executable machine-language program, it can then be called by the user to process input and produce output
Figure 1.2: A language processing system
[Figure 1.1 depicts: source program → Compiler → target program. Figure 1.2 depicts the pipeline: source program → Preprocessor → modified source program → Compiler → target assembly program → Assembler → relocatable machine code → Linker/Loader (combined with relocatable object files and library files) → target machine code.]
An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on input supplied by the user.
The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.
An interpreter can usually give better error diagnostics than a compiler, because it executes the source program statement by statement
Several other programs may be needed in addition to a compiler to create an executable program, as shown in Figure 1.2.
The task of a preprocessor (a separate program) is collecting modules of a program stored in separate files. It may also expand shorthands, called macros, into source language statements. The modified source program is fed to a compiler.
The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and easier to debug. The assembly-language program is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked with other relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory addresses, where the code in one file may refer to a location in another file. The loader then puts all the executable object files into memory for execution.
1.2. The Structure of a Compiler

Two parts are responsible for mapping a source program into a semantically equivalent target program: analysis and synthesis.
The analysis part breaks up the program into constituent pieces and imposes grammatical structure on them. It then uses this structure to create an intermediate representation of the source program. If the analysis part detects errors (syntactic or semantic), it provides informative messages. The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part, the back end.
The compilation process operates as a sequence of phases, each of which transforms one representation of the source program into another. A typical decomposition of a compiler into phases is shown in Figure 1.3. In practice several phases may be grouped together, and the intermediate representations between them need not be constructed explicitly.
Figure 1.3: Phases of a compiler
[Figure 1.3 shows the sequence: character stream → Lexical Analyzer → token stream → Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation → Machine-Independent Code Optimizer → intermediate representation → Code Generator → target-machine code → Machine-Dependent Code Optimizer → target-machine code, with the Symbol Table shared by all phases.]
The symbol table, which stores information about the entire source program, is used by all phases of the compiler
1.2.1. Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program and groups them into meaningful sequences called lexemes. For each lexeme the lexical analyzer produces a token of the form

<token-name, attribute-value>

that it passes on to the subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token.
For example, suppose a source program contains the following assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol table entry for position. The symbol table entry holds information about the identifier, such as its name and type.
2. = is a lexeme that is mapped into the token <=>. Since it needs no attribute value, the second component is omitted.
3. initial is a lexeme that would be mapped into a token <id, 2>, where 2 points to the symbol table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that would be mapped into a token <id, 3>, where 3 points to the symbol table entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
After lexical analysis, the sequence of tokens for the assignment (1.1) is

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively
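As a sketch of how a scanner could produce the token stream (1.2): the lexer below groups characters into lexemes and emits (token-name, attribute) pairs, entering identifiers into a symbol table. The regular expressions and helper names are illustrative assumptions, not part of the text.

```python
import re

# Illustrative token patterns; identifiers become ("id", entry) tokens,
# where entry is the 1-based index of the lexeme in the symbol table.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"\d+"),
    ("op",     r"[=+*]"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    symbol_table, tokens = [], []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace is stripped, not tokenized
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme,))      # operators need no attribute value
    return tokens, symbol_table

tokens, table = tokenize("position = initial + rate * 60")
print(tokens)
print(table)
```

Running it on the assignment (1.1) reproduces the shape of token stream (1.2): identifiers come back as <id, n> with n pointing into the symbol table.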
1.2.2. Syntax Analysis

The second phase of a compiler is syntax analysis or parsing. The parser uses the tokens produced by the lexical analyzer to create a tree-like intermediate representation – called a syntax tree – that depicts the grammatical structure of the token stream. In a syntax tree, each interior node represents an operator and the children of the node represent the arguments of the operation. The syntax tree for the token stream in (1.2) is shown in Figure 1.4.
Figure 1.4: Syntax tree

This tree shows the order in which the operations in the assignment (1.1) are to be performed: multiplication is done first, followed by addition, and finally assignment.
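The tree of Figure 1.4 can be sketched with nested tuples; a postorder walk (children before operator) yields exactly the evaluation order just described. The encoding is an illustrative assumption, not from the text.

```python
# Syntax tree for position = initial + rate * 60 as (operator, left, right)
# tuples; leaves are identifiers or constants.
tree = ("=", "position", ("+", "initial", ("*", "rate", 60)))

def postorder(node, out):
    """Visit children before the operator: the order operations are performed."""
    if isinstance(node, tuple):
        op, left, right = node
        postorder(left, out)
        postorder(right, out)
        out.append(op)
    return out

print(postorder(tree, []))   # ['*', '+', '=']: multiply, add, then assign
```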
1.2.3. Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used instead.
A language specification may permit type coercion: if a binary arithmetic operator is applied to an integer and a floating-point operand, the compiler may convert, or coerce, the integer into a floating-point number.
Suppose that position, initial, and rate have been declared to be floating-point numbers, and the lexeme 60 by itself forms an integer. The semantic analyzer first converts the integer 60 to a floating-point number before applying *.

1.2.4. Intermediate Code Generation
After syntax and semantic analysis, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine.
This intermediate representation should have two properties: it should be easy to produce and it should be easy to translate into the target machine language.
One of the intermediate representations, called three-address code, consists of assembly-like instructions with a maximum of three operands per instruction (or at most one operator on the right side of an assignment). Each operand can act like a register.
The output of the intermediate code generator can consist of the three-address code sequence
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3 (1.3)
The compiler must also generate temporary names to hold the values computed by three-address instructions.
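The tree walk that emits a sequence like (1.3) can be sketched as follows; the node encoding and helper names are illustrative assumptions, not from the text. Temporaries t1, t2, … are allocated as each operator is visited.

```python
# Sketch of three-address code generation from the syntax tree of (1.1),
# after the semantic analyzer has inserted the inttofloat coercion.
def gen(node, code, counter):
    """Emit three-address code for node; return the name holding its value."""
    if not isinstance(node, tuple):
        return str(node)                      # leaf: identifier or constant
    if node[0] == "inttofloat":               # unary coercion node
        arg = gen(node[1], code, counter)
        counter[0] += 1
        t = f"t{counter[0]}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen(left, code, counter)
    r = gen(right, code, counter)
    if op == "=":
        code.append(f"{l} = {r}")             # assignment needs no temporary
        return l
    counter[0] += 1
    t = f"t{counter[0]}"
    code.append(f"{t} = {l} {op} {r}")
    return t

tree = ("=", "id1", ("+", "id2", ("*", "id3", ("inttofloat", 60))))
code = []
gen(tree, code, [0])
print("\n".join(code))
```

On this tree the sketch reproduces sequence (1.3) instruction for instruction.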
1.2.5. Code Optimization

The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code or target code that consumes less power.
For example, a straightforward algorithm generates the intermediate code (1.3), using an instruction for each operator in the tree representation that comes from the semantic analyzer.
The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform (1.3) into the shorter sequence

t1 = id3 * 60.0
id1 = id2 + t1 (1.4)
1.2.6. Code Generation

The code generator takes as input the intermediate representation of the source program and maps it into the target language. If the target language is machine language, registers or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.
A crucial aspect of code generation is the judicious assignment of registers to hold variables. For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into the machine code

LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1 (1.5)
The first operand of each instruction specifies a destination, and the F in each instruction tells us that it deals with floating-point numbers. The code in (1.5) loads the contents of address id3 into register R2, then multiplies it by the floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address id1, so the code correctly implements the assignment statement (1.1).
1.2.7. Symbol-Table Management

An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name. These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value can be used), and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (e.g., by value or by reference), and the type returned.
The symbol table is a data structure containing a record for each variable name, with fields for attributes of the name
The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly
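A minimal sketch of such a symbol table, assuming a hash-table implementation; the record field names here are illustrative, not from the text.

```python
# One record per name; the dict gives expected constant-time lookup,
# and records accumulate attributes as phases learn more about a name.
class SymbolTable:
    def __init__(self):
        self.records = {}                  # name -> record of attributes

    def enter(self, name, **attrs):
        """Create the record for name if needed, then store the attributes."""
        record = self.records.setdefault(name, {"name": name})
        record.update(attrs)
        return record

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self.records.get(name)

table = SymbolTable()
table.enter("rate", type="float", scope="global")
table.enter("rate", line_first_seen=1)     # later phases add attributes
print(table.lookup("rate"))
```

Real compilers layer scoping on top of this (one table per scope, chained), but the record-per-name idea is the same.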
1.2.8. The Grouping of Phases into Passes

The discussion of phases deals with the logical organization of a compiler. In an implementation, activities from several phases may be grouped together into a pass that reads an input file and writes an output file. For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation may be grouped into one pass. Code optimization may be an optional pass. Then there could be a back-end pass consisting of code generation for a particular target machine.
1.2.9. Compiler Construction Tools

There are tools created to help implement the various phases of a compiler. These tools use specialized languages for specifying and implementing specific components, and many use quite sophisticated algorithms. Some commonly used compiler construction tools include:
1. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming language
2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of the language
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating intermediate code
4. Code generator generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine
5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of the program to each other part. Data-flow analysis is a key part of code optimization
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler
2. Lexical Analysis

2.1. The Role of the Lexical Analyzer

The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis.
The lexical analyzer also interacts with the symbol table, e.g., when the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. The interactions are suggested in Figure 2.1.
Figure 2.1: Interactions between the lexical analyzer and the parser
The following are additional tasks performed by the lexical analyzer beyond identifying lexemes:
- Stripping out comments and whitespace (blanks, newlines, and tabs)
- Correlating error messages generated by the compiler with the source program by keeping track of line numbers (using newline characters)
- Expanding macros, in some lexical analyzers

2.2. Tokens, Patterns, and Lexemes
When discussing lexical analysis, we use three related but distinct terms:
- A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a keyword, an identifier, etc. The token names are the input symbols that the parser processes.
- A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is the sequence of characters that forms the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
In many programming languages, the following classes cover most or all of the tokens:
- One token for each keyword. The pattern for a keyword is the same as the keyword itself
- Tokens for operators, either individually or in classes such as comparison (for the lexemes <, >, <=, >=, ==, !=)
- One token representing all identifiers
[Figure 2.1 shows the lexical analyzer reading the source program and exchanging token/getNextToken requests with the parser (which passes its result on to semantic analysis), with both components consulting the Symbol Table.]
- One or more tokens representing constants, such as numbers and literal strings
- Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon

2.3. Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched
This is done by using an attribute value that describes the lexeme represented by the token. The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
Practically, a token has one attribute: a pointer to the symbol table entry in which the information about the token is kept. The symbol table entry contains various information about the token, such as the lexeme, its type, the line number in which it was first seen, etc.
are written as: <id, pointer to symbol-table entry for E> <assign_op> <id, pointer to symbol-table entry for M> <mult_op> <id, pointer to symbol-table entry for C> <exp_op> <number, integer value 2>2.4. Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error
For example, if the string fi is encountered for the first time in a C program in the context

fi (a == f(x)) . . .

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier, since fi is a valid lexeme for the token identifier.
When an error occurs, the lexical analyzer recovers by:
- Skipping (deleting) successive characters from the remaining input until the lexical analyzer can find a well-formed token (known as panic-mode recovery)
- Deleting an extraneous character from the remaining input
- Inserting a missing character into the remaining input
- Replacing an incorrect character by a correct character
- Transposing two adjacent characters

2.5. Input Buffering
Here we look at some ways that the task of reading the source program can be sped up. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
We shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded, as suggested in Figure 2.2.
E = M * C * * 2 eof
Figure 2.2: Using a pair of input buffers

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
Two pointers into the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine
2. Pointer forward scans ahead until a pattern match is found
Once the next lexeme is determined, forward is set to the character at its right end (this may involve retracting). Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

Sentinels (guards)
If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine which character was read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
E = M * eof C * * 2 eof eof

Figure 2.3: Sentinels at the end of each buffer

The sentinel (guard) is a special character that cannot be part of the source program, and a natural choice is the character eof. Note that eof retains its use as a marker for the end of the entire input: an eof that appears anywhere other than at the end of a buffer means that the input is at an end. Figure 2.3 shows the same arrangement as Figure 2.2, but with the sentinels added.
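The sentinel test can be sketched as follows; the buffer size, reload logic, and names are simplified assumptions. The point is that the scanner makes one test per character – only when the sentinel is seen does it decide between "reload the other buffer" and "end of input".

```python
# Sketch of the two-buffer sentinel scheme. EOF stands for a value that
# cannot occur in the source text; N is a tiny buffer size for illustration.
EOF = "\0"
N = 4

def make_buffers(text):
    """Split text into N-sized chunks, each terminated by a sentinel."""
    chunks = [text[i:i + N] for i in range(0, len(text), N)] or [""]
    chunks[-1] += EOF                      # the real end of input
    return [c if c.endswith(EOF) else c + EOF for c in chunks]

def scan(text):
    buffers = make_buffers(text)
    buf, pos, out = 0, 0, []
    while True:
        ch = buffers[buf][pos]
        pos += 1
        if ch == EOF:                      # single test covers both cases
            if pos == len(buffers[buf]) and buf + 1 < len(buffers):
                buf, pos = buf + 1, 0      # sentinel at buffer end: reload
                continue
            break                          # eof inside the input: we are done
        out.append(ch)
    return "".join(out)

print(scan("E = M * C ** 2"))
```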
2.6. Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are effective in specifying those types of patterns that we need for tokens. Refer to the previous course (Formal Language Theory) to recap regular expressions.
Regular Definitions

For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1→r1
d2→r2
…
dn→rn

where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, . . . , di-1}
By restricting ri to ∑ and the previously defined d's, we avoid recursive definitions, and we can construct a regular expression over ∑ for each ri. We do so by first replacing uses of d1 in r2 (which cannot use any of the d's except for d1), then replacing uses of d1 and d2 in r3 by r1 and (the substituted) r2, and so on. Finally, in rn we replace each di, for i = 1, 2, . . ., n – 1, by the substituted version of ri, each of which has only symbols of ∑.
Example: C identifiers are strings of letters, digits, and underscores. Here is a regular definition:
letter_ → A | B | . . . | Z | a | b | . . . | z | _
digit → 0 | 1 | . . . | 9
id → letter_ ( letter_ | digit )*
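This regular definition transliterates directly into, for example, a Python regular expression; the sketch below is illustrative and not part of the text, with the two character classes playing the roles of letter_ and digit.

```python
import re

# letter_ and digit as character classes; \A and \Z anchor the whole string.
letter_ = r"[A-Za-z_]"
digit = r"[0-9]"
identifier = re.compile(rf"\A{letter_}(?:{letter_}|{digit})*\Z")

for candidate in ["rate_2", "_tmp", "2rate"]:
    print(candidate, bool(identifier.match(candidate)))
```

Only the first symbol is restricted to letter_; every later position may be a letter, digit, or underscore, exactly as in the definition of id.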
Extensions of Regular Expressions

Many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers:
1. One or more instances. The unary postfix operator + represents the positive closure of a regular expression and its language. The operator + has the same precedence and associativity as the operator *. Note that r* = r+ | λ and r+ = rr* = r*r
2. Zero or one instance. The unary postfix operator ?, means “zero or one occurrence”. That is r? is equivalent to r | λ. The ? operator has the same precedence and associativity as * and +
3. Character classes. A regular expression a1 | a2 | . . . | an, where ai’s are each symbols of the alphabet, can be replaced by the shorthand [a1a2 . . . an]. More importantly, when a1, a2, . . ., an form a logical sequence, e.g., consecutive uppercase letter, lowercase letters, or digits, we can replace them by a1-an, that is, just the first and the last separated by hyphen. Thus [abc] is shorthand for a | b | c, and [a-z] is shorthand for a | b | . . . | z
The following is a regular definition for unsigned numbers in Pascal (note that it uses the extensions of regular expressions):

digit → 0 | 1 | . . . | 9
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?

2.7. Recognition of Tokens
In the previous section we learned how to express patterns using regular expressions. Now, we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called “transition diagrams”
Transition diagrams have a collection of nodes called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another, and each edge is labeled by a symbol or a set of symbols. If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a (and perhaps by other symbols as well). If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Here we assume that our transition diagram is deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. Accepting states are indicated by a double circle, and if there is an action to be taken – typically returning a token and an attribute value to the parser – we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. Any number of *s can be attached, depending on the number of positions to retract.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start”, entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read
Figure 2.4 is a transition diagram that recognizes the lexemes matching the token relop in Pascal.
Being in state 0 (the initial state), if we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1 and look at the next character. If it is =, then we recognize the lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular operator. If in state 1 the next character is >, then instead we have the lexeme <>, and we enter state 3 to return an indication that the not-equals operator has been found. On any other character, the lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to indicate that we must retract the input one position.
On the other hand, if in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5.
The remaining possibility is that the first character is >. Then, we must enter state 6 and decide, on the basis of the next character, whether the lexeme is >= (if we next see the sign =) or just > (on any other character).
Note that if, in state 0, we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
Figure 2.4 Transition diagram for relop
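The diagram of Figure 2.4 can be sketched directly as code, one branch per state. The token name relop and the attributes LE, NE, LT, EQ, GE, GT follow the text; the function shape and remaining state numbers are illustrative assumptions.

```python
# Sketch of the relop transition diagram. States that would carry a *
# (retract) simply do not consume the extra lookahead character.
def relop(text, pos):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands in for end of input
    ch = peek(pos)
    if ch == "<":                                  # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2        # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2        # state 3
        return ("relop", "LT"), pos + 1            # state 4: retract one
    if ch == "=":
        return ("relop", "EQ"), pos + 1            # state 5
    if ch == ">":                                  # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2        # state 7
        return ("relop", "GT"), pos + 1            # state 8: retract one
    return None                                    # diagram does not apply

print(relop("a<=b", 1))
```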
Similarly, Figure 2.5 shows a transition diagram for unsigned numbers in Pascal.
Figure 2.5: Transition diagram for unsigned numbers in Pascal
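The pattern that Figure 2.5's diagram recognizes is the unsigned-number definition from Section 2.6; as an illustrative sketch (not part of the text), it can also be checked with a Python regular expression.

```python
import re

# digits(.digits)?(E[+-]?digits)? with non-capturing groups; the dot
# must be escaped so it matches only a literal decimal point.
number = re.compile(r"\A[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?\Z")

for lexeme in ["60", "3.14", "6.02E23", "1E-4", ".5", "1."]:
    print(lexeme, bool(number.match(lexeme)))
```

Note that, exactly as in the transition diagram, a fraction must have digits on both sides of the point, so .5 and 1. are rejected.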
3. Syntax Analysis

By design, every programming language has precise rules that prescribe the syntactic structure of well-formed programs. The syntax of programming language constructs can be specified by context-free grammars or BNF notation (both are discussed in the previous course).
The use of CFGs has several advantages:
- a grammar helps in identifying ambiguities
- a grammar gives a precise yet easy to understand syntactic specification of a programming language
- it is possible to have a tool which automatically produces a parser from the grammar
- a properly designed grammar helps in modifying the parser easily when the language changes

3.1. The Role of a Parser
In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Figure 3.1, and verifies that the string of token names can be generated by the grammar for the source language. It is expected that the parser reports any syntax errors in an intelligible fashion and recovers from commonly occurring errors to continue processing the remainder of the program.
Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing
The methods commonly used in compilers can be classified as being either top-down or bottom-up. Top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time.
[Figure 3.1 shows the parser's position in a compiler model: source program → Lexical Analyzer → (token / getNextToken exchange) → Parser → parse tree → Rest of Front End → intermediate representation, with the Symbol Table available to all components.]

Figure 3.1: Position of parser in a compiler model

3.2. Syntax Error Handling
Many errors that occur during compilation appear to be syntactic. The error handler in a parser has goals that are simple to state but challenging to realize:
- Report the presence of errors clearly and accurately
- Recover from each error quickly enough to detect subsequent errors
- Add minimal overhead to the processing of correct programs
Fortunately, common errors are simple ones, and a relatively straightforward error-handling mechanism often suffices.
The error handler should report:
- the place in the source program where the error is detected
- the type of error (if possible)
Although no error-recovery strategy has proven itself universally acceptable, a few methods have broad applicability.
There are four main strategies in error handling:
- Panic-Mode Recovery: On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found (usually delimiters, such as }, whose role in the source program is clear and unambiguous). While panic-mode recovery often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity and is guaranteed not to go into an infinite loop.
- Phrase-Level Recovery: On discovering an error, a parser may perform local correction on the remaining input such that the parser is able to continue. A typical local correction is to replace a comma by a semicolon, insert a missing semicolon, etc. We should be careful to avoid going into an infinite loop. Phrase-level replacement has been used in several error-repairing compilers. Its major drawback is the difficulty it has in coping with situations in which the actual error occurred before the point of detection.
- Error Productions: By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.
- Global Correction: Ideally, we would like to make as few changes as possible in processing an incorrect input string. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.
3.3. FIRST and FOLLOW

The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G.
During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol
During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens
FIRST(α), for any string of grammar symbols α, is the set of terminals that begin strings derived from α. If α => λ in zero or more steps, then λ is also in FIRST(α)
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or λ can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) = {X}
2. If X is a nonterminal and X→Y1Y2 . . . Yk is a production for some k ≥ 1, then place a terminal a in FIRST(X) if, for some i, a is in FIRST(Yi) and λ is in all of FIRST(Y1), . . ., FIRST(Yi-1); that is, Y1 . . . Yi-1 => λ in zero or more steps. If λ is in FIRST(Yj) for all j = 1, 2, . . ., k, then add λ to FIRST(X)
3. If X→λ is a production, then add λ to FIRST(X)
Now, we can compute FIRST for any string X1X2 . . . Xn as follows:
Add to FIRST(X1X2 . . . Xn) all non-λ symbols of FIRST(X1). Also add the non-λ symbols of FIRST(X2) if λ is in FIRST(X1); the non-λ symbols of FIRST(X3) if λ is in both FIRST(X1) and FIRST(X2); and so on. Finally, add λ to FIRST(X1X2 . . . Xn) if, for all i, λ is in FIRST(Xi)
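As an illustration, the fixed-point computation above can be sketched in Python for the expression grammar used throughout this section (the dictionary encoding and function names are our own):

```python
# Fixed-point computation of FIRST for the Section 3.3 expression
# grammar.  LAMBDA stands for λ; a body [LAMBDA] is a λ-production.
LAMBDA = 'λ'
grammar = {            # nonterminal -> list of production bodies
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], [LAMBDA]],
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], [LAMBDA]],
    'F':  [['(', 'E', ')'], ['id']],
}
nonterminals = set(grammar)

def first_of_symbol(X, first):
    # A terminal is its own FIRST set (rule 1).
    return first[X] if X in nonterminals else {X}

def first_of_string(symbols, first):
    """FIRST of a string X1 X2 ... Xn, per the string rule above."""
    result = set()
    for X in symbols:
        fx = first_of_symbol(X, first)
        result |= fx - {LAMBDA}
        if LAMBDA not in fx:           # Xi cannot derive λ: stop here
            return result
    result.add(LAMBDA)                 # every Xi can derive λ
    return result

first = {A: set() for A in nonterminals}
changed = True
while changed:                         # iterate until no FIRST set grows
    changed = False
    for A, bodies in grammar.items():
        for body in bodies:
            symbols = [] if body == [LAMBDA] else body
            new = first_of_string(symbols, first)
            if not new <= first[A]:
                first[A] |= new
                changed = True
```

Running this yields FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E') = {+, λ}, and FIRST(T') = {*, λ}.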
FOLLOW(A), for a nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker
2. If there is a production A→αBβ, then everything in FIRST(β) except λ is in FOLLOW(B)
3. If there is a production A→αB, or a production A→αBβ where FIRST(β) contains λ, then everything in FOLLOW(A) is in FOLLOW(B)
Compute FIRST and FOLLOW of the nonterminals in the following grammar:
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → (E) | id

3.4. Top-Down Parsing
Top-down parsing can be viewed as the problem of constructing a parse tree for an input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first)
Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string
At each step of a top-down parse, the key problem is that of determining the production to be applied for a nonterminal, say A. Once an A-production is chosen, the rest of parsing consists of "matching" the terminal symbols in the production body with the input string
3.4.1. Recursive-Descent Parsing
This method of top-down parsing can be considered as an attempt to find the leftmost derivation for an input string
It may involve backtracking. To construct the parse tree using recursive-descent parsing:
1. We create a one-node tree consisting of S
2. Two pointers, one for the tree and one for the input, will be used to indicate where the parsing process is. Initially, they will be on S and the first input symbol, respectively
3. Then we use the first S-production to expand the tree. The tree pointer will be positioned on the leftmost symbol of the newly created sub-tree
4. When the symbol pointed to by the tree pointer matches the symbol pointed to by the input pointer, both pointers are moved to the right
5. Whenever the tree pointer points to a non-terminal, we expand it using the first production of that non-terminal
6. Whenever the pointers point to different terminals, the production that was used is not correct, so another production should be used: we go back to the step just before we replaced the non-terminal and use another production
7. If we reach the end of the input and the tree pointer passes the last symbol of the tree, we have finished parsing
Example: Draw a parse tree for the input string cad using the following grammar:
S → cAd
A → ab | a
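One possible, deliberately simplified sketch of this backtracking parse: each function tries the productions in order and falls back to the next one on failure. (A general backtracking parser must also re-try earlier choices when a later match fails; this grammar does not need that, so the sketch keeps the fallback local to A.)

```python
# Backtracking recursive-descent recognition of the grammar
#   S -> cAd      A -> ab | a
# Each function returns the input position reached, or None on failure;
# trying A -> ab first and then falling back to A -> a is the
# "go back and use another production" step for this grammar.
def parse_A(s, pos):
    if s[pos:pos + 2] == 'ab':         # first A-production: A -> ab
        return pos + 2
    if s[pos:pos + 1] == 'a':          # backtrack: try A -> a instead
        return pos + 1
    return None

def parse_S(s, pos=0):
    if s[pos:pos + 1] != 'c':          # match the leading c
        return None
    after_A = parse_A(s, pos + 1)      # expand A
    if after_A is None or s[after_A:after_A + 1] != 'd':
        return None                    # match the trailing d
    return after_A + 1

def parses(s):
    return parse_S(s) == len(s)        # all input must be consumed
```

On the input cad, the attempt A → ab fails (the input has "ad"), so the parser backs up and succeeds with A → a, exactly as in the worked example.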
3.4.2. Nonrecursive Predictive Parsing
A non-recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls
This method, shown in Figure 3.2, uses a parsing table that determines the next production to be applied
The input buffer contains the string to be parsed followed by $ (the right end marker)
The stack contains a sequence of grammar symbols with $ at the bottom. Initially, the stack contains the start symbol of the grammar on top of $
The parsing table is a two-dimensional array M[X, a] where X is a nonterminal of the grammar and a is a terminal or $
Figure 3.2: Model of a table-driven predictive parser
The parser program behaves as follows. The program always considers X, the symbol on top of the stack, and a, the current input symbol. There are three possibilities:
1. X = a = $: the parser halts and announces a successful completion of parsing
2. X = a ≠ $: the parser pops X off the stack and advances the input pointer to the next symbol
3. X is a nonterminal: the program consults entry M[X, a], which can be an X-production or an error entry. If M[X, a] = {X→Y1Y2 . . . Yk}, X on top of the stack will be replaced by Y1Y2 . . . Yk (Y1 at the top of the stack). As an output, any code associated with the X-production can be executed. If M[X, a] = error, the parser calls the error recovery method
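The three cases above can be sketched as a small table-driven parser for the Section 3.3 grammar (a minimal sketch; Table 3.1 is hard-coded here, and an empty body encodes a λ-production):

```python
# Table-driven predictive parser for the Section 3.3 grammar.
# M is Table 3.1; missing (X, a) pairs are the error entries.
M = {
    ('E', 'id'): ['T', "E'"],      ('E', '('): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
    ('T', 'id'): ['F', "T'"],      ('T', '('): ['F', "T'"],
    ("T'", '+'): [], ("T'", '*'): ['*', 'F', "T'"],
    ("T'", ')'): [], ("T'", '$'): [],
    ('F', 'id'): ['id'],           ('F', '('): ['(', 'E', ')'],
}
nonterminals = {'E', "E'", 'T', "T'", 'F'}

def parse(tokens):
    stack = ['$', 'E']                 # start symbol on top of $
    tokens = tokens + ['$']
    i, output = 0, []
    while True:
        X, a = stack[-1], tokens[i]
        if X == a == '$':
            return output              # case 1: successful completion
        if X == a:                     # case 2: pop and advance input
            stack.pop(); i += 1
        elif X in nonterminals and (X, a) in M:
            stack.pop()                # case 3: replace X by the body
            body = M[(X, a)]
            stack.extend(reversed(body))   # leftmost body symbol on top
            output.append((X, body))   # "output" the production used
        else:
            raise SyntaxError(f'error at token {a!r}')
```

On id+id*id this returns the eleven productions of the leftmost derivation, matching the trace shown in Figure 3.3 below.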
Construction of a predictive parsing table
Follow these steps to construct the parsing table. For each production A→α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A→α to M[A, a]
2. If λ is in FIRST(α), then for each terminal b in FOLLOW(A), add A→α to M[A, b]. If λ is in FIRST(α) and $ is in FOLLOW(A), add A→α to M[A, $] as well
If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to error (which we normally represent by an empty entry in the table)
Table 3.1 is a parsing table for the grammar shown in Section 3.3 constructed using the above algorithm
Table 3.1: Parsing table M for the grammar in Section 3.3
             Input Symbol
Nonterminal  id      +         *         (      )      $
E            E→TE'                       E→TE'
E'                   E'→+TE'                   E'→λ   E'→λ
T            T→FT'                       T→FT'
T'                   T'→λ      T'→*FT'         T'→λ   T'→λ
F            F→id                        F→(E)
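The two construction rules above can be sketched as follows, with the FIRST and FOLLOW sets of the Section 3.3 grammar hard-coded (a minimal sketch; names are illustrative):

```python
# Building Table 3.1 from rules 1 and 2 above.  FIRST/FOLLOW are the
# sets computed for the Section 3.3 grammar; λ marks the empty string.
LAMBDA = 'λ'
productions = [
    ('E', ['T', "E'"]),
    ("E'", ['+', 'T', "E'"]), ("E'", [LAMBDA]),
    ('T', ['F', "T'"]),
    ("T'", ['*', 'F', "T'"]), ("T'", [LAMBDA]),
    ('F', ['(', 'E', ')']), ('F', ['id']),
]
FIRST = {'E': {'(', 'id'}, "E'": {'+', LAMBDA}, 'T': {'(', 'id'},
         "T'": {'*', LAMBDA}, 'F': {'(', 'id'},
         '+': {'+'}, '*': {'*'}, '(': {'('}, ')': {')'},
         'id': {'id'}, LAMBDA: {LAMBDA}}
FOLLOW = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
          "T'": {'+', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def first_of_string(symbols):
    result = set()
    for X in symbols:
        result |= FIRST[X] - {LAMBDA}
        if LAMBDA not in FIRST[X]:
            return result
    result.add(LAMBDA)
    return result

M = {}
for A, alpha in productions:
    fa = first_of_string(alpha)
    lookaheads = fa - {LAMBDA}             # rule 1: terminals in FIRST(α)
    if LAMBDA in fa:                       # rule 2: also FOLLOW(A)
        lookaheads |= FOLLOW[A]            # FOLLOW already contains $
    for a in lookaheads:
        assert (A, a) not in M, 'grammar would not be LL(1)'
        M[(A, a)] = alpha
```

The 13 entries produced are exactly the non-blank entries of Table 3.1; every remaining (nonterminal, terminal) pair is an error entry.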
Figure 3.3 shows moves made by a predictive parser on input id+id*id
Matched      Stack      Input        Action
             E$         id+id*id $
             TE'$       id+id*id $   output E→TE'
             FT'E'$     id+id*id $   output T→FT'
             idT'E'$    id+id*id $   output F→id
id           T'E'$      +id*id $     match id
id           E'$        +id*id $     output T'→λ
id           +TE'$      +id*id $     output E'→+TE'
id+          TE'$       id*id $      match +
id+          FT'E'$     id*id $      output T→FT'
id+          idT'E'$    id*id $      output F→id
id+id        T'E'$      *id $        match id
id+id        *FT'E'$    *id $        output T'→*FT'
id+id*       FT'E'$     id $         match *
id+id*       idT'E'$    id $         output F→id
id+id*id     T'E'$      $            match id
id+id*id     E'$        $            output T'→λ
id+id*id     $          $            output E'→λ
Figure 3.3: Moves made by a predictive parser on input id+id*id

3.4.3. LL(1) Grammars
Predictive parsers, that is, recursive descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1)
The first “L” in LL(1) stands for scanning the input from left to right, the second “L” for producing the leftmost derivation, and “1” for using one input symbol of lookahead at each step to make parsing action decisions
The class of LL(1) grammars is rich enough to cover most programming constructs, although care is needed in writing a suitable grammar for the source language
For example, no left-recursive or ambiguous grammar can be LL(1)
A grammar G is LL(1) if and only if whenever A→α | β are two distinct productions of G, the following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a
2. At most one of α and β can derive the empty string
3. If β => λ in zero or more steps, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α => λ in zero or more steps, then β does not derive any string beginning with a terminal in FOLLOW(A)
The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are disjoint sets
The third condition is equivalent to stating that if λ is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if λ is in FIRST(α)
Note: an LL(1) grammar is a grammar for which the parsing table has no multiply-defined entries
Example: Let a grammar G be given by:
S → iEtSS' | a
S' → eS | λ
E → b
Is G an LL(1) grammar?
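One way to answer is to apply the table-construction rules and look for a multiply-defined entry. A sketch (FIRST and FOLLOW are computed by hand; since every body of this grammar starts with a terminal or is λ, a one-symbol FIRST lookup suffices):

```python
# LL(1) check for  S -> iEtSS' | a,  S' -> eS | λ,  E -> b.
# A conflict (multiply-defined entry) means the grammar is not LL(1).
LAMBDA = 'λ'
productions = [
    ('S', ['i', 'E', 't', 'S', "S'"]), ('S', ['a']),
    ("S'", ['e', 'S']), ("S'", [LAMBDA]),
    ('E', ['b']),
]
# Hand-computed sets for this grammar:
FIRST = {'S': {'i', 'a'}, "S'": {'e', LAMBDA}, 'E': {'b'}}
FOLLOW = {'S': {'e', '$'}, "S'": {'e', '$'}, 'E': {'t'}}
terminals_of = lambda X: FIRST.get(X, {X})   # a terminal is its own FIRST

conflicts, table = [], {}
for A, alpha in productions:
    fa = set(terminals_of(alpha[0]))         # bodies start with a terminal or λ
    lookaheads = (fa - {LAMBDA}) | (FOLLOW[A] if LAMBDA in fa else set())
    for a in lookaheads:
        if (A, a) in table:
            conflicts.append((A, a))         # multiply-defined entry
        else:
            table[(A, a)] = alpha
```

The single conflict, at M[S', e], is the dangling-else problem: on lookahead e both S'→eS and S'→λ apply (e is in FIRST(eS) and in FOLLOW(S')), so condition 3 fails and G is not LL(1).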
3.4.4. Error Recovery in Predictive Parsing
An error is detected in predictive parsing when the terminal on top of the stack does not match the next input symbol, or when there is a nonterminal A on top of the stack, a is the next input symbol, and M[A, a] = error
Panic Mode
Panic-mode recovery is based on the idea of skipping over symbols on the input until a token in a selected set of synchronizing tokens appears
Its effectiveness depends on the choice of synchronizing set
The sets should be chosen so that the parser recovers quickly from the errors that are likely to occur in practice
Some heuristics are as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue
2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if semicolons terminate statements, then keywords that begin statements may not appear in the FOLLOW set of the nonterminal generating expressions. A missing semicolon after an assignment may therefore result in the keyword beginning the next statement being skipped. Often, there is a hierarchical structure on constructs in a language; e.g., expressions appear within statements, which appear within blocks, and so on. We can add to the synchronizing set of a lower-level construct the symbols that begin higher level constructs. For example, we might add keywords that begin statements to synchronizing sets for the non-terminals generating expressions
3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input
4. If nonterminals can generate the empty string, then the production deriving λ can be used as a default. Doing so may postpone some error detection, but cannot cause an error to be missed. This approach reduces the number of nonterminals that have to be considered during error recovery
5. If the terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue a message that the terminal was inserted, and continue parsing. In effect, this approach takes the synchronizing set of a token to consist of all other tokens
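Heuristics 1 and 5 can be grafted onto the table-driven parser as a sketch (the table and the FOLLOW-based synchronizing sets are those of the Section 3.3 grammar; the error-message wording is our own):

```python
# Predictive parsing with panic-mode recovery, Section 3.3 grammar.
# SYNC is FOLLOW(A) for each nonterminal A (heuristic 1).
M = {('E', 'id'): ['T', "E'"], ('E', '('): ['T', "E'"],
     ("E'", '+'): ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
     ('T', 'id'): ['F', "T'"], ('T', '('): ['F', "T'"],
     ("T'", '+'): [], ("T'", '*'): ['*', 'F', "T'"],
     ("T'", ')'): [], ("T'", '$'): [],
     ('F', 'id'): ['id'], ('F', '('): ['(', 'E', ')']}
SYNC = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
        "T'": {'+', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def parse_with_recovery(tokens):
    stack, tokens, i, errors = ['$', 'E'], tokens + ['$'], 0, []
    while stack[-1] != '$':
        X, a = stack[-1], tokens[i]
        if X not in SYNC:                     # X is a terminal
            if X == a:
                stack.pop(); i += 1
            else:                             # heuristic 5: pop it and
                errors.append(f'inserted missing {X!r}')
                stack.pop()                   # report it as "inserted"
        elif (X, a) in M:
            stack.pop()
            stack.extend(reversed(M[(X, a)]))
        else:                                 # heuristic 1: skip input
            errors.append(f'skipping to a token that can follow {X}')
            while tokens[i] not in SYNC[X]:   # $ is in every SYNC set,
                i += 1                        # so this always terminates
            stack.pop()
    return errors                             # a full version would also
                                              # check all input was used
```

On a correct input the recovery branches are never taken; on id + + id the parser reports one error, pops the offending T, and finishes the rest of the input normally.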
Phrase-Level Recovery
Phrase-level error recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines
These routines may change, insert, or delete symbols on the input and issue appropriate error messages
They may also pop from the stack
3.5. Bottom-Up Parsing
Bottom-up parsing corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top)
We will see a general style of bottom-up parsing called shift-reduce parsing
3.5.1. Reductions
We can think of the shift-reduce parsing process as one of "reducing" a string w to the start symbol of the grammar
At each reduction step, a particular substring matching the right side of a production (the body) is replaced by the symbol on the left of that production (the head), and if the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse
The key decisions in bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds
Example: Consider the grammar
S → aABe
A → Abc | b
B → d
The sentence abbcde can be reduced to S by the following steps: abbcde, aAbcde, aAde, aABe, S
3.5.2. Handle Pruning
Bottom-up parsing during a left-to-right scan of the input constructs a rightmost derivation in reverse
Informally, a "handle" is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation
Example: Consider the following grammar
E → E+T | T
T → T*F | F
F → (E) | id
The following steps show the reductions of id*id to E: id*id, F*id, T*id, T*F, T, E
Adding subscripts to the tokens id for clarity, the handles during the parse of id1*id2 are shown in Figure 3.4
Although T is the body of the production E→T, the symbol T is not a handle in the sentential form T*id2
If T were indeed replaced by E, we would get E*id2, which cannot be derived from the start symbol E. Thus, the leftmost substring that matches the body of some production need not be a handle
Right Sentential Form   Handle   Reducing Production
id1*id2                 id1      F→id
F*id2                   F        T→F
T*id2                   id2      F→id
T*F                     T*F      T→T*F
T                       T        E→T
Figure 3.4: Handles during a parse of id1*id2
Formally, if S =>*rm αAw =>rm αβw, then the production A→β in the position following α is a handle of αβw
Alternatively, a handle of a right sentential form γ is a production A→β and a position of γ where the string β may be found, such that replacing β at that position by A produces the previous right sentential form in the rightmost derivation of γ
Notice that the string w to the right of the handle must contain only terminal symbols
For convenience, we refer to the body β rather than A→β as a handle
We say "a handle" rather than "the handle" because the grammar can be ambiguous
A rightmost derivation in reverse can be obtained by "handle pruning"
3.5.3. Shift-Reduce Parsing
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed
As we shall see, the handle always appears at the top of the stack just before it is identified as the handle
$ is used to mark the bottom of the stack and also the right end of the input
Initially, the stack is empty and the string w is on the input, as follows:
Stack    Input
$        w$
During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack
It then reduces β to the head of the appropriate production
The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty:
Stack    Input
$E       $
Upon entering this configuration, the parser halts and announces successful completion of parsing
Figure 3.5 steps through the actions a shift-reduce parser might take in parsing the input string id1*id2
Stack      Input        Action
$          id1*id2 $    shift
$ id1      *id2 $       reduce by F→id
$ F        *id2 $       reduce by T→F
$ T        *id2 $       shift
$ T*       id2 $        shift
$ T*id2    $            reduce by F→id
$ T*F      $            reduce by T→T*F
$ T        $            reduce by E→T
$ E        $            accept
Figure 3.5: Configuration of a shift-reduce parser on input id1*id2
There are four possible actions a shift-reduce parser can make: shift, reduce, accept, and error
1. Shift. Shift the next input symbol onto the top of the stack
2. Reduce. The parser knows the right end of the handle is at the top of the stack. It should then decide what non-terminal should replace that substring
3. Accept. Announce successful completion of parsing
4. Error. Discover a syntax error and call an error recovery routine
3.5.4. Conflicts during Shift-Reduce Parsing
There are CFGs for which shift-reduce parsing cannot be used
Every shift-reduce parser for such a grammar can reach a configuration in which the parser, knowing the entire stack and also the next k input symbols, cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make (a reduce/reduce conflict)
Technically, these grammars are not in the LR(k) class of grammars and are referred to as non-LR
The k in LR(k) refers to the number of symbols of lookahead on the input
3.5.5. Introduction to LR Parsing
The most prevalent type of bottom-up parsers today is based on a concept called LR(k) parsing; the “L” is for left-to-right scanning of the input, the “R” for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead that are used in making parsing decisions
The cases k = 0 or k = 1 are of practical interest, and we shall only consider LR parsers with k ≤ 1 here. When (k) is omitted, k is assumed to be 1
This section introduces the basic concepts of LR parsing and the easiest method for constructing shift-reduce parsers, called "simple LR" (or SLR for short)
Later, two more complex methods – canonical LR and LALR – that are used in the majority of LR parsers are introduced
The LR-Parsing Algorithm
A schematic representation of an LR parser is shown in Figure 3.6. It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto)
The driver program is the same for all LR parsers; only the parsing table changes from one parser to another
The parsing program reads characters from an input buffer one at a time
The program uses a stack to store a string of the form s0X1s1X2 . . . Xmsm, where sm is on top
Each Xi is a grammar symbol and each si is a symbol called a state
Each state symbol summarizes the information contained in the stack below it, and the combination of the state symbol on top of the stack and the current input symbol are used to index the parsing table and determine the shift-reduce parsing decision
Figure 3.6: Model of an LR parser
The parsing table consists of two parts: a parsing-action function action and a goto function goto
The program driving the LR parser behaves as follows: it determines sm, the state currently on top of the stack, and ai, the current input symbol
It then consults action[sm, ai], the parsing action table entry for state sm and input ai, which can have one of four values:
1. shift s, where s is a state
2. reduce by a grammar production A→β
3. accept, and
4. error
The function goto takes a state and grammar symbol as arguments and produces a state
A configuration of an LR parser is a pair whose first component is the stack contents and whose second component is the unexpended input:
(s0X1s1X2s2 . . . Xmsm, aiai+1 . . . an$)
This configuration represents the right-sentential form X1X2 . . . Xm aiai+1 . . . an in essentially the same way as a shift-reduce parser would; only the presence of states on the stack is new
The next move of the parser is determined by reading ai, the current input symbol, and sm, the state on top of the stack, and then consulting the parsing action table entry action[sm, ai]
The configurations resulting after each of the four types of move are as follows:
1. If action[sm, ai] = shift s, the parser executes a shift move, entering the configuration
(s0X1s1X2s2 . . . Xmsmais, ai+1 . . . an$)
Here the parser has shifted both the current input symbol ai and the next state s, which is given in action[sm, ai], onto the stack; ai+1 becomes the current input symbol
2. If action[sm, ai] = reduce A→β, then the parser executes a reduce move, entering the configuration
(s0X1s1X2s2 . . . Xm-rsm-rAs, aiai+1 . . . an$)
where s = goto[sm-r, A] and r is the length of β, the right side of the production
Here the parser first popped 2r symbols off the stack (r state symbols and r grammar symbols), exposing state sm-r
The parser then pushed both A, the left side of the production, and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is not changed
3. If action[sm, ai] = accept, parsing is completed
4. If action[sm, ai] = error, the parser has discovered an error and calls an error recovery routine
        action                              goto
State   id    +    *    (    )    $         E    T    F
0       s5              s4                  1    2    3
1             s6                  acc
2             r2   s7        r2   r2
3             r4   r4        r4   r4
4       s5              s4                  8    2    3
5             r6   r6        r6   r6
6       s5              s4                       9    3
7       s5              s4                            10
8             s6             s11
9             r1   s7        r1   r1
10            r3   r3        r3   r3
11            r5   r5        r5   r5
Figure 3.7: Parsing table for the expression grammar
1. si means shift and stack state i
2. rj means reduce by the production numbered j
3. acc means accept
4. blank means error
Figure 3.7 shows the parsing action and goto functions of an LR parsing table for arithmetic expressions with binary operators + and *:
1. E→E+T
2. E→T
3. T→T*F
4. T→F
5. F→(E)
6. F→id
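The driver loop can be sketched with the Figure 3.7 table hard-coded (a minimal sketch; tracking only the states on the stack is enough, since the table is never indexed by the grammar symbols Xi):

```python
# LR driver for the expression grammar, table of Figure 3.7.
# 's5' = shift state 5, 'r3' = reduce by production 3, 'acc' = accept.
PRODS = {1: ('E', 3), 2: ('E', 1), 3: ('T', 3),   # head, body length r
         4: ('T', 1), 5: ('F', 3), 6: ('F', 1)}
ACTION = {
    (0, 'id'): 's5', (0, '('): 's4',
    (1, '+'): 's6', (1, '$'): 'acc',
    (2, '+'): 'r2', (2, '*'): 's7', (2, ')'): 'r2', (2, '$'): 'r2',
    (3, '+'): 'r4', (3, '*'): 'r4', (3, ')'): 'r4', (3, '$'): 'r4',
    (4, 'id'): 's5', (4, '('): 's4',
    (5, '+'): 'r6', (5, '*'): 'r6', (5, ')'): 'r6', (5, '$'): 'r6',
    (6, 'id'): 's5', (6, '('): 's4',
    (7, 'id'): 's5', (7, '('): 's4',
    (8, '+'): 's6', (8, ')'): 's11',
    (9, '+'): 'r1', (9, '*'): 's7', (9, ')'): 'r1', (9, '$'): 'r1',
    (10, '+'): 'r3', (10, '*'): 'r3', (10, ')'): 'r3', (10, '$'): 'r3',
    (11, '+'): 'r5', (11, '*'): 'r5', (11, ')'): 'r5', (11, '$'): 'r5',
}
GOTO = {(0, 'E'): 1, (0, 'T'): 2, (0, 'F'): 3,
        (4, 'E'): 8, (4, 'T'): 2, (4, 'F'): 3,
        (6, 'T'): 9, (6, 'F'): 3, (7, 'F'): 10}

def lr_parse(tokens):
    tokens = tokens + ['$']
    stack, i, output = [0], 0, []        # stack of states only
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            raise SyntaxError(f'error at {tokens[i]!r}')
        if act == 'acc':
            return output                # productions, in reduction order
        if act[0] == 's':                # shift: push state, advance input
            stack.append(int(act[1:])); i += 1
        else:                            # reduce: pop r states, then goto
            n = int(act[1:])
            head, r = PRODS[n]
            del stack[len(stack) - r:]
            stack.append(GOTO[(stack[-1], head)])
            output.append(n)
```

On id*id+id the sketch emits the reduction sequence 6, 4, 6, 3, 2, 6, 4, 1, i.e. exactly the reduce moves of Figure 3.8 below.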
On input id*id+id, the sequence of stack and input contents is shown in Figure 3.8
     Stack               Input        Action
(1)  0                   id*id+id $   shift
(2)  0 id 5              *id+id $     reduce by F→id
(3)  0 F 3               *id+id $     reduce by T→F
(4)  0 T 2               *id+id $     shift
(5)  0 T 2 * 7           id+id $      shift
(6)  0 T 2 * 7 id 5      +id $        reduce by F→id
(7)  0 T 2 * 7 F 10      +id $        reduce by T→T*F
(8)  0 T 2               +id $        reduce by E→T
(9)  0 E 1               +id $        shift
(10) 0 E 1 + 6           id $         shift
(11) 0 E 1 + 6 id 5      $            reduce by F→id
(12) 0 E 1 + 6 F 3       $            reduce by T→F
(13) 0 E 1 + 6 T 9       $            reduce by E→E+T
(14) 0 E 1               $            accept
Figure 3.8: Moves of an LR parser on id*id+id
Constructing SLR Parsing Tables
This method is the simplest of the three methods used to construct an LR parsing table
It is called SLR (simple LR) because it is the easiest to implement
However, it is also the weakest in terms of the number of grammars for which it succeeds
A parsing table constructed by this method is called an SLR table
A grammar for which an SLR table can be constructed is said to be an SLR grammar
LR(0) Item
An LR(0) item (item for short) is a production of a grammar G with a dot at some position of the right side
For example, the production A→XYZ yields the four items:
A→.XYZ
A→X.YZ
A→XY.Z
A→XYZ.
For the production A→λ we have only one item:
A→.
An item indicates what part of a production we have seen and what we hope to see
The central idea in the SLR method is to construct, from the grammar, a deterministic finite automaton to recognize viable prefixes
A viable prefix is a prefix of a right sentential form that can appear on the stack of a shift-reduce parser
If you have a viable prefix on the stack, it is possible that further input will reduce to the start symbol
If you don't have a viable prefix on top of the stack, you can never reach the start symbol; therefore you have to call the error recovery procedure
One collection of sets of LR(0) items, which we call the canonical LR(0) collection, provides the basis for constructing SLR parsers
To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions: closure and goto
If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and production S'→S
The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input; i.e., acceptance occurs when and only when the parser is about to reduce by S'→S
The Closure Operation
If I is a set of items of G, then closure(I) is the set of items constructed by two rules:
1. Initially, every item in I is added to closure(I)
2. If A→α.Bβ is in closure(I) and B→γ is a production, then add B→.γ to closure(I). This rule is applied until no more new items can be added to closure(I)
Intuitively, A→α.Bβ in closure(I) indicates that, at some point in the parsing process, we think we might next see a substring derivable from Bβ as input
If B→γ is a production, we also expect we might see a substring derivable from γ at this point
For this reason we also include B→.γ in closure(I)
Example: Consider the augmented expression grammar (G1'):
E'→E
E→E+T
E→T
T→T*F
T→F
F→(E)
F→id
If I is the set of one item {[E'→.E]}, then closure(I) contains the items:
E'→.E
E→.E+T
E→.T
T→.T*F
T→.F
F→.(E)
F→.id
The Goto Operation
The second useful function is goto(I, X), where I is a set of items and X is a grammar symbol
goto(I, X) is defined as the closure of the set of all items [A→αX.β] such that [A→α.Xβ] is in I
Intuitively, if I is the set of items that are valid for some viable prefix γ, then goto(I, X) is the set of items that are valid for the viable prefix γX
Example: If I is the set of two items {[E'→E.], [E→E.+T]}, then goto(I, +) consists of:
E→E+.T
T→.T*F
T→.F
F→.(E)
F→.id
The Sets-of-Items Construction
Below is the algorithm to construct C, the canonical collection of sets of LR(0) items for an augmented grammar G'

procedure items(G');
begin
  C := {closure({[S'→.S]})};
  repeat
    for each set of items I in C and each grammar symbol X
        such that goto(I, X) is not empty and not in C do
      add goto(I, X) to C
  until no more sets of items can be added to C
end

Example: Construction of the sets of items for the augmented grammar G1' above
I0 = {[E'→.E], [E→.E+T], [E→.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]}
I1 = goto(I0, E) = {[E'→E.], [E→E.+T]}
I2 = goto(I0, T) = {[E→T.], [T→T.*F]}
I3 = goto(I0, F) = {[T→F.]}
I4 = goto(I0, () = {[F→(.E)], [E→.E+T], [E→.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]}
I5 = goto(I0, id) = {[F→id.]}
I6 = goto(I1, +) = {[E→E+.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]}
I7 = goto(I2, *) = {[T→T*.F], [F→.(E)], [F→.id]}
I8 = goto(I4, E) = {[F→(E.)], [E→E.+T]}
goto(I4, () = I4; goto(I4, id) = I5
I9 = goto(I6, T) = {[E→E+T.], [T→T.*F]}
goto(I6, F) = I3; goto(I6, () = I4; goto(I6, id) = I5
I10 = goto(I7, F) = {[T→T*F.]}
goto(I7, () = I4; goto(I7, id) = I5
I11 = goto(I8, )) = {[F→(E).]}
goto(I8, +) = I6; goto(I9, *) = I7
Valid Items
We say an item A→β1.β2 is valid for a viable prefix αβ1 if there is a derivation S' =>*rm αAw =>rm αβ1β2w
If A→β1.β2 is valid, it tells us whether to shift or reduce when αβ1 is on top of the stack
Theorem: The set of valid items embodies all the useful information that can be gleaned from the stack
SLR Parsing Tables
Below is the method used to construct the SLR table
This method is the weakest, in terms of the number of grammars for which it succeeds, of the three methods we are going to see, but it is also the simplest to implement
SLR Table Construction Method
1. Construct C = {I0, I1, . . ., In}, the collection of the sets of LR(0) items for G'
2. State i is constructed from Ii and
   a. If [A→α.aβ] is in Ii and goto(Ii, a) = Ij (a is a terminal), then action[i, a] = shift j
   b. If [A→α.] is in Ii, then action[i, a] = reduce A→α for all a in FOLLOW(A), for A≠S'
   c. If [S'→S.] is in Ii, then action[i, $] = accept
If no conflicting action is created by step 2, the grammar is SLR(1); otherwise it is not
3. For all non-terminals A, if goto(Ii, A) = Ij, then goto[i, A] = j
4. All entries of the parsing table not defined by 2 and 3 are made error
5. The initial state is the one constructed from the set of items containing [S'→.S]
Example: Construct the SLR parsing table for the grammar G1'
FOLLOW(E) = {+, ), $}
FOLLOW(T) = {+, *, ), $}
FOLLOW(F) = {+, *, ), $}
By following the aforementioned method we find the parsing table used earlier
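As a cross-check, the whole construction — closure, goto, the canonical collection, and the table rules — can be sketched end-to-end for G1'. This is a minimal sketch with our own encoding; state numbering comes out arbitrary, so it only asserts structural properties (12 sets of items, no conflicting actions):

```python
# End-to-end SLR construction for G1'.  An item is (head, body, dot);
# frozensets make sets of items hashable.  FOLLOW is as listed above.
AUG = [("E'", ('E',)), ('E', ('E', '+', 'T')), ('E', ('T',)),
       ('T', ('T', '*', 'F')), ('T', ('F',)),
       ('F', ('(', 'E', ')')), ('F', ('id',))]
NONTERMS = {"E'", 'E', 'T', 'F'}
FOLLOW = {'E': {'+', ')', '$'}, 'T': {'+', '*', ')', '$'},
          'F': {'+', '*', ')', '$'}}

def closure(items):
    items = set(items)
    while True:                           # rule 2 of the closure operation
        new = {(h, b, 0) for (A, body, d) in items
               if d < len(body) and body[d] in NONTERMS
               for (h, b) in AUG if h == body[d]}
        if new <= items:
            return frozenset(items)
        items |= new

def goto(items, X):                       # move the dot past X, then close
    return closure({(A, b, d + 1) for (A, b, d) in items
                    if d < len(b) and b[d] == X})

symbols = NONTERMS | {'+', '*', '(', ')', 'id'}
I0 = closure({("E'", ('E',), 0)})
C = {I0}
while True:                               # the sets-of-items construction
    new = {goto(I, X) for I in C for X in symbols} - {frozenset()} - C
    if not new:
        break
    C |= new

states = {I: i for i, I in enumerate(sorted(C, key=sorted))}
action, goto_tab = {}, {}
for I, i in states.items():
    for (A, b, d) in I:
        if d < len(b):                    # rule 2a: shift on a terminal
            if b[d] not in NONTERMS:
                action.setdefault((i, b[d]), set()).add(
                    ('shift', states[goto(I, b[d])]))
        elif A == "E'":                   # rule 2c: accept on $
            action.setdefault((i, '$'), set()).add(('accept',))
        else:                             # rule 2b: reduce on FOLLOW(A)
            for a in FOLLOW[A]:
                action.setdefault((i, a), set()).add(('reduce', A, b))
    for X in NONTERMS:                    # rule 3: the goto table
        J = goto(I, X)
        if J:
            goto_tab[(i, X)] = states[J]

conflict_free = all(len(v) == 1 for v in action.values())
```

Every action cell ends up with exactly one entry, so G1' is SLR(1); up to renumbering of states, the tables agree with Figure 3.7.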
Example: Construct the SLR parsing table for the following grammar (G2')
S'→S
S→L=R
S→R
L→*R
L→id
R→L
C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9}
I0 = {[S'→.S], [S→.L=R], [S→.R], [L→.*R], [L→.id], [R→.L]}
I1 = goto(I0, S) = {[S'→S.]}
I2 = goto(I0, L) = {[S→L.=R], [R→L.]}
I3 = goto(I0, R) = {[S→R.]}
I4 = goto(I0, *) = {[L→*.R], [L→.*R], [L→.id], [R→.L]}
I5 = goto(I0, id) = {[L→id.]}
I6 = goto(I2, =) = {[S→L=.R], [R→.L], [L→.*R], [L→.id]}
I7 = goto(I4, R) = {[L→*R.]}
I8 = goto(I4, L) = {[R→L.]}
goto(I4, *) = I4; goto(I4, id) = I5
I9 = goto(I6, R) = {[S→L=R.]}
goto(I6, L) = I8; goto(I6, *) = I4; goto(I6, id) = I5
FOLLOW(S) = {$}, FOLLOW(R) = {$, =}, and FOLLOW(L) = {$, =}
We have a shift/reduce conflict since = is in FOLLOW(R), R→L. is in I2, and goto(I2, =) = I6
G2' is not an ambiguous grammar. However, it is not SLR
This is because the SLR parser is not powerful enough to remember enough left context to decide whether to shift or reduce when it sees an =
The next two methods are capable of solving this problem
But there still are some grammars that cannot be parsed by any of the methods seen here
Canonical LR Parsing
Let us analyze why we had a shift/reduce conflict in the previous example
With the SLR method, state i calls for reduction by A→α if the set of items Ii contains the item [A→α.] and a is in FOLLOW(A)
In the example, R→L. is in I2 and = is in FOLLOW(R)
However, in some situations, when state i appears on top of the stack, the viable prefix βα on the stack is such that βA cannot be followed by a in a right sentential form
Thus, the reduction by A→α would be invalid on input a
In the previous example, reducing would be an error
In fact, the states produced using LR(0) items do not hold enough information
It is possible to hold more information in the state to rule out some of these invalid reductions. By splitting states when necessary, we can indicate exactly which symbols can follow a handle
An LR(1) item has the form [A→α.β, a], where a is a terminal or $
The functions closure(I), goto(I, X) and items(G') will be slightly different compared to those used for the construction of an SLR parsing table
The Closure Operation
Let I be a set of LR(1) items. closure(I) is found using the following algorithm:
closure(I) := I
repeat
  for each item [A→α.Bβ, a] in closure(I), each production B→γ in G',
      and each terminal b in FIRST(βa) such that [B→.γ, b] is not in closure(I) do
    add [B→.γ, b] to closure(I)
until no more items can be added to closure(I)
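A sketch of this closure for grammar G2' (a minimal illustration with our own encoding; since G2' has no λ-productions, FIRST(βa) is just the FIRST of the first symbol of βa, which is hard-coded here):

```python
# LR(1) closure for G2'.  An item is (head, body, dot, lookahead).
G2 = [("S'", ('S',)), ('S', ('L', '=', 'R')), ('S', ('R',)),
      ('L', ('*', 'R')), ('L', ('id',)), ('R', ('L',))]
NONTERMS = {"S'", 'S', 'L', 'R'}
FIRST = {'S': {'*', 'id'}, 'L': {'*', 'id'}, 'R': {'*', 'id'}}

def first_of(symbols):
    # FIRST of βa; valid here because no symbol of G2' derives λ,
    # so only the first symbol matters.  A terminal is its own FIRST.
    X = symbols[0]
    return FIRST.get(X, {X})

def closure(items):
    items = set(items)
    while True:
        new = set()
        for (A, body, d, a) in items:
            if d < len(body) and body[d] in NONTERMS:
                # for each b in FIRST(βa), add [B -> .γ, b]
                for b in first_of(body[d + 1:] + (a,)):
                    for (h, rhs) in G2:
                        if h == body[d]:
                            new.add((h, rhs, 0, b))
        if new <= items:
            return frozenset(items)
        items |= new

I0 = closure({("S'", ('S',), 0, '$')})
```

The computed I0 contains, among others, [L→.*R, =] and [L→.*R, $] but not [R→.L, =], matching the closure listed in the example that follows.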
Example: This example uses grammar G2'
closure({[S'→.S, $]}) = {[S'→.S, $], [S→.L=R, $], [S→.R, $], [L→.*R, =], [L→.id, =], [R→.L, $], [L→.*R, $], [L→.id, $]}
FIRST($) = {$}
FIRST(=R$) = {=}
The Goto Operation
goto(I, X) is defined as the closure of the set of all items [A→αX.β, a] such that [A→α.Xβ, a] is in I
Example: goto(I0, S) = {[S'→S., $]}
The Sets-of-Items Construction
Below is an algorithm to construct C, the canonical collection of sets of LR(1) items for an augmented grammar G'
You can note that it is the same as the one used for the construction of the canonical collection of sets of LR(0) items; only the goto and closure operations differ

procedure items(G');
begin
  C := {closure({[S'→.S, $]})};
  repeat
    for each set of items I in C and each grammar symbol X
        such that goto(I, X) is not empty and not in C do
      add goto(I, X) to C
  until no more sets of items can be added to C
end

C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12, I13}
I0 = {[S'→.S, $], [S→.L=R, $], [S→.R, $], [L→.*R, =/$], [L→.id, =/$], [R→.L, $]}
I1 = goto(I0, S) = {[S'→S., $]}
I2 = goto(I0, L) = {[S→L.=R, $], [R→L., $]}
I3 = goto(I0, R) = {[S→R., $]}
I4 = goto(I0, *) = {[L→*.R, =/$], [L→.*R, =/$], [L→.id, =/$], [R→.L, =/$]}
I5 = goto(I0, id) = {[L→id., =/$]}
I6 = goto(I2, =) = {[S→L=.R, $], [R→.L, $], [L→.*R, $], [L→.id, $]}
I7 = goto(I4, R) = {[L→*R., =/$]}
I8 = goto(I4, L) = {[R→L., =/$]}
goto(I4, *) = I4; goto(I4, id) = I5
I9 = goto(I6, R) = {[S→L=R., $]}
I10 = goto(I6, L) = {[R→L., $]}
I11 = goto(I6, *) = {[L→*.R, $], [L→.*R, $], [L→.id, $], [R→.L, $]}
I12 = goto(I6, id) = {[L→id., $]}
goto(I11, *) = I11
goto(I11, id) = I12
goto(I11, L) = I10
I13 = goto(I11, R) = {[L→*R., $]}
Construction of the LR Parsing Table
1. Construct C = {I0, I1, . . ., In}, the collection of sets of LR(1) items for G'
2. State i of the parser is constructed from Ii. The parsing actions for state i are determined as follows:
   a. If [A→α.aβ, b] is in Ii and goto(Ii, a) = Ij (a is a terminal), then action[i, a] = shift j
   b. If [A→α., a] is in Ii and A≠S', then action[i, a] = reduce A→α
   c. If [S'→S., $] is in Ii, then action[i, $] = accept
If there is a conflict, the grammar is not LR(1)
3. If goto(Ii, A) = Ij, then goto[i, A] = j
4. All entries not defined by (2) and (3) are made error
5. The initial state is the one constructed from the set of items containing [S'→.S, $]
The table constructed using this method is called the canonical LR(1) parsing table
LALR Parsing
The third LR parsing table construction to be seen is the LALR parsing table construction
This is the most widely used method, since the table obtained using it is much smaller than that obtained by the canonical LR method
For example, for a language such as Pascal, the LALR parsing table will have some hundreds of states, while the canonical LR parsing table will have several thousands of states
Many more grammars are LALR than SLR, but there are only a few grammars that are LR but not LALR, and these grammars can be easily avoided
Easy algorithm for constructing an LALR parsing table
Definition: We call the core of a set of LR(1) items the set of the first components of its items
Example: In the canonical collection of sets of LR(1) items of the previous example, both I4 and I11 have the following identical core:
{[L→*.R], [L→.*R], [L→.id], [R→.L]}
Method
Refer to Algorithm 4.11 on page 238 of the textbook.
Exercise: Construct the LALR parsing table for the previous grammar.
Some remarks on constructing LALR parsing tables:
The above technique is not the best one: it takes more time to construct the LALR table this way than to construct the canonical LR table itself. In practice, we do not use the LR(1) items to construct the LALR table; the LALR items are constructed directly, which takes much less time. Another problem is the size of the table: an LALR table may have 20,000 entries. In this case it is important to use compression techniques so that the table occupies less space.
For example, rather than storing the same information in several entries, a pointer to the information may be stored in the table entries
We can also minimize the space needed to store the table by storing a list of actions rather than storing all the entries even when they are empty
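The core-merging idea behind LALR construction can be sketched in Python. The item representation (tuples) is my own; the two states shown are I4 and I11 of the example, which share a core.

```python
# Sketch: merge LR(1) item sets that share the same core (the items
# without their lookaheads), unioning the lookaheads, as in LALR
# construction.

def core(item_set):
    return frozenset((h, b, d) for h, b, d, _ in item_set)

def merge_states(states):
    """states: dict state_id -> list of (head, body, dot, lookahead)."""
    merged = {}   # core -> {(head, body, dot): set of lookaheads}
    groups = {}   # core -> list of original state ids merged together
    for sid, iset in states.items():
        c = core(iset)
        groups.setdefault(c, []).append(sid)
        tbl = merged.setdefault(c, {})
        for h, b, d, la in iset:
            tbl.setdefault((h, b, d), set()).add(la)
    return merged, groups

# I4 and I11 of the example grammar have the same core:
states = {
    4: [("L", ("*", "R"), 1, "="), ("L", ("*", "R"), 1, "$"),
        ("L", ("*", "R"), 0, "="), ("L", ("*", "R"), 0, "$"),
        ("L", ("id",), 0, "="), ("L", ("id",), 0, "$"),
        ("R", ("L",), 0, "="), ("R", ("L",), 0, "$")],
    11: [("L", ("*", "R"), 1, "$"), ("L", ("*", "R"), 0, "$"),
         ("L", ("id",), 0, "$"), ("R", ("L",), 0, "$")],
}
merged, groups = merge_states(states)
```

Here the two states collapse into a single LALR state whose item [L→*.R] carries the merged lookahead set {=, $}.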
4. Syntax-Directed Translation
Parsing an input without doing anything with it is useless. In practice, an input is parsed so that some code is generated, information is displayed, etc. These different actions are performed by the semantic actions associated with the different rules of the grammar. To execute these actions, the parse tree may or may not be constructed explicitly; the best solution performance-wise is to execute the actions while parsing.
4.1. Syntax Directed Definitions
A syntax directed definition is a generalization of the CFG in which each grammar symbol has an associated set of attributes (synthesized and inherited) and rules
An attribute can represent anything we choose (a string, a number, a type, a memory location, etc.).
The value of a synthesized attribute at a node is computed from the values of attributes at the children of that node in the parse tree (and at the node itself).
The value of an inherited attribute at a node is computed from the values of attributes at the siblings and/or parent of that node in the parse tree.
Semantic rules compute the values of attributes and hence set up dependencies between attributes, which are represented by a dependency graph.
Terminals can have synthesized attributes, but not inherited attributes.
The dependency graph makes it possible to find an evaluation order for the semantic rules.
A parse tree showing the values of the attributes is called an annotated or decorated parse tree.
Example of an attributed grammar:
/* Two attributes are defined for this grammar: v (value) and l (length); 2^L2.l denotes 2 to the power L2.l */
Syntax Rules      Semantic Rules
N → L1 . L2       N.v = L1.v + L2.v / 2^L2.l
L1 → L2 B         L1.v = 2 * L2.v + B.v
                  L1.l = L2.l + 1
L → B             L.v = B.v
                  L.l = 1
B → 0             B.v = 0
B → 1             B.v = 1
In the above example, everything is calculated from the leaves to the root; all attributes are synthesized, so all the attributes can be calculated in a single bottom-up pass.
Example of another attributed grammar:
/* Three attributes are defined for this grammar: v (value), l (length) and s (scale); 2^B.s denotes 2 to the power B.s */
Syntax Rules      Semantic Rules
N → L1 . L2       N.v = L1.v + L2.v
                  L1.s = 0
                  L2.s = -L2.l
L1 → L2 B         L1.l = L2.l + 1
                  L2.s = L1.s + 1
                  B.s = L1.s
                  L1.v = L2.v + B.v
L → B             L.v = B.v
                  L.l = 1
                  B.s = L.s
B → 0             B.v = 0
B → 1             B.v = 1 * 2^B.s
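The first grammar's purely synthesized evaluation can be sketched in Python. The function names are mine; the code mimics the rules N.v = L1.v + L2.v / 2^L2.l and L1.v = 2·L2.v + B.v bottom-up on a binary numeral.

```python
# Sketch: bottom-up evaluation of the synthesized attributes v (value)
# and l (length) for a binary numeral such as 1011.01.

def eval_L(bits):
    """Return (v, l) for a bit string, mimicking L -> L B | B."""
    v, l = int(bits[0]), 1              # L -> B : L.v = B.v, L.l = 1
    for b in bits[1:]:                  # L1 -> L2 B : L1.v = 2*L2.v + B.v
        v, l = 2 * v + int(b), l + 1    #             L1.l = L2.l + 1
    return v, l

def eval_N(numeral):
    """N -> L1 . L2 : N.v = L1.v + L2.v / 2^L2.l"""
    left, right = numeral.split(".")
    v1, _ = eval_L(left)
    v2, l2 = eval_L(right)
    return v1 + v2 / 2 ** l2

# eval_N("1011.01") -> 11.25
```

This is exactly the value a decorated parse tree for 1011.01 would show at the root.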
Exercise: Draw the decorated parse tree for the input 1011.01
Formal definition
In a syntax directed definition, each grammar production A→α has associated with it a set of semantic rules of the form b := f(c1, c2, ..., ck), where f is a function and b, c1, ..., ck are attributes of A and of the symbols on the right side of the production.
We say that:
b is a synthesized attribute of A if c1, c2, ..., ck are attributes belonging to the grammar symbols of the production, and
b is an inherited attribute of one of the grammar symbols on the right side of the production if c1, c2, ..., ck are attributes belonging to the grammar symbols of the production.
In either case, we say that attribute b depends on attributes c1, c2, ..., ck.
Exercise: Attributed grammar that calculates the value of an expression.
Note: id.value is the attribute of id that gives its value.
Syntax Rules      Semantic Rules
E → E1 + T        E.v := E1.v + T.v
E → T             E.v := T.v
T → T1 * F        T.v := T1.v * F.v
T → F             T.v := F.v
F → id            F.v := id.value
F → (E)           F.v := E.v
Exercise: Attributed grammar that associates to an identifier its type.
Syntax Rules      Semantic Rules
D → T L           L.in := T.type
T → real          T.type := real
L → ident         ident.type := L.in
L1 → L2 , ident   L2.in := L1.in
                  ident.type := L1.in
Note: Here, the semantic action of an ident may have the side effect of adding the type in the symbol table for that particular identifier.
Dependency Graph
The attributes should be evaluated in a given order because they depend on one another. The dependency of the attributes is represented by a dependency graph.
There is an edge from b(j) to a(i), written b(j) → a(i), if and only if there exists a semantic action such that a(i) := f(..., b(j), ...)
Algorithm for the construction of the dependency graph:
for each node n in the parse tree do
    for each attribute a of the grammar symbol at n do
        construct a node in the dependency graph for a;
for each node n in the parse tree do
    for each semantic rule b := f(c1, c2, ..., ck) associated with the production used at n do
        for i := 1 to k do
            construct an edge from the node for ci to the node for b;
Evaluation Order
A topological sort of a directed acyclic graph is any ordering m1, m2, ..., mk of the nodes of the graph such that edges go from nodes earlier in the ordering to later nodes; i.e., if mi → mj is an edge from mi to mj, then mi appears before mj in the ordering.
Note: It is not possible to find such an order for a cyclic graph.
Several methods have been proposed for evaluating semantic rules:
1. Parse rule based methods: for each input, the compiler finds an evaluation order. These methods fail only if the dependency graph for that particular parse tree has a cycle
2. Rule based methods: the order in which the attributes associated with a production are evaluated is predetermined at compiler-construction time. For this method, the dependency graph need not be constructed
3. Oblivious methods: the evaluation order is chosen without considering the semantic rules. This restricts the class of syntax directed definitions that can be used
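Finding an evaluation order from the dependency graph can be sketched as a topological sort (Kahn's algorithm). The node names below are hypothetical attribute instances; a cycle means no evaluation order exists, as noted above.

```python
# Sketch: topological sort of an attribute dependency graph, yielding
# an evaluation order for the semantic rules.
from collections import deque

def topo_order(nodes, edges):
    """edges: list of (c, b) meaning attribute b depends on attribute c."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for c, b in edges:
        succ[c].append(b)
        indeg[b] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cyclic dependency graph: no evaluation order")
    return order

# e.g. topo_order(["B.v", "L.v", "L.l", "N.v"],
#                 [("B.v", "L.v"), ("L.v", "N.v"), ("L.l", "N.v")])
# evaluates B.v before L.v, and both L.v and L.l before N.v
```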
4.2. Some Classes of Non-Circular Attributed Grammars
4.2.1. S-Attributed Grammars
We say that an attributed grammar is S-Attributed when all of its attributes are synthesized; i.e., it doesn't have inherited attributes. Synthesized attributes can be evaluated by a bottom-up parser as the input is being parsed
A new stack will be maintained to store the values of the attributes, as in the following example.
Example:
E → E1 + E2    {E.v = E1.v + E2.v}
               Val(newTop) = Val(oldTop) + Val(oldTop − 2)
               $$ = $1 + $3 (in Yacc)
We assume that the synthesized attributes are evaluated just before each reduction. Before the reduction, the attribute of E2 is in Val(Top) and that of E1 is in Val(Top − 2); the '+' token occupies Val(Top − 1).
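The value-stack discipline can be sketched in Python. The shift/reduce function names are mine; the reduce step mirrors $$ = $1 + $3 for E → E1 + E2.

```python
# Sketch: a value stack (Val) maintained in parallel with the parser's
# state stack.  On reducing E -> E1 + E2, E2's value is at the top and
# E1's value two slots below ('+' sits between them).

val = []                      # the Val stack

def shift(v):
    val.append(v)             # push the token's attribute value

def reduce_plus():
    """Reduce E -> E1 + E2 and push E.v, i.e. $$ = $1 + $3."""
    e2 = val.pop()            # $3
    val.pop()                 # the '+' token
    e1 = val.pop()            # $1
    val.append(e1 + e2)       # $$

# Bottom-up parse of "1 + 2":
shift(1); shift("+"); shift(2)
reduce_plus()
# val == [3]
```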
Val The semantic actions that reference the attributes of the grammar will in fact be translated by the
Compiler generator (such as Yacc) into codes that reference the value stack4.2.2. L-Attributed Grammars
It is difficult to execute all the tasks of the compiler with synthesized attributes alone. The L-attributed (L stands for left) class of grammars allows a limited kind of inherited attributes.
Definition: A grammar is L-attributed if and only if, for each rule X0→X1 X2 ... Xj ... Xn, all inherited attributes of Xj depend only on:
1. Attributes of X1, ..., Xj-1
2. Inherited attributes of X0
Of course, all S-attributed grammars are L-attributed.
Example:
A → L M    {L.h = f1(A.h)
            M.h = f2(L.s)
            A.s = f3(M.s)}
This production does not contradict the rules of L-attributed grammars. Therefore, the corresponding grammar may be L-attributed if all of the other productions follow the rule of L-attributed grammars
Example:
A → Q R    {R.h = f4(A.h)
            Q.h = f5(R.s)
            A.s = f6(Q.s)}
The grammar containing this production is not L-attributed, since the inherited attribute Q.h depends on R.s, an attribute of a symbol to its right, which contradicts the rule for L-attributed grammars
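The L-attributed condition can be sketched as a small check in Python. The dependency encoding (1-based symbol positions, 0 for the head X0) is my own; the two test cases correspond to the A → L M and A → Q R productions above.

```python
# Sketch: check the L-attributed condition for one production.
# Each dependency (j, k, kind) says an inherited attribute of the
# symbol at position j depends on an attribute of the symbol at
# position k (0 = head X0); kind is "inh" or "syn".

def is_l_attributed(deps):
    for j, k, kind in deps:
        if k == 0 and kind == "inh":   # inherited attribute of X0: allowed
            continue
        if 1 <= k < j:                 # attribute of X1..X(j-1): allowed
            continue
        return False                   # anything else violates the rule
    return True

# A -> L M : L.h = f1(A.h), M.h = f2(L.s)  -> L-attributed
# is_l_attributed([(1, 0, "inh"), (2, 1, "syn")]) is True
# A -> Q R : Q.h = f5(R.s) depends on a symbol to its right -> not L-attributed
# is_l_attributed([(2, 0, "inh"), (1, 2, "syn")]) is False
```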
5. Type Checking
The compiler must check whether the source program follows the semantic conventions of the source language. This is called static checking (to distinguish it from dynamic checking, executed during execution of the program). Static checking ensures that certain kinds of errors are detected and reported. The following are examples of static checks:
1. Type Checks. A compiler should report an error if an operator is applied to an incompatible operand
2. Flow-of-Control Checks. Statements that cause flow of control to leave a construct must have some place to which to transfer the flow of control. For example, a break statement in C must be enclosed in a loop or switch statement
3. Uniqueness Checks. There are situations in which an object must be defined exactly once. For example, in Pascal, an identifier must be declared uniquely, labels in a case statement must be distinct, and elements in a scalar type may not be repeated
4. Name-Related Checks. Sometimes, the same name must appear two or more times. For example, in Ada, a loop or block may have a name that appears at the beginning and end of the construct. The compiler must check that the same name is used at both places
A type checker verifies that the type of a construct matches the type expected by its context. For example, a type checker should verify that the type of the value assigned to a variable is compatible with the type of the variable
Type information produced by the type checker may be needed when the code is generated. It is especially important for overloaded and polymorphic functions. An overloaded function represents different operations in different contexts (e.g., the + operator). A polymorphic function can be executed with arguments of several types.
5.1. Type Systems
In almost all languages, types are either basic or constructed. In Pascal, basic types are boolean, character, integer and real. Enumerated types and subranges can also be considered basic types. Pascal also allows a programmer to construct types from basic and other constructed types; types produced in such a way are called constructed types. For example, arrays, records and sets are constructed types.
Type Expressions
The type of a language construct will be denoted by a type expression. A type expression is either a basic type or is formed by applying an operator called a type constructor to other type expressions. In this chapter the following are type expressions:
1. A basic type is a type expression
2. A type name is a type expression
3. A type constructor applied to type expressions is a type expression. Constructors include:
   a. Arrays. If I is an index set and T is a type expression, then array (I, T) is a type expression
b. Products. If T1 and T2 are type expressions, the Cartesian product T1 × T2 is a type expression. × is left associative
c. Records. The difference between products and records is that record fields have names. For example, in Pascal:
type row = record
         address: integer;
         lexeme: array [1 .. 15] of char;
      end;
var table : array [1 .. 101] of row;
The type expression corresponding to the above type will be:
record ((address × integer) × (lexeme × array (1..15, char)))
d. Pointers. If T is a type expression, then pointer(T) is a type expression
e. Functions. The type expression of a function has the form D→R, where D is the type expression of the parameters and R is the type expression of the returned value. For example, the type expression corresponding to the Pascal declaration
       function f (a, b : char): ^integer;
   is
       char × char → pointer (integer)
A convenient way to represent a type expression is to use a graph (tree or DAG). For example, the type expression corresponding to the above function declaration can be represented with the tree shown on page 347 of the textbook
4. Type expressions may contain variables whose values are type expressions
Type Systems
A type system is a collection of rules implemented by the type checker for assigning type expressions to the various parts of a program
Different type systems may be used by different compilers for the same language; for example, some compilers implement stricter rules than others. On UNIX systems, for instance, the lint utility examines C programs for possible bugs using a more detailed type system than the one the C compiler itself uses.
Error Recovery
At the very least, the compiler must report the nature and location of errors. It is also desirable that the type checker recovers from errors and continues checking the rest of the input. The inclusion of error handling may result in a type system that goes beyond the one needed to specify correct programs. For example, we may not know the type of incorrectly formed program fragments; in this case, techniques similar to those needed for languages that do not require identifiers to be declared will have to be used to cope with the missing information.
5.2. Specification of a simple type checker
The following example gives the type system for a Pascal grammar specified in a syntax directed manner
Expressions:
E → literal      {E.type := char}
E → num          {E.type := integer}
E → id           {E.type := lookup (id.entry)}
E → E1 mod E2    {E.type := if E1.type = integer and E2.type = integer then integer
                            else type_error}
E → E1[E2]       {E.type := if E2.type = integer and E1.type = array(s, t) then t
                            else type_error}
E → ^E1          {E.type := if E1.type = pointer(t) then t
                            else type_error}
Statements:
S → id := E          {S.type := if id.type = E.type then void
                                else type_error}
S → if E then S1     {S.type := if E.type = boolean then S1.type
                                else type_error}
S → while E do S1    {S.type := if E.type = boolean then S1.type
                                else type_error}
S → S1 ; S2          {S.type := if S1.type = void and S2.type = void then void
                                else type_error}
The following rule gives the type checking for function calls:
E → E1 (E2)          {E.type := if E2.type = s and E1.type = s→t then t
                                else type_error}
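A few of the rules above can be sketched as a recursive checker in Python. The AST encoding (nested tuples), the environment dictionary, and the type representation are assumptions of mine, not part of the course notes.

```python
# Sketch: a recursive type checker implementing some of the rules above.
# Types are strings or tuples: ("array", s, t), ("pointer", t).

TYPE_ERROR = "type_error"
env = {"i": "integer", "a": ("array", "integer", "char"), "b": "boolean"}

def check_expr(e):
    kind = e[0]
    if kind == "num":
        return "integer"
    if kind == "id":
        return env[e[1]]                      # lookup(id.entry)
    if kind == "mod":                         # E -> E1 mod E2
        t1, t2 = check_expr(e[1]), check_expr(e[2])
        return "integer" if t1 == t2 == "integer" else TYPE_ERROR
    if kind == "index":                       # E -> E1[E2]
        t1, t2 = check_expr(e[1]), check_expr(e[2])
        if t2 == "integer" and isinstance(t1, tuple) and t1[0] == "array":
            return t1[2]                      # the element type t
        return TYPE_ERROR
    raise ValueError(kind)

def check_stmt(s):
    if s[0] == "assign":                      # S -> id := E
        return "void" if env[s[1]] == check_expr(s[2]) else TYPE_ERROR
    if s[0] == "while":                       # S -> while E do S1
        return check_stmt(s[2]) if check_expr(s[1]) == "boolean" else TYPE_ERROR
    if s[0] == "seq":                         # S -> S1 ; S2
        ok = check_stmt(s[1]) == "void" == check_stmt(s[2])
        return "void" if ok else TYPE_ERROR
    raise ValueError(s[0])
```

For instance, indexing the array a with the integer i yields char, while using the integer i as a while-condition yields type_error.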
5.3. Equivalence of Type Expressions
In the above sections we compared type expressions using the equality operator. However, such an operator is not defined except perhaps between basic types; we should rather use a more appropriate notion of equivalence. A natural notion is structural equivalence: two type expressions are structurally equivalent if and only if they are the same basic type or are formed by applying the same constructor to structurally equivalent types.
For example, integer is equivalent only to integer and pointer (integer) is structurally equivalent to pointer (integer)
Some relaxing rules are very often added to this notion of equivalence For example, when arrays are passed as parameter, the array boundaries of the effective
parameters may not be the same as those of the formal parameters. In some languages, types may be given names. For example in Pascal we can define:
type
    ptr = ^integer;
var
    A : ptr;
    B : ptr;
    C : ^integer;
    D, E : ^integer;
Do the variables A, B, C, D, E have the same type? Surprisingly, the answer varies from implementation to implementation. When names are allowed in type expressions, a new notion of equivalence is introduced: name equivalence. Two type expressions are name equivalent if and only if they are identical. Under structural equivalence, names are replaced by the type expressions they define; two types are structurally equivalent if they represent structurally equivalent type expressions when all names have been substituted.
For example, ptr and pointer (integer) are not name equivalent, but they are structurally equivalent.
Note: Confusion arises from the fact that many implementations associate an implicit type name with each declared identifier. Thus, C and D of the above example may not be name equivalent.
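The two notions of equivalence can be sketched in Python. The tuple encoding of type expressions and the defs dictionary are assumptions of mine; the example mirrors the ptr = ^integer declaration above.

```python
# Sketch: structural vs. name equivalence for type expressions.
# A named type is ("name", n); defs maps names to their definitions.

defs = {"ptr": ("pointer", "integer")}         # type ptr = ^integer

def expand(t):
    """Substitute type names by the expressions they define."""
    if isinstance(t, tuple) and t[0] == "name":
        return expand(defs[t[1]])
    if isinstance(t, tuple):
        return tuple(expand(x) for x in t)
    return t

def name_equiv(t1, t2):
    return t1 == t2                            # identical expressions only

def struct_equiv(t1, t2):
    return expand(t1) == expand(t2)            # equal after substitution

A = ("name", "ptr")                            # A : ptr
C = ("pointer", "integer")                     # C : ^integer
# name_equiv(A, C) is False, but struct_equiv(A, C) is True
```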
5.4. Type Conversions
Consider an expression like x + i, where x is of type real and i is of type integer. The machine cannot execute this operation directly, as it involves different types of values. However, most languages accept such expressions; the compiler is in charge of converting one of the operands into the type of the other. The type checker can be used to insert these conversion operations into the intermediate representation of the source program. For example, an operator inttoreal may be inserted whenever an operand needs to be implicitly converted.
6. Intermediate Code Generation
In a compiler, the front end translates a source program into an intermediate representation, and the back end generates the target code from this intermediate representation. The benefits of using a machine-independent intermediate code (IC) are:
retargeting to another machine is facilitated
optimization can be done on the machine-independent code
In this chapter, we assume that error checking is done in another pass, even though in practice IC generation and type checking can be done at the same time.
6.1. Intermediate Languages
Syntax trees, postfix notation, and three-address code are intermediate representations.
Syntax Tree
A syntax tree (abstract tree) is a condensed form of parse tree useful for representing language constructs
For example, for the string a + b, the parse tree in (a) below will be represented by the syntax tree shown in (b); in a syntax tree, operators and keywords do not appear as leaves, but rather are associated with the interior node that would be the parent of those leaves in the parse tree
a. Parse tree for a + b        b. Abstract syntax tree for a + b
Postfix Notation
The postfix notation is practical for an intermediate representation as the operands are found just before the operator
In fact, the postfix notation is a linearized representation of a syntax tree; e.g., (1 + 2) * 3 is represented in postfix notation as 1 2 + 3 *
Three-Address Code
The three-address code is a sequence of statements of the form:
X := Y op Z
where X, Y, and Z are names, constants or compiler-generated temporaries, and op is an operator such as an integer or floating-point arithmetic operator, or a logical operator on boolean data.
Important notes:
No built-up arithmetic expression is permitted: only one operator on the right side of the assignment is possible, i.e., x + y + z is not possible in a single statement.
Like postfix notation, the three-address code is a linearized representation of a syntax tree. It has been given the name three-address code because each instruction usually contains three addresses (the two operands and the result).
Types of Three-Address Statements
As with assembler statements, three-address code statements can have symbolic labels, and there are statements for control flow
Common three-address code statements:
Statement                            Format                 Comments
1. Assignment (binary operation)     X := Y op Z            Arithmetic and logical operators used
2. Assignment (unary operation)      X := op Y              Unary -, not, conversion operators used
3. Copy statement                    X := Y
4. Unconditional jump                goto L
5. Conditional jump                  if X relop Y goto L
6. Function call                     param X1               The parameters are specified by param
                                     param X2               The procedure p is called by indicating
                                     ...                    the number of parameters n
                                     param Xn
                                     call p, n
7. Indexed assignments               X := Y[I]              X is assigned the value at the address Y + I
                                     Y[I] := X              The value at the address Y + I is assigned X
8. Address and pointer assignments   X := &Y                X is assigned the address of Y
                                     X := *Y                X is assigned the value at the address Y
                                     *X := Y                The value at the address X is assigned Y
The choice of allowable operators is an important issue in the design of an intermediate form. It should be rich enough to implement the operations of the source language, and yet it should not be too complicated to translate into the target language.
Syntax-Directed Translation into Three-Address Code
Syntax directed translation can be used to generate the three-address code. Generally, either the three-address code is generated as an attribute of the attributed parse tree, or the semantic actions have side effects that write the three-address code statements in a file. When the three-address code is generated, it is often necessary to use temporary variables and temporary names. To this end the following functions are given:
newtemp - each time this function is called, it gives distinct names that can be used for temporary variables
newlabel - each time this function is called, it gives distinct names that can be used for label names
In addition, for convenience, we use the notation gen to create a three-address code from a number of strings
gen will produce a three-address code after concatenating all the parameters
For example, if id1.lexeme = x, id2.lexeme =y and id3.lexeme = z: gen (id1.lexeme, ‘:=’, id2.lexeme, ‘+’, id3.lexeme) will produce the three-address code: x := y + z
Note: variables and attribute values are evaluated by gen before being concatenated with the other parameters
Example: Generation of the three-address code for an assignment statement and an expression
Production     Semantic Rules
S → id := E    S.code := E.code || gen (id.lexeme, ‘:=’, E.place)
E → E1 + E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘+’, E2.place)
E → E1 * E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘*’, E2.place)
E → - E1       E.place := newtemp;
               E.code := E1.code || gen (E.place, ‘:=’, ‘uminus’, E1.place)
E → (E1)       E.place := E1.place;
               E.code := E1.code
E → id         E.place := id.lexeme;
               E.code := ‘’ /* empty code */
The attribute place holds the name of the location where the value of the grammar symbol is found; the attribute code holds the sequence of three-address statements evaluating the grammar symbol.
The function newtemp returns a sequence of distinct names t1, t2, ... in response to successive calls.
The three-address code produced for the input a := x + y * z will be:
t1 := y * z
t2 := x + t1
a := t2
This is found by concatenating, at the semantic action of each rule, the code that has been previously calculated to the code of this particular rule
For example, E→E1 + E2 should add the variable where the result of E1 is and the variable where the result of E2 is and put the result in a temporary variable
We can determine the names of the variables that contain the results of E1 and E2 by using synthesized attributes named place
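The place/code scheme can be sketched in Python. The nested-tuple expression representation and the function names are assumptions of mine; the logic follows the semantic rules above.

```python
# Sketch: syntax-directed generation of three-address code for
# expressions, following the place/code attribute rules.

temp_count = 0
def newtemp():
    """Return distinct names t1, t2, ... on successive calls."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen_expr(e):
    """Return (place, code) for an expression.
    e is an identifier string or a tuple (op, left, right)."""
    if isinstance(e, str):                 # E -> id
        return e, []
    op, l, r = e
    lp, lc = gen_expr(l)
    rp, rc = gen_expr(r)
    place = newtemp()                      # E.place := newtemp
    return place, lc + rc + [f"{place} := {lp} {op} {rp}"]

def gen_assign(target, e):                 # S -> id := E
    place, code = gen_expr(e)
    return code + [f"{target} := {place}"]

code = gen_assign("a", ("+", "x", ("*", "y", "z")))
# code == ["t1 := y * z", "t2 := x + t1", "a := t2"]
```

Running this on a := x + y * z reproduces exactly the three statements shown above.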
Example: Semantic rules generating code for a while statement
Production           Semantic Rules
S → while E do S1    S.begin := newlabel;
                     S.after := newlabel;
                     S.code := gen (S.begin, ‘:’) || E.code || gen (‘if’, E.place, ‘= 0 goto’, S.after) || S1.code || gen (‘goto’, S.begin) || gen (S.after, ‘:’)
The above semantic rules will create three-address code of the following form:
L1:
    E.code
    if E.place = 0 goto L2
    S1.code
    goto L1
L2:
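The while rule can likewise be sketched in Python. The function names are mine; the code of E and S1 and E's place are taken as already-generated inputs.

```python
# Sketch: code generation for "while E do S1" using fresh labels,
# given the already-generated code of E and S1 and E.place.

label_count = 0
def newlabel():
    """Return distinct label names L1, L2, ... on successive calls."""
    global label_count
    label_count += 1
    return f"L{label_count}"

def gen_while(e_code, e_place, s1_code):
    begin, after = newlabel(), newlabel()      # S.begin, S.after
    return ([f"{begin}:"] + e_code
            + [f"if {e_place} = 0 goto {after}"]
            + s1_code
            + [f"goto {begin}", f"{after}:"])

code = gen_while(["t1 := i < n"], "t1", ["i := i + 1"])
# ["L1:", "t1 := i < n", "if t1 = 0 goto L2", "i := i + 1", "goto L1", "L2:"]
```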
6.2. Declarations
Declarations are used by the compiler as a source of type information that it stores in the symbol table. While processing declarations, the compiler reserves a memory area for the variables and stores the relative address of each variable in the symbol table. The relative address is an offset from the start of the static data area.
We use in this section a number of variables, attributes and procedures that help in processing declarations:
The compiler maintains a global variable offset that indicates the first address not yet allocated. Initially, offset is 0; each time an address is allocated to a variable, offset is incremented by the width of the data object denoted by the name.
The procedure enter(name, type, address) creates a symbol table entry for name, giving it the type type and the relative address address.
The synthesized attributes type and width of the non-terminal T indicate the type and the number of memory units taken by objects of that type.
Example: Semantic actions for the declaration part. We consider that an integer and a pointer occupy 4 bytes and a real number occupies 8 bytes of memory.
Production             Semantic Rules
P → {offset := 0} D S
D → D ; D
D → id : T             {enter (id.name, T.type, offset);
                        offset := offset + T.width}
T → integer            {T.type := integer;
                        T.width := 4}
T → real               {T.type := real;
                        T.width := 8}
T → array [num] of T1  {T.type := array (num.val, T1.type);
                        T.width := num.val * T1.width}
T → ^T1                {T.type := pointer (T1.type);
                        T.width := 4}
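The offset bookkeeping of these rules can be sketched in Python. The symbol-table representation (a dictionary) is my own; the widths follow the rules above (integer and pointer: 4 bytes, real: 8, array: num.val times the element width).

```python
# Sketch: processing declarations with a global offset, as in the
# semantic rules above.

symtab = {}            # name -> (type, relative address)
offset = 0             # first address not yet allocated

def width(t):
    if t == "real":
        return 8
    if isinstance(t, tuple) and t[0] == "array":
        return t[1] * width(t[2])              # num.val * T1.width
    return 4                                    # integer, pointer

def enter(name, t):
    """enter(name, type, offset), then bump offset by T.width."""
    global offset
    symtab[name] = (t, offset)
    offset += width(t)

# i : integer; x : real; v : array [10] of integer
enter("i", "integer")
enter("x", "real")
enter("v", ("array", 10, "integer"))
# i at 0, x at 4, v at 12; final offset is 52
```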
Note: In languages where nested procedures are possible, we must have several symbol tables, one for each procedure.
6.3. Assignment Statements
Using the symbol table, we will now see how to generate the three-address code statements corresponding to assignments.
In an earlier example, we have generated three-address code statements where variables are represented by their names
However, it is more common and practical for the implementation to represent the variables by their symbol table entries
The function lookup(lexeme) checks if there is an entry for this occurrence of the name in the symbol table, and if so a pointer to the entry is returned; otherwise nil is returned.
The newtemp function will generate temporary variables and reserve memory for them by updating the offset and recording the reserved addresses in the symbol table
Example: generation of the three-address code for the assignment statement and simple expressions
Production     Semantic Rules
S → id := E    p := lookup (id.name);
               if p <> nil then S.code := E.code || gen (p.lexeme, ‘:=’, E.place)
               else error
E → E1 + E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘+’, E2.place)
E → E1 * E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘*’, E2.place)
E → - E1       E.place := newtemp;
               E.code := E1.code || gen (E.place, ‘:=’, ‘uminus’, E1.place)
E → (E1)       E.place := E1.place;
               E.code := E1.code
E → id         p := lookup (id.lexeme);
               if p <> nil then E.place := p.lexeme else error;
               E.code := ‘’ /* empty code */
Addressing Array Elements
The elements of an array are stored in a block of consecutive locations. If the width of each array element is w, the relative address of the array is base, and the lower bound of the index is low, then the i-th element of the array is found at the address:
base + (i − low) * w
For example, for an array declared as A : array [5..10] of integer stored at the address 100, the address of A[7] = 100 + (7 − 5) * 4 = 108
When the index is a constant, as above, it is possible to evaluate the address of A[i] at compile time. Most of the time, however, the value of the index is not known at compile time, and the compiler can only produce the three-address statements that will calculate the address at execution time. Even in this case, part of the calculation can be done at compile time: since base + (i − low) * w = i * w + (base − low * w), the subexpression base − low * w can be calculated at compile time, saving time at execution.
Note: The function width(arrayname) returns the width of an element of the array called arrayname by looking in the symbol table. The function base(arrayname) returns the base address of the array called arrayname by looking in the symbol table. The function low(arrayname) similarly returns its lower bound.
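The split between the compile-time and run-time parts of the address computation can be sketched in Python; the function names are mine.

```python
# Sketch: splitting base + (i - low) * w into a compile-time constant
# c = base - low * w and a run-time computation i * w + c.

def compile_time_part(base, low, w):
    return base - low * w            # known once the declaration is seen

def element_address(i, w, c):
    return i * w + c                 # the only work left at run time

# A : array [5..10] of integer stored at address 100 (w = 4):
c = compile_time_part(100, 5, 4)     # c = 80
# element_address(7, 4, c) gives the same 108 as 100 + (7 - 5) * 4
```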
Example: generation of the three-address code for assignment statements and expressions (including array references)

Production     Semantic Rules
S → L := E     if L.offset = nil then /* L is a simple id */
                   S.code := L.code || E.code || gen (L.place, ‘:=’, E.place)
               else
                   S.code := L.code || E.code || gen (L.place, ‘[’, L.offset, ‘] :=’, E.place)
E → E1 + E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘+’, E2.place)
E → E1 * E2    E.place := newtemp;
               E.code := E1.code || E2.code || gen (E.place, ‘:=’, E1.place, ‘*’, E2.place)
E → - E1       E.place := newtemp;
               E.code := E1.code || gen (E.place, ‘:=’, ‘uminus’, E1.place)
E → (E1)       E.place := E1.place;
               E.code := E1.code
E → L          if L.offset = nil then /* L is a simple id */
               begin
                   E.place := L.place;
                   E.code := L.code
               end
               else
               begin
                   E.place := newtemp;
                   E.code := L.code || gen (E.place, ‘:=’, L.place, ‘[’, L.offset, ‘]’)
               end
L → id [E]     L.place := newtemp;
               L.offset := newtemp;
               L.code := E.code || gen (L.place, ‘:=’, base (id.lexeme) − width (id.lexeme) * low (id.lexeme)) || gen (L.offset, ‘:=’, E.place, ‘*’, width (id.lexeme))
L → id         p := lookup (id.lexeme);
               if p <> nil then L.place := p.lexeme else error;
               L.offset := nil; /* for a simple identifier */
               L.code := ‘’ /* empty code */
Example: three-address code generation for the input x := A[y]
A is stored at the address 100, its values are integers (width = 4) and low = 1, so base − low * width = 96. The semantic actions will generate the following three-address code:
t1 := 96
t2 := y * 4
t3 := t1[t2]
x := t3
Exercise: produce the attributed (decorated) parse tree.
Example: three-address code generation for the input tab1[i + k] := x + tab2[j]
tab1 is stored at the address 100 and its values are integers (width = 4, low = 1)
tab2 is stored at the address 200 and its values are integers (width = 4, low = 1)
The semantic actions will generate the following three-address code:
t4 := i + k
t5 := 96
t6 := t4 * 4
t1 := 196
t2 := j * 4
t3 := t1[t2]
t7 := x + t3
t5[t6] := t7
Exercise: produce the attributed (decorated) parse tree.
6.4. Backpatching
The main problem for generating code for control statements in a single pass is that, during one single pass, we may not know the labels where the control must go at the time the jump statements are generated
We can solve this problem by generating jump statements where the targets are temporarily left unspecified
Each such statement will be put on a list of goto statements whose labels will be filled when determined
We call this backpatching and it is widely used in three-address code generation
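The backpatching idea can be sketched in Python. The helper names follow the usual makelist/merge/backpatch vocabulary; the instruction encoding (strings with a "_" placeholder for the unknown target) is an assumption of mine.

```python
# Sketch: backpatching.  Jumps are emitted with an unspecified target
# ("_"), remembered on a list, and patched once the label is known.

code = []

def emit(instr):
    code.append(instr)
    return len(code) - 1                  # index of the new instruction

def makelist(i):
    return [i]                            # a list holding one jump to patch

def merge(l1, l2):
    return l1 + l2

def backpatch(lst, target):
    for i in lst:
        code[i] = code[i].replace("_", str(target))

# Emit two jumps with unknown targets, then fill them in later:
truelist = makelist(emit("if x goto _"))
falselist = makelist(emit("goto _"))
backpatch(truelist, 100)
backpatch(falselist, 104)
# code == ["if x goto 100", "goto 104"]
```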
7. Code Generation
The final phase in our compiler model is the code generator. It takes as input an intermediate representation of the source program and produces as output an equivalent target program, as indicated in Figure 7.1.
Figure 7.1: Position of code generator
8. Code Optimization