

COMPILER DESIGN

G. Appasami, M.Sc., M.C.A., M.Phil., M.Tech., (Ph.D.)

Assistant Professor

Department of Computer Science and Engineering

Dr. Paul’s Engineering College

Pauls Nagar, Villupuram

Tamilnadu, India

SARUMATHI PUBLICATIONS

Villupuram, Tamilnadu, India


First Edition: July 2015

Second Edition: April 2016

Published By

SARUMATHI PUBLICATIONS

© All rights reserved. No part of this publication may be reproduced or stored in any form or by any means of photocopy, recording or otherwise without the prior written permission of the author.

Price Rs. 101/-

Copies can be had from

SARUMATHI PUBLICATIONS

Villupuram, Tamilnadu, India.

[email protected]

Printed at

Meenam Offset

Pondicherry – 605001, India


CS6660 COMPILER DESIGN L T P C 3 0 0 3

UNIT I INTRODUCTION TO COMPILERS 5

Translators-Compilation and Interpretation-Language processors -The Phases of Compiler-

Errors Encountered in Different Phases-The Grouping of Phases-Compiler Construction

Tools - Programming Language basics.

UNIT II LEXICAL ANALYSIS 9

Need and Role of Lexical Analyzer-Lexical Errors-Expressing Tokens by Regular

Expressions- Converting Regular Expression to DFA- Minimization of DFA-Language for

Specifying Lexical Analyzers-LEX-Design of Lexical Analyzer for a sample Language.

UNIT III SYNTAX ANALYSIS 10

Need and Role of the Parser-Context Free Grammars -Top Down Parsing -General

Strategies- Recursive Descent Parser Predictive Parser-LL(1) Parser-Shift Reduce Parser-

LR Parser-LR (0) Item-Construction of SLR Parsing Table -Introduction to LALR Parser -

Error Handling and Recovery in Syntax Analyzer-YACC-Design of a syntax Analyzer for a

Sample Language .

UNIT IV SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT 12

Syntax directed Definitions-Construction of Syntax Tree-Bottom-up Evaluation of S-

Attribute Definitions- Design of predictive translator - Type Systems-Specification of a

simple type checker- Equivalence of Type Expressions-Type Conversions.

RUN-TIME ENVIRONMENT: Source Language Issues-Storage Organization-Storage

Allocation- Parameter Passing-Symbol Tables-Dynamic Storage Allocation-Storage

Allocation in FORTRAN.

UNIT V CODE OPTIMIZATION AND CODE GENERATION 9

Principal Sources of Optimization-DAG- Optimization of Basic Blocks-Global Data Flow

Analysis- Efficient Data Flow Algorithms-Issues in Design of a Code Generator - A Simple

Code Generator Algorithm.

TOTAL: 45 PERIODS

TEXTBOOK:

1. Alfred V Aho, Monica S. Lam, Ravi Sethi and Jeffrey D Ullman, “Compilers –

Principles, Techniques and Tools”, 2nd Edition, Pearson Education, 2007.

REFERENCES:

1. Randy Allen, Ken Kennedy, “Optimizing Compilers for Modern Architectures: A

Dependence-based Approach”, Morgan Kaufmann Publishers, 2002.

2. Steven S. Muchnick, “Advanced Compiler Design and Implementation”, Morgan Kaufmann Publishers - Elsevier Science, India, Indian Reprint 2003.

3. Keith D Cooper and Linda Torczon, “Engineering a Compiler”, Morgan Kaufmann

Publishers Elsevier Science, 2004.

4. Charles N. Fischer, Richard. J. LeBlanc, “Crafting a Compiler with C”, Pearson

Education, 2008.


Acknowledgement

I am very grateful to the management of Paul’s Educational Trust, respected Principal Dr. Y. R. M. Rao, M.E., Ph.D., cherished Dean Dr. E. Mariappane, M.E., Ph.D., and helpful Head of the Department Mr. M. G. Lavakumar, M.E., (Ph.D.).

I thank my colleagues and friends for their cooperation and their support in my

career venture.

I thank my parents and family members for their valuable support in completion of

the book successfully.

I express my special thanks to SARUMATHI PUBLICATIONS for their continued

cooperation in shaping the work.

Suggestions and comments to improve the text are very much solicited.

Mr. G. Appasami


TABLE OF CONTENTS

UNIT I INTRODUCTION TO COMPILERS

1.1 Translators
1.2 Compilation and Interpretation
1.3 Language processors
1.4 The Phases of Compiler
1.5 Errors Encountered in Different Phases
1.6 The Grouping of Phases
1.7 Compiler Construction Tools
1.8 Programming Language basics

UNIT II LEXICAL ANALYSIS
2.1 Need and Role of Lexical Analyzer
2.2 Lexical Errors
2.3 Expressing Tokens by Regular Expressions
2.4 Converting Regular Expression to DFA
2.5 Minimization of DFA
2.6 Language for Specifying Lexical Analyzers-LEX
2.7 Design of Lexical Analyzer for a sample Language

UNIT III SYNTAX ANALYSIS
3.1 Need and Role of the Parser
3.2 Context Free Grammars
3.3 Top Down Parsing - General Strategies
3.4 Recursive Descent Parser
3.5 Predictive Parser
3.6 LL(1) Parser
3.7 Shift Reduce Parser
3.8 LR Parser
3.9 LR (0) Item
3.10 Construction of SLR Parsing Table
3.11 Introduction to LALR Parser
3.12 Error Handling and Recovery in Syntax Analyzer
3.13 YACC
3.14 Design of a Syntax Analyzer for a Sample Language


UNIT IV SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT

4.1 Syntax directed Definitions
4.2 Construction of Syntax Tree
4.3 Bottom-up Evaluation of S-Attribute Definitions
4.4 Design of predictive translator
4.5 Type Systems
4.6 Specification of a simple type checker
4.7 Equivalence of Type Expressions
4.8 Type Conversions
4.9 RUN-TIME ENVIRONMENT: Source Language Issues
4.10 Storage Organization
4.11 Storage Allocation
4.12 Parameter Passing
4.13 Symbol Tables
4.14 Dynamic Storage Allocation
4.15 Storage Allocation in FORTRAN

UNIT V CODE OPTIMIZATION AND CODE GENERATION
5.1 Principal Sources of Optimization
5.2 DAG
5.3 Optimization of Basic Blocks
5.4 Global Data Flow Analysis
5.5 Efficient Data Flow Algorithms
5.6 Issues in Design of a Code Generator
5.7 A Simple Code Generator Algorithm


UNIT I INTRODUCTION TO COMPILERS

1.1 TRANSLATORS

A translator is a program that takes a program in one form (the input) and converts it into another form (the output). The input program is written in the source language and the output program in the target language.

The source language can be a low-level language such as assembly language, or a high-level language such as C, C++, Java, FORTRAN, and so on.

The target language can be a low-level language (assembly language) or a machine language (the set of instructions executed directly by a CPU).

Figure 1.1: Translator

Types of Translators are:

(1). Compilers

(2). Interpreters

(3). Assemblers

1.2 COMPILATION AND INTERPRETATION

A compiler is a program that reads a program in one language and translates it into an

equivalent program in another language. The translation done by a compiler is called compilation.

An interpreter is another common kind of language processor. Instead of producing a target

program as a translation, an interpreter appears to directly execute the operations specified in the

source program on inputs supplied by the user. An interpreter executes the source program

statement by statement. The translation done by an interpreter is called Interpretation.

1.3 LANGUAGE PROCESSORS

(i) Compiler

A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language). Compilation is shown in Figure 1.2.

Figure 1.2: A Compiler

An important role of the compiler is to report any errors in the source program that it detects

during the translation process.

If the target program is an executable machine-language program, it can then be called by

the user to process inputs and produce outputs.

Figure 1.3: Running the target program



(ii) Interpreter

An interpreter is another common kind of language processor. Instead of producing a target

program as a translation, an interpreter appears to directly execute the operations specified in the

source program on inputs supplied by the user, as shown in Figure 1.4.

Figure 1.4: An interpreter

The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.

Compiler converts the source to target completely, but an interpreter executes the source

program statement by statement. Usually interpreter gives better error diagnostics than a Compiler.

(iii) Hybrid Compiler

A hybrid compiler combines compilation and interpretation. Java language processors combine compilation and interpretation, as shown in Figure 1.5.

A Java source program is first compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine.

A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted

on another machine.

Figure 1.5: A hybrid compiler

In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time compilers, translate the bytecodes into machine language immediately before they run.

(iv) Language processing system

In addition to a compiler, several other programs may be required to create an executable

target program, as shown in Figure 1.6.

Preprocessor: The preprocessor collects the source program, which may be divided into modules stored in separate files. The preprocessor may also expand shorthands, called macros, into source-language statements, e.g. #include <math.h>, #define PI 3.14. (A small example follows the note after Figure 1.6.)

Compiler: The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and easier to debug.

Assembler: The assembly language is then processed by a program called an assembler that

produces relocatable machine code as its output.



Linker: The linker resolves external memory addresses, where the code in one file may refer to a

location in another file. Large programs are often compiled in pieces, so the relocatable machine

code may have to be linked together with other relocatable object files and library files into the

code that actually runs on the machine.

Loader: The loader then puts together all of the executable object files into memory for execution.

It also performs relocation of an object code.

Figure 1.6: A language-processing system

Note: Preprocessors, Assemblers, Linkers and Loader are collectively called cousins of compiler.
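To make the preprocessor's role concrete, here is a small sketch; the file name and contents are invented for this illustration, and the trailing comment only suggests what the "modified source program" conceptually looks like after preprocessing.

/* area.c : a made-up fragment given to the preprocessor */
#include <math.h>          /* the header's contents are copied into the source */
#define PI 3.14            /* macro shorthand */

double area(double r) {
    return PI * r * r;     /* PI is expanded textually by the preprocessor */
}

/* Conceptual output of the preprocessor (the modified source program):
     ...declarations copied from math.h...
     double area(double r) {
         return 3.14 * r * r;
     }
*/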

1.4 THE PHASES OF COMPILER / STRUCTURE OF COMPILER

The process of compilation is carried out in two parts: analysis and synthesis. The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.

It then uses this structure to create an intermediate representation of the source program.

The analysis part also collects information about the source program and stores it in a data structure

called a symbol table, which is passed along with the intermediate representation to the synthesis

part.

The analysis part is carried out in three phases: lexical analysis, syntax analysis and semantic analysis. The analysis part is often called the front end of the compiler. The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.

The synthesis part is carried out in three phases: intermediate code generation, code optimization and code generation. The synthesis part is called the back end of the compiler.


Figure 1.7: Phases of a compiler

1.4.1 Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning or linear analysis. The

lexical analyzer reads the stream of characters making up the source program and groups the

characters into meaningful sequences called lexemes.

For each lexeme, the lexical analyzer produces as output a token of the form

<token-name, attribute-value>

The first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.

For example, suppose a source program contains the assignment statement

position = initial + rate * 60 (1.1)


Figure 1.8: Translation of an assignment statement

The characters in this assignment could be grouped into the following lexemes and mapped into the

following tokens.

(1) position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.

(2) The assignment symbol = is a lexeme that is mapped into the token <=>.

(3) initial is a lexeme that is mapped into the token <id, 2>.

(4) + is a lexeme that is mapped into the token <+>.

(5) rate is a lexeme that is mapped into the token <id, 3>.

(6) * is a lexeme that is mapped into the token <*>.

(7) 60 is a lexeme that is mapped into the token <60>.

Blanks separating the lexemes would be discarded by the lexical analyzer. The following sequence of tokens is produced after lexical analysis.

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> (1.2)
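One possible way a front end might represent this token stream internally is sketched below in C++; the enum and struct names (TokenName, Token, attr) are illustrative choices, not something prescribed by the text.

#include <iostream>
#include <vector>

enum TokenName { ID, ASSIGN, PLUS, TIMES, NUMBER };

struct Token {
    TokenName name;   // abstract symbol used during syntax analysis
    int attr;         // symbol-table index for ID, value for NUMBER, unused otherwise
};

int main() {
    // token stream (1.2) for: position = initial + rate * 60
    std::vector<Token> tokens = {
        {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0}, {ID, 3}, {TIMES, 0}, {NUMBER, 60}
    };
    std::cout << tokens.size() << " tokens" << std::endl;
    return 0;
}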


1.4.2 Syntax Analysis

The second phase of the compiler is syntax analysis or parsing or hierarchical analysis.

The parser uses the first components of the tokens produced by the lexical analyzer to create

a tree-like intermediate representation that depicts the grammatical structure of the token stream.

The hierarchical tree structure generated in this phase is called parse tree or syntax tree.

In a syntax tree, each interior node represents an operation and the children of the node

represent the arguments of the operation.

Figure 1.9: Syntax tree for position = initial + rate * 60

The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right child. The node <id, 3> represents the identifier rate; similarly, <id, 2> and <id, 1> represent initial and position. The root of the tree, labeled =, indicates that we must store the result of the addition into the location for the identifier position.

1.4.3 Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check

the source program for semantic consistency with the language definition.

It ensures the correctness of the program; matching of parentheses is also checked in this phase.

It also gathers type information and saves it in either the syntax tree or the symbol table, for

subsequent use during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that

each operator has matching operands.

The compiler must report an error if, for example, a floating-point number is used to index an array. The language specification may permit some type conversions, such as converting an integer to float for a floating-point addition; such a conversion is called a coercion.

In the example, the operator * is applied to the floating-point number rate and the integer 60. The semantic analyzer converts the integer into a floating-point number by inserting the operator inttofloat, as shown in the figure.

Figure 1.10: Semantic tree for position = initial + rate * 60

1.4.4 Intermediate Code Generation

After syntax and semantic analysis of the source program, many compilers generate an

explicit low-level or machine-like intermediate representation.

The intermediate representation should have two important properties: (a) it should be easy to produce, and (b) it should be easy to translate into the target machine.


Three-address code is one of the intermediate representations, which consists of a sequence

of assembly-like instructions with three operands per instruction. Each operand can act like a

register.

The output of the intermediate code generator in Figure 1.8 consists of the three-address code

sequence for position = initial + rate * 60

t1 = inttofloat(60)

t2 = id3 * t1

t3 = id2 + t2

id1 = t3 (1.3)
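Such three-address instructions are commonly held in a data structure such as a list of quadruples (operator, argument 1, argument 2, result). A possible C++ sketch of sequence (1.3), with invented type and field names, is shown below.

#include <iostream>
#include <string>
#include <vector>

struct Quad {                        // one three-address instruction
    std::string op, arg1, arg2, result;
};

int main() {
    // sequence (1.3) for position = initial + rate * 60
    std::vector<Quad> code = {
        {"inttofloat", "60",  "",   "t1"},
        {"*",          "id3", "t1", "t2"},
        {"+",          "id2", "t2", "t3"},
        {"=",          "t3",  "",   "id1"}
    };
    for (const Quad &q : code)
        std::cout << q.result << " = " << q.op << " " << q.arg1 << " " << q.arg2 << std::endl;
    return 0;
}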

1.4.5 Code Optimization

The machine-independent code-optimization phase attempts to improve the intermediate

code so that better target code will result. Usually better means faster.

Optimization has to improve the efficiency of code so that the target program running time

and consumption of memory can be reduced.

The optimizer can deduce that the conversion of 60 from integer to floating point can be

done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the

integer 60 by the floating-point number 60.0.

Moreover, t3 is used only once to transmit its value to id1 so the optimizer can transform

(1.3) into the shorter sequence

t1 = id3 * 60.0

id1 = id2 + t1 (1.4)

1.4.6 Code Generation

The code generator takes as input an intermediate representation of the source program and

maps it into the target language.

If the target language is machine code, then the registers or memory locations are selected

for each of the variables used by the program.

The intermediate instructions are translated into sequences of machine instructions.

For example, using registers R1 and R2, the intermediate code in (1.4) might get translated

into the machine code

LDF R2, id3

MULF R2, R2, #60.0

LDF R1, id2

ADDF R1, R1, R2

STF id1, R1 (1.5)

The first operand of each instruction specifies a destination. The F in each instruction tells

us that it deals with floating-point numbers.

The code in (1.5) loads the contents of address id3 into register R2, then multiplies it with

floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The

third instruction moves id2 into register R1 and the fourth adds to it the value previously computed

in register R2. Finally, the value in register R1 is stored into the address of id1, so the code

correctly implements the assignment statement (1.1).


1.4.7 Symbol-Table Management

The symbol table, which stores information about the entire source program, is used by

all phases of the compiler.

An essential function of a compiler is to record the variable names used in the source

program and collect information about various attributes of each name.

These attributes may provide information about the storage allocated for a name, its

type, its scope.

In the case of procedure names, such things as the number and types of its arguments,

the method of passing each argument (for example, by value or by reference), and the

type returned are maintained in symbol table.

The symbol table is a data structure containing a record for each variable name, with

fields for the attributes of the name. The data structure should be designed to allow the

compiler to find the record for each name quickly and to store or retrieve data from that

record quickly.

A symbol table can be implemented in one of the following ways:

o Linear (sorted or unsorted) list

o Binary Search Tree

o Hash table

Among these, symbol tables are mostly implemented as hash tables, where the source-code symbol itself is used as the key of the hash function and the associated value is the information about the symbol (a small sketch follows the list below).

A symbol table may serve the following purposes depending upon the language in hand:

o To store the names of all entities in a structured form at one place.

o To verify if a variable has been declared.

o To implement type checking, by verifying assignments and expressions.

o To determine the scope of a name (scope resolution).
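A minimal C++ sketch of such a hash-table symbol table, using the standard library; the SymbolInfo fields are only examples of attributes a compiler might record, not a prescribed layout.

#include <iostream>
#include <string>
#include <unordered_map>

struct SymbolInfo {
    std::string type;    // e.g. "float"
    int scope;           // block nesting level
    int offset;          // storage allocated for the name
};

int main() {
    // the lexeme itself is the key of the hash function
    std::unordered_map<std::string, SymbolInfo> symtab;
    symtab["position"] = {"float", 0, 0};            // store the attributes of a name
    std::cout << symtab.count("rate") << std::endl;  // verify whether a name was declared (prints 0)
    return 0;
}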

1.5 ERRORS ENCOUNTERED IN DIFFERENT PHASES

An important role of the compiler is to report any errors in the source program that it

detects during the entire translation process.

Each phase of the compiler can encounter errors; after an error is detected, it must be handled so that the compilation process can proceed. The syntax and semantic analysis phases handle a large number of the errors found during compilation.

The error handler deals with all types of errors: lexical errors, syntax errors, semantic errors and logical errors.

Lexical errors:

The lexical analyzer detects errors in the input characters, such as the name of a keyword or identifier typed incorrectly. Example: switch is written as swich.

Syntax errors:

Syntax errors are detected by the syntax analyzer, for example a missing semicolon or unbalanced parentheses. Example: ((a+b* (c-d)); here a ) is missing after b.

Semantic errors:

Data-type mismatch errors are handled by the semantic analyzer, such as the assignment of an incompatible data-type value. Example: assigning a string value to an integer variable.

Logical errors:

Examples are unreachable code, infinite loops, misuse of operators, and code written after the end of the main() block.


1.6 THE GROUPING OF PHASES

The phases described so far deal with the logical organization of a compiler.

Activities of several phases may be grouped together into a pass that reads an input

file and writes an output file.

The front-end phases of lexical analysis, syntax analysis, semantic analysis, and

intermediate code generation might be grouped together into one pass.

Code optimization might be an optional pass.

A back-end pass consists of code generation for a particular target machine.

Figure 1.11: The Grouping of Phases of compiler

Some compiler collections have been created around carefully designed intermediate

representations that allow the front end for a particular language to interface with the back end for a

certain target machine.

Advantages:

With these collections, we can produce compilers for different source languages for one target machine by combining different front ends with that back end. Similarly, we can produce compilers for one source language on different target machines by combining the front end with back ends for different target machines.

In Figure 1.11, the front end (lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator) is source-language dependent but machine independent, while the back end (an optional code optimizer followed by the code generator) is machine dependent but source-language independent; the two halves communicate through the intermediate code, taking the source program as input and producing the target program as output.


1.7 COMPILER CONSTRUCTION TOOLS

The compiler writer, like any software developer, can profitably use modern software

development environments containing tools such as language editors, debuggers, version managers,

profilers, test harnesses, and so on.

Writing a compiler is a tedious and time consuming task; there are some specialized tools to

implement various phases of a compiler. These tools are called Compiler Construction Tools.

Some commonly used compiler-construction tools are given below:

Scanner generators [Lexical Analysis]

Parser generators [Syntax Analysis]

Syntax-directed translation engines [Intermediate Code]

Data-flow analysis engines [Code Optimization]

Code-generator generators [Code Generation]

Compiler-construction toolkits [For all phases]

1. Scanner generators that produce lexical analyzers from a regular-expression description of

the tokens of a language. Unix has a tool for Scanner generator called LEX.

2. Parser generators that automatically produce syntax analyzers (parse tree) from a

grammatical description of a programming language. Unix has a tool called YACC which is

a parser generator.

3. Syntax-directed translation engines that produce collections of routines for walking a parse

tree and generating intermediate code.

4. Data-flow analysis engines that facilitate the gathering of information about how values are

transmitted from one part of a program to each other part. Data-flow analysis is a key part

of code optimization.

5. Code-generator generators that produce a code generator from a collection of rules for

translating each operation of the intermediate language into the machine language for a

target machine.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing

various phases of a compiler.

1.8 PROGRAMMING LANGUAGE BASICS.

To design an efficient compiler we should know some language basics. Important concepts

from popular programming languages like C, C++, C#, and Java are listed below.

Some of the Programming Language basics which are used in most of the languages are

listed below. They are:

The Static/Dynamic Distinction

Environments and States

Static Scope and Block Structure

Explicit Access Control

Dynamic Scope

Parameter Passing Mechanisms

Aliasing


1.8.1 The Static/Dynamic Distinction

If an issue can be decided at compile time, the language is said to use a static policy for that issue. On the other hand, a policy that only allows a decision to be made when we execute the program is said to be a dynamic policy, or to require a decision at run time.

The scope of a declaration of x is the region of the program in which uses of x refer to this

declaration. A language uses static scope or lexical scope if it is possible to determine the scope of

a declaration by looking only at the program. Otherwise, the language uses dynamic scope. With

dynamic scope, as the program runs, the same use of x could refer to any of several different

declarations of x.

Example: consider the use of the term "static" as it applies to data in a Java class declaration. In

Java, a variable is a name for a location in memory used to hold a data value. Here, "static" refers

not to the scope of the variable, but rather to the ability of the compiler to determine the location in

memory where the declared variable can be found. A declaration like

public static int x;

makes x a class variable and says that there is only one copy of x, no matter how many

objects of this class are created. Moreover, the compiler can determine a location in memory where

this integer x will be held. In contrast, had "static" been omitted from this declaration, then each

object of the class would have its own location where x would be held, and the compiler could not

determine all these places in advance of running the program.

1.8.2 Environments and States

As a program runs, the values of data elements change, or the interpretation of names for that data changes. For example, the execution of an assignment

such as x = y + 1 changes the value denoted by the name x. More specifically, the assignment

changes the value in whatever location is denoted by x.

The location denoted by x can change at run time. If x is not a static (or "class") variable,

then every object of the class has its own location for an instance of variable x. In that case, the

assignment to x can change any of those "instance" variables, depending on the object to which a

method containing that assignment is applied.

names --(environment)--> locations (variables) --(state)--> values

The association of names with locations in memory (the store) and then with values can be

described by two mappings that change as the program runs:

1. The environment is a mapping from names to locations in the store. Since variables

refer to locations ("l-values" in the terminology of C), we could alternatively define an

environment as a mapping from names to variables.

2. The state is a mapping from locations in store to their values. That is, the state maps

1-values to their corresponding r-values, in the terminology of C.

Environments change according to the scope rules of a language.

Example: Consider the C program fragment below. The integer i is declared as a global variable and also as a variable local to function f. When f is executing, the environment adjusts so that name

i refers to the location reserved for the i that is local to f, and any use of i, such as the assignment

i = 3 shown explicitly, refers to that location.


Typically, the local i is given a place on the run-time stack.

int i; /* global i */

...

void f(..) {

int i; /* local i */

i=3; /* use of local i */

}

x=i+1; /* use of global i */

Whenever a function g other than f is executing, uses of i cannot refer to the i that is local to

f. Uses of name i in g must be within the scope of some other declaration of i. An example is the

explicitly shown statement x = i+l, which is inside some procedure whose definition is not shown.

The i in i + 1 presumably refers to the global i.

1.8.3 Static Scope and Block Structure

The scope rules for C are based on program structure; the scope of a declaration is

determined implicitly by where the declaration appears in the program. Later languages, such as

C++, Java, and C#, also provide explicit control over scopes through the use of keywords like

public, private, and protected.

A block is a grouping of declarations and statements. C uses braces { and } to delimit a

block; some languages use begin and end instead.

Example: The C++ program in Figure 1.12 has four blocks, with several definitions of variables a and b. As a memory aid, each declaration initializes its variable to the number of the block to which it belongs.

Output

3 2

1 4

1 2

1 1

Figure 1.12: Blocks in a C++ program
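The program of Figure 1.12 is not reproduced in this copy. A C++ program consistent with the description (each declaration initializes its variable to the number of its block) and with the output shown above might look like the following sketch; the block labels B1 to B4 are given as comments.

#include <iostream>
using namespace std;

int main() {
    int a = 1;  int b = 1;                    // block B1
    {
        int b = 2;                            // block B2
        {
            int a = 3;                        // block B3
            cout << a << " " << b << endl;    // prints 3 2
        }
        {
            int b = 4;                        // block B4
            cout << a << " " << b << endl;    // prints 1 4
        }
        cout << a << " " << b << endl;        // prints 1 2
    }
    cout << a << " " << b << endl;            // prints 1 1
    return 0;
}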


Consider the declaration int a = 1 in block B1. Its scope is all of B1, except for those blocks

nested within B1 that have their own declaration of a. B2, nested immediately within B1, does not

have a declaration of a, but B3 does. B4 does not have a declaration of a, so block B3 is the only

place in the entire program that is outside the scope of the declaration of the name a that belongs to

B1. That is, this scope includes B4 and all of B2 except for the part of B2 that is within B3. The

scopes of all five declarations are summarized in Figure 1.13.

Figure 1.13: Scopes of declarations

1.8.4 Explicit Access Control

Classes and structures introduce a new scope for their members. If p is an object of a class

with a field (member) x, then the use of x in p.x refers to field x in the class definition. the scope of

a member declaration x in a class C extends to any subclass C', except if C' has a local declaration

of the same name x.

Through the use of keywords like public, private, and protected, object oriented

languages such as C++ or Java provide explicit control over access to member names in a super

class. These keywords support encapsulation by restricting access.

Thus, private names are purposely given a scope that includes only the method declarations

and definitions associated with that class and any "friend" classes (the C++ term). Protected names

are accessible to subclasses. Public names are accessible from outside the class.

1.8.5 Dynamic Scope

Technically, any scoping policy is dynamic if it is based on factor(s) that can be known only

when the program executes. The term dynamic scope, however, usually refers to the following

policy: a use of a name x refers to the declaration of x in the most recently called procedure with

such a declaration.

Dynamic scoping of this type appears only in special situations.

We shall consider two examples of dynamic policies: macro expansion in the C

preprocessor and method resolution in object-oriented programming.

Example: In the C program below, identifier a is a macro that stands for the expression (x + 1). But we cannot resolve x statically, that is, in terms of the program text.

#define a (x+1)

int x = 2;

void b() { int x = 1; printf("%d\n", a); }

void c() { printf("%d\n", a); }

void main() { b(); c(); }

In fact, in order to interpret x, we must use the usual dynamic-scope rule. The function main first calls function b. As b executes, it prints the value of the macro a. Since (x + 1) must be substituted for a, we resolve this use of x to the declaration int x = 1 in function b. The reason is that b has a declaration of x, so the (x + 1) in the printf in b refers to this x. Thus, the value printed is 1.


After b finishes, and c is called, we again need to print the value of macro a. However, the

only x accessible to c is the global x. The printf statement in c thus refers to this declaration of x,

and value 2 is printed.

1.8.6 Parameter Passing Mechanisms

All programming languages have a notion of a procedure, but they can differ in how these

procedures get their arguments. The actual parameters (the parameters used in the call of a

procedure) are associated with the formal parameters (those used in the procedure definition).

In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a

variable). The value is placed in the location belonging to the corresponding formal parameter of

the called procedure. This method is used in C and Java.

In call-by-reference, the address of the actual parameter is passed to the callee as the value

of the corresponding formal parameter. Uses of the formal parameter in the code of the callee are

implemented by following this pointer to the location indicated by the caller. Changes to the formal

parameter thus appear as changes to the actual parameter.

A third mechanism call-by-name was used in the early programming language Algol 60. It

requires that the callee execute as if the actual parameter were substituted literally for the formal

parameter in the code of the callee, as if the formal parameter were a macro standing for the actual

parameter.
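The difference between the first two mechanisms can be seen in a small C++ sketch; C++ references make call-by-reference explicit, and the function and variable names here are only illustrative.

#include <iostream>

void byValue(int x)      { x = x + 1; }   // operates on a copy of the actual parameter
void byReference(int &x) { x = x + 1; }   // operates on the caller's location

int main() {
    int a = 5;
    byValue(a);                       // a is still 5
    byReference(a);                   // a becomes 6
    std::cout << a << std::endl;      // prints 6
    return 0;
}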

1.8.7 Aliasing

There is an interesting consequence of call-by-reference parameter passing or its simulation,

as in Java, where references to objects are passed by value. It is possible that two formal

parameters can refer to the same location; such variables are said to be aliases of one another. As

a result, any two variables, which may appear to take their values from two distinct formal

parameters, can become aliases of each other.

Example: Suppose a is an array belonging to a procedure p, and p calls another procedure q(x, y)

with a call q(a, a). Suppose also that parameters are passed by value, but that array names are really

references to the location where the array is stored, as in C or similar languages. Now, x and y have

become aliases of each other. The important point is that if within q there is an assignment

x [10] = 2, then the value of y[10] also becomes 2.
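The situation described above can be sketched in C++, where an array parameter behaves like a pointer to the caller's array, so x and y below end up naming the same storage; the names p, q and a follow the text, everything else is illustrative.

#include <iostream>

void q(int x[], int y[]) {             // x and y may refer to the same location
    x[10] = 2;
    std::cout << y[10] << std::endl;   // prints 2 when x and y are aliases
}

int main() {
    int a[20] = {0};                   // the array a belonging to the caller p
    q(a, a);                           // x and y become aliases of each other
    return 0;
}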


UNIT II LEXICAL ANALYSIS

2.1 NEED AND ROLE OF LEXICAL ANALYZER

Lexical analysis is the first phase of a compiler. It reads the input characters of the source program from left to right, one character at a time.

It generates a sequence of tokens, one for each lexeme. Each token is a logically cohesive unit such as an identifier, keyword, operator or punctuation mark. The lexical analyzer needs to enter lexemes into the symbol table and also read information back from the symbol table.

These interactions are suggested in Figure 2.1.

Figure 2.1: Interactions between the lexical analyzer and the parser

Since the lexical analyzer is the part of the compiler that reads the source text, it may

perform certain other tasks besides identification of lexemes. One such task is stripping out

comments and whitespace (blank, newline, tab). Another task is correlating error messages

generated by the compiler with the source program.

Needs / Roles / Functions of lexical analyzer

It produces stream of tokens.

It eliminates comments and whitespace.

It keeps track of line numbers.

It reports the error encountered while generating tokens.

It stores information about identifiers, keywords, constants and so on into symbol table.

The work of a lexical analyzer is divided into two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such

as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis is the more complex portion, where the scanner produces the sequence of

tokens as output.

Lexical Analysis versus Parsing / Issues in Lexical analysis

1. Simplicity of design: This is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks; for example, whitespace and comments are removed by the lexical analyzer.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply

specialized techniques that serve only the lexical task, not the job of parsing. In addition,

specialized buffering techniques for reading input characters can speed up the compiler

significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to

the lexical analyzer.

Tokens, Patterns, and Lexemes

A token is a pair consisting of a token name and an optional attribute value. The token name

is an abstract symbol representing a kind of single lexical unit, e.g., a particular keyword, or a


sequence of input characters denoting an identifier. Operators, special symbols and constants are

also typical tokens.

A pattern is a description of the form that the lexemes of a token may take. Pattern is set of

rules that describe the token. A lexeme is a sequence of characters in the source program that

matches the pattern for a token.

Table 2.1: Tokens and Lexemes

TOKEN        INFORMAL DESCRIPTION (PATTERN)             SAMPLE LEXEMES
if           characters i, f                            if
else         characters e, l, s, e                      else
comparison   < or > or <= or >= or == or !=             <=, !=
id           letter followed by letters and digits      pi, score, D2, sum, id_1, AVG
number       any numeric constant                       35, 3.14159, 0, 6.02e23
literal      anything surrounded by " "                 "Core", "Design", "Appasami"

In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.

2. Tokens for the operators, either individually or in classes such as the token comparison

mentioned in table 2.1.

3. One token representing all identifiers.

4. One or more tokens representing constants, such as numbers and literal strings.

5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and

semicolon

Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must provide the

subsequent compiler phases additional information about the particular lexeme that matched.

The lexical analyzer returns to the parser not only a token name, but an attribute value that

describes the lexeme represented by the token.

The token name influences parsing decisions, while the attribute value influences translation

of tokens after the parse.

Information about an identifier - e.g., its lexeme, its type, and the location at which it is first found (in case an error message must be issued) - is kept in the symbol table.

Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry

for that identifier.

Example: The token names and associated attribute values for the Fortran statement E = M

* C ** 2 are written below as a sequence of pairs.

<id, pointer to symbol-table entry for E>

< assign_op >

<id, pointer to symbol-table entry for M>

<mult_op>

<id, pointer to symbol-table entry for C>

<exp_op>

<number, integer value 2 >

Note that in certain pairs, especially operators, punctuation, and keywords, there is no need

for an attribute value. In this example, the token number has been given an integer-valued attribute.


2.2 LEXICAL ERRORS

It is hard for a lexical analyzer to tell that there is a source-code error without the aid of

other components.

Consider a C program statement fi ( a == f(x)). The lexical analyzer cannot tell whether fi

is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for

the token id, the lexical analyzer must return the token id to the parser.

However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery.

We delete successive characters from the remaining input, until the lexical analyzer can find

a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.

Transformations like these may be tried in an attempt to repair the input. The simplest such

strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by

a single transformation.

In practice most lexical errors involve a single character. A more general correction strategy

is to find the smallest number of transformations needed to convert the source program into one

that consists only of valid lexemes.

2.3 EXPRESSING TOKENS BY REGULAR EXPRESSIONS

Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. Although they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

Strings and Languages

An alphabet is any finite set of symbols. Examples of symbols are letters, digits, and

punctuation. The set {0,1} is the binary alphabet. ASCII is an important example of an alphabet.

A string (sentence or word) over an alphabet is a finite sequence of symbols drawn from

that alphabet. The length of a string s, usually written |s|, is the number of occurrences of symbols

in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of

length zero.

A language is any countable set of strings over some fixed alphabet. Abstract languages

like Φ, the empty set, or { ε }, the set containing only the empty string, are languages under this

definition.

Parts of Strings:

1. A prefix of string s is any string obtained by removing zero or more symbols from the

end of s. For example, ban, banana, and ε are prefixes of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols from the

beginning of s. For example, nana, banana, and ε are suffixes of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance,

banana, nan, and ε are substrings of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those, prefixes, suffixes,

and substrings, respectively, of s that are not ε or not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily

consecutive positions of s. For example, baan is a subsequence of banana.


6. If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed

by appending y to x.

Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation,

and closure, which are defined in table 2.2.

Table 2.2: Definitions of operations on languages
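The table itself is not reproduced in this copy; the standard definitions of these operations, for languages L and M, are:

Union:            L U M = { s | s is in L or s is in M }
Concatenation:    LM = { st | s is in L and t is in M }
Kleene closure:   L* = the union of L^i for all i >= 0 (zero or more concatenations of L)
Positive closure: L+ = the union of L^i for all i >= 1 (one or more concatenations of L)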

Example: Let L be the set of letters {A, B, …, Z, a, b, …, z} and let D be the set of digits {0, 1, …, 9}. Other languages that can be constructed from the languages L and D include:

1. L U D is the set of letters and digits - strictly speaking the language with 62 strings of

length one, each of which strings is either one letter or one digit.

2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.

3. L4 is the set of all 4-letter strings.

4. L* is the set of all strings of letters, including ε, the empty string.

5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.

6. D+ is the set of all strings of one or more digits.

Regular expression

Regular expression can be defined as a sequence of symbols and characters expressing a

string or pattern to be searched.

Regular expressions are mathematical representation which describes the set of strings of

specific language.

The regular expression for identifiers can be written as letter_ ( letter_ | digit )*. The vertical bar

means union, the parentheses are used to group subexpressions, and the star means "zero or more

occurrences of".

Each regular expression r denotes a language L(r), which is also defined recursively from

the languages denoted by r's subexpressions.

The rules that define the regular expressions over some alphabet Σ are as follows.

Basis rules:

1. ε is a regular expression, and L(ε) is { ε }.

2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string of length one.

Induction rules: Suppose r and s are regular expressions denoting languages L(r) and L(s),

respectively.

1. (r) | (s) is a regular expression denoting the language L(r) U L(s).

2. (r) (s) is a regular expression denoting the language L(r) L(s) .

3. (r) * is a regular expression denoting (L (r)) * .

4. (r) is a regular expression denoting L(r); i.e., additional pairs of parentheses around an expression do not change the language it denotes.

Example: Let Σ = {a, b}.


Regular expression   Language                             Meaning
a|b                  {a, b}                               single 'a' or 'b'
(a|b)(a|b)           {aa, ab, ba, bb}                     all strings of length two over the alphabet Σ
a*                   {ε, a, aa, aaa, …}                   all strings of zero or more a's
(a|b)*               {ε, a, b, aa, ab, ba, bb, aaa, …}    all strings consisting of zero or more instances of a or b
a|a*b                {a, b, ab, aab, aaab, …}             the string a and all strings of zero or more a's ending in b

A language that can be defined by a regular expression is called a regular set. If two regular

expressions r and s denote the same regular set, we say they are equivalent and write r = s. For

instance, (a|b) = (b|a), (a|b)*= (a*b*)*, (b|a)*= (a|b)*, (a|b) (b|a) =aa|ab|ba|bb.

Algebraic laws

Algebraic laws that hold for arbitrary regular expressions r, s, and t:

LAW                               DESCRIPTION
r|s = s|r                         | is commutative
r|(s|t) = (r|s)|t                 | is associative
r(st) = (rs)t                     Concatenation is associative
r(s|t) = rs|rt;  (s|t)r = sr|tr   Concatenation distributes over |
εr = rε = r                       ε is the identity for concatenation
r* = (r|ε)*                       ε is guaranteed in a closure
r** = r*                          * is idempotent

Extensions of Regular Expressions

Few notational extensions that were first incorporated into Unix utilities such as Lex that

are particularly useful in the specification lexical analyzers.

1. One or more instances: The unary postfix operator + represents the positive closure of a regular expression and its language. If r is a regular expression, then (r)+ denotes the language (L(r))+. Two useful algebraic laws are r* = r+|ε and r+ = rr* = r*r.

2. Zero or one instance: The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r|ε, so L(r?) = L(r) U {ε}.

3. Character classes: A regular expression a1|a2|…|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2…an]. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|…|z.

Example: Regular definition for a C identifier

letter_   [A-Za-z_]
digit     [0-9]
id        letter_ ( letter_ | digit )*

Example: Regular definition for an unsigned number
digit     [0-9]
digits    digit+
number    digits ( . digits )? ( E [+ -]? digits )?

Note: The operators *, +, and ? have the same precedence and associativity.
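As a quick check of these regular definitions, the following C++ sketch uses the standard <regex> library; the pattern strings are hand-translations of the definitions above (an assumed encoding, not part of the original text).

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex id("[A-Za-z_][A-Za-z_0-9]*");                       // letter_ ( letter_ | digit )*
    std::regex number("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");        // digits ( . digits )? ( E [+ -]? digits )?
    std::cout << std::regex_match("id_1", id) << std::endl;        // 1 (valid identifier)
    std::cout << std::regex_match("2fast", id) << std::endl;       // 0 (starts with a digit)
    std::cout << std::regex_match("6.02E23", number) << std::endl; // 1 (valid unsigned number)
    return 0;
}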


2.4 CONVERTING REGULAR EXPRESSION TO DFA

To construct a DFA directly from a regular expression, we construct its syntax tree and then

compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each

definition refers to the syntax tree for a particular augmented regular expression (r)#.

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented

by n has ε in its language. That is, the subexpression can be "made null" or the empty

string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first

symbol of at least one string in the language of the subexpression rooted at n.

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last

symbol of at least one string in the language of the subexpression rooted at n.

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that

there is some string x = a1a2 …an in L((r)#) such that for some i, there is a way to

explain the membership of x in L((r)#) by matching ai to position p of the syntax tree

and ai+1 to position q.

We can compute nullable, firstpos, and lastpos by a straightforward recursion on the height

of the tree. The basis and inductive rules for nullable and firstpos are summarized in table.

The rules for lastpos are essentially the same as for firstpos, but the roles of children c1 and

c2 must be swapped in the rule for a cat-node.

There are only two rules for computing followpos.

1. If n is a cat-node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).

2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).

Converting a Regular Expression Directly to a DFA

Algorithm: Construction of a DFA from a regular expression r.

INPUT: A regular expression r.

OUTPUT: A DFA D that recognizes L(r).

METHOD:

1. Construct a syntax tree T from the augmented regular expression (r)#.

2. Compute nullable, firstpos, lastpos, and followpos for T.

3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D, by the procedure given below.


The states of D are sets of positions in T. Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions. The start state of D is firstpos(n0), where node n0 is the root of T. The accepting states are those containing the position for the endmarker symbol #.

Example: Construct a DFA for the regular expression r = (a|b)*abb

Figure 2.2: Syntax tree for (a|b)*abb#

Figure 2.3 : firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#

initialize Dstates to contain only the unmarked state firstpos(n0),

where n0 is the root of syntax tree T for (r)#;

while ( there is an unmarked state S in Dstates )

{

mark S;

for ( each input symbol a )

{

let U be the union of followpos(p) for all p in S that correspond to a;

if ( U is not in Dstates )

add U as an unmarked state to Dstates;

Dtran[S, a] = U

}

}


We must also apply rule 2 to the star-node. That rule tells us positions 1 and 2 are in both

followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1,2}. The complete

followpos sets are summarized in the table below.

NODE n		followpos(n)
1		{1, 2, 3}
2		{1, 2, 3}
3		{4}
4		{5}
5		{6}
6		{ }

Figure 2.4: Directed graph for the function followpos

nullable is true only for the star-node, and we exhibited firstpos and lastpos in Figure 2.3.

The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D. Call this set

of states A. We must compute Dtran[A, a] and Dtran[A, b]. Among the positions of A, 1 and 3

correspond to a, while 2 corresponds to b. Thus, Dtran[A, a] = followpos(1) U followpos(3) = {1,

2,3,4}, and Dtran[A, b] = followpos(2) = {1,2,3}.

Figure 2.5: DFA constructed for (a|b)*abb#

The latter is state A, and so does not have to be added to Dstates, but the former, B =

{1,2,3,4}, is new, so we add it to Dstates and proceed to compute its transitions. The complete DFA

is shown in Figure 2.5.

Example: Construct an NFA with ε-transitions for (a|b)*abb and convert it to a DFA by subset construction.

Figure 2.6: ε-NFA for (a|b)*abb


Figure 2.7: NFA for (a|b)*abb

Figure 2.8 Result of applying the subset construction to Figure 2.6

2.5 MINIMIZATION OF DFA

There can be many DFA's that recognize the same language. For instance, the DFAs of

Figure 2.5 and 2.8 both recognize the same language L((a|b)*abb).

We would generally prefer a DFA with as few states as possible, since each state requires

entries in the table that describes the lexical analyzer.

Algorithm: Minimizing the number of states of a DFA.

INPUT: A DFA D with set of states S, input alphabet Σ, initial state s0, and set of accepting states

F.

OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.

METHOD:

1. Start with an initial partition Π with two groups, F and S - F, the accepting and

nonaccepting states of D.

2. Apply the following procedure to construct a new partition Πnew:

initially, let Πnew = Π;
for ( each group G of Π ) {
	partition G into subgroups such that two states s and t are in the same subgroup if and only if,
	for all input symbols a, states s and t have transitions on a to states in the same group of Π;
	/* at worst, a state will be in a subgroup by itself */
	replace G in Πnew by the set of all subgroups formed;
}

3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with Πnew in place of Π.

4. Choose one state in each group of Πfinal as the representative for that group. The

representatives will be the states of the minimum-state DFA D'.

5. The other components of D' are constructed as follows:

(a) The start state of D' is the representative of the group containing the start state of D.

(b) The accepting states of D' are the representatives of those groups that contain an

accepting state of D.


(c) Let s be the representative of some group G of Πfinal, and let the transition of D from

s on input a be to state t. Let r be the representative of t's group H. Then in D', there

is a transition from s to r on input a.

Example: Let us reconsider the DFA of Figure 2.8 for minimization.

STATE	a	b
A	B	C
B	B	D
C	B	C
D	B	E
(E)	B	C

The initial partition consists of the two groups {A, B, C, D} {E}, which are respectively the

nonaccepting states and the accepting states.

To construct Πnew, the procedure considers both groups and inputs a and b. The group {E}

cannot be split, because it has only one state, so {E} will remain intact in Πnew.

The other group {A, B, C, D} can be split, so we must consider the effect of each input

symbol. On input a, each of these states goes to state B, so there is no way to distinguish these

states using strings that begin with a. On input b, states A, B, and C go to members of group {A, B,

C, D}, while state D goes to E, a member of another group.

Thus, in Πnew, group {A, B, C, D} is split into {A, B, C}{D}, and Πnew for this round is

{A, B, C}{D}{E}.

In the next round, we can split {A, B, C} into {A, C}{B}, since A and C each go to a

member of {A, B, C} on input b, while B goes to a member of another group, {D}. Thus, after the

second round, Πnew = {A, C} {B} {D} {E}.

For the third round, we cannot split the one remaining group with more than one state, since

A and C each go to the same state (and therefore to the same group) on each input. We conclude

that Πfinal = {A, C}{B}{D}{E}.

Now, we shall construct the minimum-state DFA. It has four states, corresponding to the

four groups of Πfinal, and let us pick A, B, D, and E as the representatives of these groups. The

initial state is A, and the only accepting state is E.

Table : Transition table of minimum-state DFA

STATE	a	b
A	B	A
B	B	D
D	B	E
(E)	B	A
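As an illustration, the following C sketch (an assumption of this presentation, not part of the text) carries out the partition-refinement step on the transition table of the unminimized DFA above, with states A..E encoded as 0..4 and inputs a, b encoded as 0, 1.

#include <stdio.h>
#include <string.h>

#define NSTATES 5
#define NSYMS   2

/* transition table of the DFA of Figure 2.8 (A..E = 0..4, a,b = 0,1) */
int delta[NSTATES][NSYMS] = {
    /* A */ {1, 2},   /* on a -> B, on b -> C */
    /* B */ {1, 3},   /* on a -> B, on b -> D */
    /* C */ {1, 2},
    /* D */ {1, 4},
    /* E */ {1, 2},
};

int group[NSTATES];               /* group[s] = index of the group containing state s */

int main(void) {
    /* initial partition: nonaccepting states in group 0, accepting state E in group 1 */
    int accepting[NSTATES] = {0, 0, 0, 0, 1};
    for (int s = 0; s < NSTATES; s++) group[s] = accepting[s];

    int changed = 1;
    while (changed) {             /* repeat until the partition no longer changes */
        changed = 0;
        int newgroup[NSTATES];
        int ngroups = 0;
        for (int s = 0; s < NSTATES; s++) newgroup[s] = -1;
        for (int s = 0; s < NSTATES; s++) {
            if (newgroup[s] != -1) continue;
            newgroup[s] = ngroups;
            for (int t = s + 1; t < NSTATES; t++) {
                if (newgroup[t] != -1 || group[t] != group[s]) continue;
                int same = 1;     /* s and t stay together iff they agree on every input */
                for (int a = 0; a < NSYMS; a++)
                    if (group[delta[s][a]] != group[delta[t][a]]) same = 0;
                if (same) newgroup[t] = ngroups;
            }
            ngroups++;
        }
        if (memcmp(newgroup, group, sizeof group) != 0) changed = 1;
        memcpy(group, newgroup, sizeof group);
    }
    for (int s = 0; s < NSTATES; s++)
        printf("state %c is in group %d\n", 'A' + s, group[s]);
    return 0;
}

It converges to the four groups {A, C}, {B}, {D}, {E} found above.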

2.6 LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS-LEX

There is a wide range of tools for constructing lexical analyzers based on regular

expressions. Lex is a tool (computer program) that generates lexical analyzers.

In Lex, a lexical analyzer is specified by giving regular expressions that describe the patterns

for tokens. The notation used in these specifications is referred to as the Lex language, and the tool itself is the Lex compiler.

Use of Lex

The Lex compiler transforms the input patterns into a transition diagram and

generates code.


An input file “lex.l” is written in the Lex language and describes the lexical analyzer

to be generated. The Lex compiler transforms “lex.l” to a C program, in a file that is

always named “lex.yy.c”.

The file “lex.yy.c” is compiled by the C compiler into a file “a.out”. The C compiler's

output is a working lexical analyzer that can take a stream

of input characters and produce a stream of tokens.

The attribute value, whether it be another numeric code, a pointer to the symbol

table, or nothing, is placed in a global variable yylval which is shared between the

lexical analyzer and the parser.

Figure 2.9: Creating a lexical analyzer with Lex

Structure of Lex Programs

A Lex program has the following form:

	declarations
	%%
	translation rules
	%%
	auxiliary functions

The declarations section includes declarations of variables, manifest constants (identifiers

declared to stand for a constant, e.g., the name of a token), and regular definitions.

Each translation rule of a Lex program has the form Pattern { Action }:

	P1 { Action A1 }
	P2 { Action A2 }
	…
	Pn { Action An }

Each pattern is a regular expression. The actions are fragments of code typically written in

C language.

The third section holds whatever additional functions are used in the actions. Alternatively,

these functions can be compiled separately and loaded with the

lexical analyzer.

The lexical analyzer begins reading its remaining input, one character at a time, until it finds

the longest prefix of the input that matches one of the patterns Pi. It then executes the associated

action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes

whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one

of the corresponding actions causes a return to the parser. The lexical analyzer returns a single

value, the token name, to the parser, but uses the shared, integer variable yylval to pass additional

information about the lexeme found.

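For illustration, a minimal Lex specification in this layout might look as follows. The token codes, the placeholder actions, and the test driver in the third section are assumptions of this sketch; in a real compiler the token codes would come from the parser (for example a y.tab.h header generated by Yacc).

%{
#include <stdio.h>
#include <stdlib.h>
/* token codes and yylval are assumptions of this sketch */
enum { ID = 256, NUMBER, IF };
int yylval;
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+
%%
{ws}        { /* skip blanks, tabs and newlines: no return, so scanning continues */ }
if          { return IF; }
{id}        { yylval = 0; /* e.g. enter the lexeme into the symbol table */ return ID; }
{number}    { yylval = atoi(yytext); return NUMBER; }
%%
int yywrap(void) { return 1; }   /* signal end of input */

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d, lexeme \"%s\"\n", tok, yytext);
    return 0;
}

Running the Lex compiler on such a file produces lex.yy.c, which is then compiled with a C compiler as described above.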


2.7 DESIGN OF LEXICAL ANALYZER FOR A SAMPLE LANGUAGE

A lexical-analyzer generator such as Lex is built around an automaton simulator. The

implementation of the Lex compiler can be based on either an NFA or a DFA.

2.7.1 The Structure of the Generated Analyzer

Figure 2.10 shows the architecture of a lexical analyzer generated by Lex. A Lex program is

converted into a transition table and actions which are used by a finite Automaton simulator.

The program that serves as the lexical analyzer includes a fixed program that simulates an

automaton; the automaton may be deterministic or nondeterministic. The rest of the lexical analyzer

consists of components that are created from the Lex program by Lex itself.

Figure 2.10: A Lex program is turned into a transition table and actions, which are used by a finite-

automaton simulator

These components are:

1. A transition table for the automaton.

2. Those functions that are passed directly through Lex to the output.

3. The actions from the input program, which appear as fragments of code to be invoked at the

appropriate time by the automaton simulator.

2.7.2 Pattern Matching Based on NFA's

To construct the automaton for several regular expressions, we combine all the NFAs

into one by introducing a new start state with ε-transitions to each of the start states of the NFAs Ni

for pattern pi, as shown in Figure 2.11.

Figure 2.11: An NFA constructed from a Lex program

Example: Consider the following patterns and their associated actions:


a { action Al for pattern pl }

abb { action A2 for pattern p2 }

a*b+ { action A3 for pattern p3}

Figure 2.12: NFA's for a, abb, and a*b+

Figure 2.13: Combined NFA

Figure 2.14: Sequence of sets of states entered when processing input aaba

Figure 2.15: Transition graph for DFA handling the patterns a, abb, and a*b+


UNIT III SYNTAX ANALYSIS

3.1 NEED AND ROLE OF THE PARSER

The parser takes the tokens produced by the lexical analyzer and builds the syntax tree (parse tree). The syntax tree can be easily constructed from a context-free grammar.

The parser reports syntax errors in an intelligible fashion and recovers from commonly occurring errors to continue processing the remainder of the program.

Figure 3.1: Position of parser in compiler model

Role of the Parser:

The parser builds the parse tree.

The parser performs context-free syntax analysis.

The parser helps to construct the intermediate code.

The parser produces appropriate error messages.

The parser attempts to correct a few errors.

Types of parsers for grammars:

Universal parsers

Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar. These general methods are, however, too inefficient to be used in production compilers, so they are not commonly used.

Top-down parsers

Top-down methods build parse trees from the top (root) to the bottom (leaves)

Bottom-up parsers.

Bottom-up methods start from the leaves and work their way up to the root.

3.2 CONTEXT FREE GRAMMARS

3.2.1 The Formal Definition of a Context-Free Grammar

A context-free grammar G is defined by the 4-tuple G = (V, T, P, S), where

1. V is a finite set of non-terminals (variables).
2. T is a finite set of terminals.
3. P is a finite set of production rules of the form A → α, where A is a nonterminal and α is a string of terminals and/or nonterminals. P is a relation from V to (V∪T)*.
4. S is the start symbol (a variable, S∈V).


Example 3.1: The following grammar defines simple arithmetic expressions. In this grammar, the terminal symbols are id + - * / ( ). The nonterminal symbols are expression, term and factor, and expression is the start symbol.

expression → expression + term

expression → expression - term

expression → term

term → term * factor

term → term / factor

term → factor

factor → ( expression )

factor → id

3.2.2 Notational Conventions

The following notational conventions for grammars can be used

1. These symbols are terminals: (a) Lowercase letters early in the alphabet, such as a, b, c. (b) Operator symbols such as +, *, and so on. (c) Punctuation symbols such as parentheses, comma, and so on. (d) The digits 0, 1, . . . , 9. (e) Boldface strings such as id or if, each of which represents a single terminal symbol.

2. These symbols are nonterminals: (a) Uppercase letters early in the alphabet, such as A, B, C. (b) The letter S, when it appears, is usually the start symbol. (c) Lowercase, italic names such as expr or stmt. (d) When discussing programming constructs, uppercase letters may be used to represent

nonterminals for the constructs. For example, nonterminals for expressions, terms, and factors are often represented by E, T, and F, respectively.

3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.

4. Lowercase letters late in the alphabet, chiefly u, v, . . . , z, represent (possibly empty) strings of terminals.

5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A → α, where A is the head and α the body.

6. A set of productions A → α1, A → α2, …, A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | … | αk. We call α1, α2, …, αk the alternatives for A.

7. Unless stated otherwise, the head of the first production is the start symbol.

Example 3.2 : Using these conventions, the grammar of Example 3.1 can be rewritten concisely as

E → E + T | E - T | T

T → T * F | T / F | F

F → ( E ) | id


3.2.3 Derivations

A derivation uses productions to generate a string (of terminals). It is formed by repeatedly replacing a nonterminal in the sentential form by the body of one of its productions.

The derivations are classified into two types based on the order of replacement of production. They are:

1. Leftmost derivation

If the leftmost non-terminal is replaced by one of its productions at each step of the derivation, it is called a leftmost derivation.

2. Rightmost derivation

If the rightmost non-terminal is replaced by one of its productions at each step of the derivation, it is called a rightmost derivation.

Example 3.3: LMD and RMD for the string - ( id + id ), using the grammar E → E + E | E * E | - E | ( E ) | id

LMD for - ( id + id )

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )

RMD for - ( id + id )

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( E + id ) ⇒ - ( id + id )

Example 3.4: Consider the context-free grammar (CFG) G = ({S}, {a, b, c}, P, S) where P = {S → SbS | ScS | a}. Derive the string “abaca” by leftmost derivation and rightmost derivation.

Leftmost derivation for “abaca”

S ⇒ SbS ⇒ abS (using rule S → a) ⇒ abScS (using rule S → ScS) ⇒ abacS (using rule S → a) ⇒ abaca (using rule S → a)

Rightmost derivation for “abaca”

S ⇒ ScS ⇒ Sca (using rule S → a) ⇒ SbSca (using rule S → SbS) ⇒ Sbaca (using rule S → a) ⇒ abaca (using rule S → a)

3.2.4 Parse Trees and Derivations

A parse tree is a graphical representation of a derivation. It is a convenient way to see how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree.


Example 3.5: construction of parse tree for - ( id + id )

Derivation:

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )

Parse tree:

Figure 3.2: Parse tree for -(id + id)

3.2.5 Ambiguity

A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence.

A grammar G is said to be ambiguous if it has more than one parse tree either in LMD or in RMD for at least one string.

Example 3.6: The ambiguous expression grammar E → E + E | E * E | ( E ) | id permits two distinct leftmost derivations for the sentence id + id * id:

E ⇒ E + E E ⇒ E * E

⇒ id + E ⇒ E + E * E

⇒ id + E * E ⇒ id + E * E

⇒ id + id * E ⇒ id + id * E

⇒ id + id * id ⇒ id + id * id

Figure 3.3: Two parse trees for id+id*id


3.2.6 Verifying the Language Generated by a Grammar

A proof that a grammar G generates a language L has two parts: show that every string generated by G is in L, and conversely that every string in L can indeed be generated by G.

Example 3.7: Consider the grammar S → ( S ) S | ε. This simple grammar generates all strings of balanced parentheses. To show that every sentence derivable from S is balanced, we use an inductive proof on the number of steps n in a derivation.

BASIS: The basis is n = 1. The only string of terminals derivable from S in one step is the empty string, which surely is balanced.

INDUCTION: Now assume that all derivations of fewer than n steps produce balanced sentences, and consider a leftmost derivation of exactly n steps. Such a derivation must be of the form

S ⇒ ( S ) S ⇒* ( x ) S ⇒* ( x ) y

The derivations of x and y from S take fewer than n steps, so by the inductive hypothesis x and y are balanced. Therefore, the string (x)y must be balanced.

That is, it has an equal number of left and right parentheses, and every prefix has at least as many left parentheses as right.

Having thus shown that any string derivable from S is balanced,

We must next show that every balanced string is derivable from S.

To do so, use induction on the length of a string.

BASIS: If the string is of length 0, it must be ε, which is balanced.

INDUCTION: First, observe that every balanced string has even length. Assume that every balanced string of length less than 2n is derivable from S, and consider a balanced string w of length 2n, n ≥ 1. Surely w begins with a left parenthesis. Let (x) be the shortest nonempty prefix of w having an equal number of left and right parentheses. Then w can be written as w = (x)y where both x and y are balanced. Since x and y are of length less than 2n, they are derivable from S by the inductive hypothesis. Thus, we can find a derivation of the form

S ⇒ ( S ) S ⇒* ( x ) S ⇒* ( x ) y

which proves that w = (x)y is also derivable from S.

3.2.7 Context-Free Grammars versus Regular Expressions

Every regular language is a context-free language, but not vice-versa.

Example 3.8: The grammar for regular expression (a|b)*abb

A aA | bA | aB

B bC

C b

This grammar describes the same language, the set of strings of a's and b's ending in abb. So this language can easily be described either by a finite automaton or by a PDA.

On the other hand, the language L = {a^n b^n | n ≥ 1}, with an equal number of a's and b's, is a prototypical example of a language that can be described by a grammar but not by a regular expression. We can say that "finite automata cannot count", meaning that a finite automaton cannot accept a language like {a^n b^n | n ≥ 1} that would require it to keep count of the number of a's before it sees the b's. These kinds of languages (context-free languages) are accepted by PDAs, since a PDA uses a stack as its memory.


3.2.8 Left recursion

A context-free grammar is said to be left recursive if it has a nonterminal A with productions of the form

A → Aα | β, where α and β are sequences of terminals and nonterminals that do not start with A.

Left recursion can send a top-down parser into an infinite loop. It creates serious problems, so we have to eliminate left recursion.

For example, consider expr → expr + term | term.

Figure 3.4: Left-recursive and right recursive ways of generating a string

ALGORITHM 3.1 Eliminating left recursion.

INPUT: Grammar G with no cycles or ε-productions.

OUTPUT: An equivalent grammar with no left recursion.

METHOD: Apply the algorithm to G. Note that the resulting non-left-recursive grammar

may have ε-productions.

arrange the nonterminals in some order A1, A2, …, An.
for ( each i from 1 to n ) {

for ( each j from 1 to i - 1 ) {

replace each production of the form Ai → Aj γ by the

productions Ai → δ1 γ | δ2 γ | … | δk γ, where

Aj → δ1 | δ2 | … | δk are all current Aj-productions

}

eliminate the immediate left recursion among the Ai-productions

}

Note: Simply replace the immediately left-recursive productions A → Aα | β by A → βA'

A' → αA' | ε

Example 3.9: Consider the grammar for arithmetic expressions.

E → E + T | T

T → T * F | F

F → ( E ) | id


Eliminate the left-recursive productions for E and T by applying the left-recursion elimination algorithm:

If A → Aα | β then A → βA'

A' → αA' | ε

The production E → E + T | T is replaced by

E → T E'

E' → + T E' | ε

The production T → T * F | F is replaced by

T → F T'

T' → * F T' | ε

Therefore, finally we obtain,

E → T E'

E' → + T E' | ε

T → F T'

T' → * F T' | ε

F → ( E ) | id

Example 3.10: Consider the following grammar and eliminate the left-recursive productions.

S → Aa | b

A → Ac | Sd | ε

The nonterminal S has no immediate left recursion, but A → Sd introduces indirect left recursion through S → Aa. To expose it, substitute the S-productions into A → Sd:

A → Ac | Aad | bd | ε

A → Ac | Aad | bd | ε is replaced by

A → bdA' | A'

A' → cA' | adA' | ε

Therefore, finally we obtain the grammar without left recursion:

S → Aa | b

A → bdA' | A'

A' → cA' | adA' | ε

Example 3.11: Consider the grammar

A → ABd | Aa | a

B → Be | b

The grammar without left recursion is

A → aA'

A' → BdA' | aA' | ε

B → bB'

B' → eB' | ε

Example 3.12: Eliminate left recursion from the given grammar: A → Ac | Aad | bd | bc

After removing left recursion, the grammar becomes

A → bdA' | bcA'

A' → cA' | adA' | ε


3.2.9 Left factoring

Left factoring is a process of factoring out the common prefixes of two or more production alternatives for the same nonterminal.

Algorithm 3.2 : Left factoring a grammar.

INPUT: Grammar G.

OUTPUT: An equivalent left-factored grammar.

METHOD: For each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε - i.e., there is a nontrivial common prefix - replace all of the A-productions A → αβ1 | αβ2 | … | αβn | γ, where γ represents all alternatives that do not begin with α, by

A → αA' | γ

A' → β1 | β2 | … | βn

Here A' is a new nonterminal. Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix.

Example 3.13: Eliminate left factors from the given grammar: S → T + S | T

After left factoring, the grammar becomes

S → T L

L → + S | ε

Example 3.14: Left factor the following grammar: S → iEtS | iEtSeS | a ; E → b

After left factoring, the grammar becomes

S → iEtSS' | a

S' → eS | ε

E → b

Uses:

Left factoring is used in predictive top down parsing technique.


3.3 TOP DOWN PARSING -GENERAL STRATEGIES

Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first). Equivalently, top-down parsing can be viewed as finding a leftmost derivation for the input string.

Parsers are generally distinguished by whether they work top-down (start with the grammar's start symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that form the leaves of the parse tree and build the tree from the bottom). Top-down parsers include recursive-descent and LL parsers, while the most common forms of bottom-up parsers are LR parsers.

Figure 3.5: Types of parser

Example 3.15 : The sequence of parse trees for the input id+id*id in a top-down parse (LMD).

E → T E'

E' → + T E' | ε

T → F T'

T' → * F T' | ε

F → ( E ) | id

Figure 3.6: Top-down parse for id + id * id

(Figure 3.5 classifies parsers into top-down parsers, namely backtracking (recursive-descent) and predictive parsers such as LL(1), and bottom-up parsers, namely shift-reduce parsers and the LR family: SLR, canonical LR, and LALR.)

3.4 RECURSIVE DESCENT PARSER

These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which production to apply for its nonterminal. Terminals in the body of the production are matched to the input at the appropriate time, while nonterminals in the body result in calls to their procedure. Backtracking, in the case when the wrong production was chosen, is a possibility.

void A()

{

Choose an A-production, A → X1 X2 … Xk;

for ( i = 1 to k )

{

if ( Xi is a nonterminal )

call procedure Xi () ;

else if ( Xi equals the current input symbol a )

advance the input to the next symbol;

else /* an error has occurred */;

}

}
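As a concrete illustration, the following is a small C sketch of such procedures for the left-factored expression grammar of Example 3.15 (E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id); no backtracking is needed for this grammar. The single-character token encoding ('i' for id) and the driver are assumptions of the sketch.

#include <stdio.h>
#include <stdlib.h>

/* recursive-descent parser sketch; 'i' abbreviates the token id */
const char *ip;                              /* input pointer */

void error(void) { printf("syntax error\n"); exit(1); }
void match(char t) { if (*ip == t) ip++; else error(); }

void E(void); void Eprime(void); void T(void); void Tprime(void); void F(void);

void E(void)      { T(); Eprime(); }                                   /* E  -> T E'        */
void Eprime(void) { if (*ip == '+') { match('+'); T(); Eprime(); } }   /* E' -> + T E' | e  */
void T(void)      { F(); Tprime(); }                                   /* T  -> F T'        */
void Tprime(void) { if (*ip == '*') { match('*'); F(); Tprime(); } }   /* T' -> * F T' | e  */
void F(void) {                                                         /* F  -> ( E ) | id  */
    if (*ip == '(') { match('('); E(); match(')'); }
    else if (*ip == 'i') { match('i'); }
    else error();
}

int main(void) {
    ip = "i+i*i";                            /* corresponds to id + id * id */
    E();
    if (*ip == '\0') printf("accepted\n"); else error();
    return 0;
}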

Example 3.16 : Consider the grammar

S → c A d

A → a b | a

To construct a parse tree top-down for the input string w = cad, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree of Figure 3.7(a). The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A.

Now, we expand A using the first alternative A → ab to obtain the tree of Figure 3.7(b). We have a match for the second input symbol, a, so we advance the input pointer to d, the third input symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and go back to A to see whether there is another alternative for A that has not been tried, but that might produce a match.

Figure 3.7: Steps in a top-down parse

The second alternative for A produces the tree of Figure 3.7(c). The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have produced a parse tree for w, we halt and announce successful completion of parsing.


3.5 PREDICTIVE PARSER (NON RECURSIVE)

A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that

S ⇒* wα  (by a leftmost derivation)

The table-driven parser in Figure 3.8 has an input buffer, a stack containing a sequence of grammar symbols, a parsing table, and an output stream. The input buffer contains the string to be parsed, followed by the endmarker $. We reuse the symbol $ to mark the bottom of the stack, which initially contains the start symbol of the grammar on top of $.

The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the current input symbol. If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a] of the parsing table M. Otherwise, it checks for a match between the terminal X and current input symbol a.

Figure 3.8: Model of a table-driven predictive parser

Algorithm 3.3 : Table-driven predictive parsing.

INPUT: A string w and a parsing table M for grammar G.

OUTPUT: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.

METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start symbol S of G on top of the stack, above $. The following procedure uses the predictive parsing table M to produce a predictive parse for the input.

set ip to point to the first symbol of w;

set X to the top stack symbol;

while ( X ≠ $ ) { /* stack is not empty */

if ( X = a ) pop the stack and advance ip;

else if ( X is a terminal ) error();

else if ( M[X, a] is an error entry ) error();

else if ( M[X, a] = X → Y1Y2…Yk ) {

output the production X → Y1Y2…Yk;

pop the stack;

push Yk Yk-1 … Y1 onto the stack, with Y1 on top;

}

set X to the top stack symbol;

}
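A compact C sketch of this table-driven loop for the expression grammar is shown below. The single-character symbol encodings (for example 'e' for E' and 'i' for id) and the hard-coded table function are assumptions of the sketch; the entries of table() correspond to the parsing table used in Example 3.17.

#include <stdio.h>
#include <string.h>

/* nonterminals: E, e (= E'), T, t (= T'), F;  terminals: i (= id), + * ( ) $ */
const char *table(char X, char a) {          /* parsing table M[X, a]; "" means epsilon */
    switch (X) {
    case 'E': if (a == 'i' || a == '(') return "Te";            /* E  -> T E'     */
              break;
    case 'e': if (a == '+') return "+Te";                        /* E' -> + T E'   */
              if (a == ')' || a == '$') return "";               /* E' -> epsilon  */
              break;
    case 'T': if (a == 'i' || a == '(') return "Ft";             /* T  -> F T'     */
              break;
    case 't': if (a == '*') return "*Ft";                        /* T' -> * F T'   */
              if (a == '+' || a == ')' || a == '$') return "";   /* T' -> epsilon  */
              break;
    case 'F': if (a == 'i') return "i";                          /* F -> id        */
              if (a == '(') return "(E)";                        /* F -> ( E )     */
              break;
    }
    return NULL;                                                 /* error entry    */
}

int is_nonterminal(char X) { return strchr("EeTtF", X) != NULL; }

int main(void) {
    const char *input = "i+i*i$";            /* id + id * id followed by endmarker */
    char stack[100] = "$E";                  /* $ at bottom, start symbol E on top */
    int top = 1;
    const char *ip = input;

    while (stack[top] != '$') {
        char X = stack[top], a = *ip;
        if (!is_nonterminal(X)) {            /* terminal on top: must match input */
            if (X == a) { top--; ip++; }
            else { printf("error\n"); return 1; }
        } else {
            const char *body = table(X, a);
            if (body == NULL) { printf("error\n"); return 1; }
            printf("%c -> %s\n", X, *body ? body : "epsilon");
            top--;                           /* pop X */
            for (int k = (int)strlen(body) - 1; k >= 0; k--)
                stack[++top] = body[k];      /* push body in reverse, first symbol on top */
        }
    }
    printf(*ip == '$' ? "accepted\n" : "error\n");
    return 0;
}

On the input id + id * id it prints the same sequence of productions as the leftmost derivation shown in Example 3.17.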


Example 3.17: Consider the following grammar, and parse the input id + id * id using the nonrecursive predictive parser.

E → T E'

E' → + T E' | ε

T → F T'

T' → * F T' | ε

F → ( E ) | id

E ⇒ TE′ ⇒ FT′E′ ⇒ idT′E′ ⇒ idE′ ⇒ id+TE′ ⇒ id+FT'E′ ⇒ id+idT'E′

⇒ id+id*FT'E′ ⇒ id+id*idT'E′ ⇒ id+id*idE′ ⇒ id+id*id

Figure 3.9: Moves made by a predictive parser on input id + id * id

3.6 LL(1) PARSER

A grammar such that it is possible to choose the correct production with which to expand a given nonterminal, looking only at the next input symbol, is called LL(1). These grammars allow us to construct a predictive parsing table that gives, for each nonterminal and each lookahead symbol, the correct choice of production. Error correction can be facilitated by placing error routines in some or all of the table entries that have no legitimate production.

LL(1) Grammars

Predictive parsers (recursive-descent parsers) needing no backtracking, can be constructed for a class of grammars called LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action decisions.


Transition Diagrams for Predictive Parsers

Transition diagrams are useful for visualizing predictive parsers. To construct the transition diagram from a grammar, first eliminate left recursion and then left factor the grammar. Then, for each nonterminal A,

1. Create an initial and final (return) state. 2. For each production A → X1X2…Xk, create a path from the initial to the final state, with

edges labeled X1, X2, …, Xk. If A → ε, the path is an edge labeled ε.

A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).

Predictive parsers can be constructed for LL(1) grammar since the proper production to apply for a nonterminal can be selected by looking only at the current input symbol. Flow-of-control constructs with their distinguishing keywords generally satisfy the LL(1) constraints. For instance,

stmt → if ( expr ) stmt else stmt | while ( expr ) stmt | { stmt_list }

For the above productions, the keywords if and while and the symbol { tell us which alternative is the only one that could possibly succeed.

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set (a small code sketch of this fixed-point computation appears after the FOLLOW rules below).

1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a nonterminal and X → Y1Y2…Yk is a production for some k ≥ 1, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1…Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X). (For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.)
3. If X → ε is a production, then add ε to FIRST(X).

To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set.

1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
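Both computations are fixed-point iterations: the rules are applied repeatedly until no set changes. The C sketch below shows this style of computation for FIRST on the left-factored expression grammar; the character encodings and the '#' marker standing for ε are assumptions of the sketch. FOLLOW can be computed with the same iterate-until-stable pattern using rules (1)-(3) above.

#include <stdio.h>
#include <string.h>

/* grammar: E -> TE'   E' -> +TE' | e   T -> FT'   T' -> *FT' | e   F -> (E) | id
   nonterminals E e(=E') T t(=T') F, terminals i(=id) + * ( ), epsilon written '#' */
#define NPROD 8
struct prod { char head; const char *body; } G[NPROD] = {
    {'E', "Te"}, {'e', "+Te"}, {'e', ""},
    {'T', "Ft"}, {'t', "*Ft"}, {'t', ""},
    {'F', "(E)"}, {'F', "i"},
};

const char *NONTERMS = "EeTtF";
char FIRST[5][16];                             /* FIRST set of each nonterminal, as a string */

int nt_index(char X) { const char *p = strchr(NONTERMS, X); return p ? (int)(p - NONTERMS) : -1; }

/* add symbol c to set s; return 1 if it was not already present */
int add(char *s, char c) { if (strchr(s, c)) return 0; size_t n = strlen(s); s[n] = c; s[n+1] = '\0'; return 1; }

int main(void) {
    int changed = 1;
    while (changed) {                          /* repeat until no FIRST set grows */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            char *fst = FIRST[nt_index(G[p].head)];
            const char *b = G[p].body;
            if (*b == '\0') { changed |= add(fst, '#'); continue; }   /* X -> epsilon */
            int all_nullable = 1;
            for (int k = 0; b[k] && all_nullable; k++) {
                int j = nt_index(b[k]);
                if (j < 0) {                   /* terminal: add it and stop */
                    changed |= add(fst, b[k]);
                    all_nullable = 0;
                } else {                       /* nonterminal: add FIRST(Yk) minus epsilon */
                    for (const char *q = FIRST[j]; *q; q++)
                        if (*q != '#') changed |= add(fst, *q);
                    if (!strchr(FIRST[j], '#')) all_nullable = 0;
                }
            }
            if (all_nullable) changed |= add(fst, '#');  /* every Yk can derive epsilon */
        }
    }
    for (int i = 0; i < 5; i++) printf("FIRST(%c) = { %s }\n", NONTERMS[i], FIRST[i]);
    return 0;
}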

Example 3.18: Construct the Predictive parsing table for LL(1) grammar:

S → iEtSS' | a

S' → eS | ε

E → b

An LL(1) grammar must not be left recursive, so we can directly find FIRST() and FOLLOW().


FIRST(): FIRST(S) = {i, a}, FIRST(S') = {e, ε}, FIRST(E) = {b}

FOLLOW(): FOLLOW(S) = {e, $}, FOLLOW(S') = {e, $}, FOLLOW(E) = {t}

NON-TERMINAL	a	b	e	i	t	$
S	S → a			S → iEtSS'		
S'			S' → eS, S' → ε			S' → ε
E		E → b				

3.7 SHIFT REDUCE PARSER

Bottom-up parsers generally operate by choosing, on the basis of the next input symbol (lookahead symbol) and the contents of the stack, whether to shift the next input onto the stack, or to reduce some symbols at the top of the stack. A reduce step takes a production body at the top of the stack and replaces it by the head of the production.

Example 3.19: Consider the production rules for the shift-reduce parser on input id *id.

E → E + T | T

T → T * F | F

F → ( E ) | id

STACK		INPUT		ACTION
$		id1 * id2 $	shift
$ id1		* id2 $		reduce by F → id
$ F		* id2 $		reduce by T → F
$ T		* id2 $		shift
$ T *		id2 $		shift
$ T * id2	$		reduce by F → id
$ T * F		$		reduce by T → T * F
$ T		$		reduce by E → T
$ E		$		accept

An LR parser carries out these shift-reduce actions on input id * id by using a stack to hold states of the LR(0) automaton; the grammar symbols corresponding to the states on the stack appear in the SYMBOLS column of the parsing trace given in Section 3.10. At line (1) of that trace, the stack holds the start state 0 of the automaton; the corresponding symbol is the bottom-of-stack marker $.


3.8 LR PARSER

A schematic of an LR parser is shown in Figure 3.10. It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (ACTION and GOTO). The driver program is the same for all LR parsers; only the parsing table changes from one parser to another. The parsing program reads characters from an input buffer one at a time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each state summarizes the information contained in the stack below it.

Figure 3.10: Model of an LR parser

The stack holds a sequence of states s0s1…sm, where sm is on top. In the SLR method, the stack holds states from the LR(0) automaton; the canonical-LR and LALR methods are similar.

Structure of the LR Parsing Table

The parsing table consists of two parts: a parsing-action function ACTION and a goto function GOTO.

1. The ACTION function takes as arguments a state i and a terminal a (or $, the input end marker). The value of ACTION[i, a] can have one of four forms:

(a) Shift j, where j is a state. The action taken by the parser effectively shifts input a to the stack, but uses state j to represent a.

(b) Reduce A → β. The action of the parser effectively reduces β on the top of the stack to the head A.

(c) Accept. The parser accepts the input and finishes parsing. (d) Error. The parser discovers an error in its input and takes some corrective action.

2. We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii, A] = Ij, then GOTO also maps a state i and a nonterminal A to state j.

Algorithm 3.4: LR-parsing algorithm.

INPUT: An input string w and an LR-parsing table with functions ACTION and GOTO for a

grammar G.

OUTPUT: If w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error

indication.

METHOD: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input

buffer.

let a be the first symbol of w$;

while(1)


{ /* repeat forever */

let s be the state on top of the stack;

if ( ACTION[s, a] = shift t )

{

push t onto the stack;

let a be the next input symbol;

}

else if ( ACTION[s, a] = reduce A → β )

{

pop |β| symbols off the stack;

let state t now be on top of the stack;

push GOTO[t, A] onto the stack;

output the production A → β;

}

else if ( ACTION[s, a] = accept ) break; /* parsing is done */

else call error-recovery routine;

}
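The following C sketch implements this driver for the SLR parsing table of the expression grammar given in Example 3.20 below; the string encoding of ACTION entries and the single-character terminal codes are assumptions of the sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* productions, numbered as in the text: r1..r6 (index 0 unused) */
struct prod { char head; int len; } prods[7] = {
    {0,0}, {'E',3}, {'E',1}, {'T',3}, {'T',1}, {'F',3}, {'F',1}
};

const char *TERMS = "i+*()$";    /* i stands for id */

/* ACTION table: "sN" shift to N, "rN" reduce by production N, "a" accept, "" error */
const char *ACTION[12][6] = {
/*          id     +      *      (      )      $   */
/* 0 */  { "s5",  "",    "",    "s4",  "",    ""   },
/* 1 */  { "",    "s6",  "",    "",    "",    "a"  },
/* 2 */  { "",    "r2",  "s7",  "",    "r2",  "r2" },
/* 3 */  { "",    "r4",  "r4",  "",    "r4",  "r4" },
/* 4 */  { "s5",  "",    "",    "s4",  "",    ""   },
/* 5 */  { "",    "r6",  "r6",  "",    "r6",  "r6" },
/* 6 */  { "s5",  "",    "",    "s4",  "",    ""   },
/* 7 */  { "s5",  "",    "",    "s4",  "",    ""   },
/* 8 */  { "",    "s6",  "",    "",    "s11", ""   },
/* 9 */  { "",    "r1",  "s7",  "",    "r1",  "r1" },
/* 10*/  { "",    "r3",  "r3",  "",    "r3",  "r3" },
/* 11*/  { "",    "r5",  "r5",  "",    "r5",  "r5" },
};

/* GOTO table for nonterminals E, T, F (-1 = empty entry) */
int GOTOTAB[12][3] = {
    {1,2,3}, {-1,-1,-1}, {-1,-1,-1}, {-1,-1,-1}, {8,2,3}, {-1,-1,-1},
    {-1,9,3}, {-1,-1,10}, {-1,-1,-1}, {-1,-1,-1}, {-1,-1,-1}, {-1,-1,-1}
};

int nt_col(char A) { return A == 'E' ? 0 : A == 'T' ? 1 : 2; }

int main(void) {
    const char *ip = "i*i+i$";   /* id * id + id; input must end with $ */
    int stack[100], top = 0;
    stack[0] = 0;                /* initial state s0 */
    for (;;) {
        int s = stack[top];
        int col = (int)(strchr(TERMS, *ip) - TERMS);
        const char *act = ACTION[s][col];
        if (act[0] == 's') {             /* shift */
            stack[++top] = atoi(act + 1);
            ip++;
        } else if (act[0] == 'r') {      /* reduce */
            int p = atoi(act + 1);
            top -= prods[p].len;         /* pop |body| states */
            int t = stack[top];
            stack[++top] = GOTOTAB[t][nt_col(prods[p].head)];
            printf("reduce by production %d\n", p);
        } else if (act[0] == 'a') {
            printf("accept\n");
            return 0;
        } else {
            printf("error\n");
            return 1;
        }
    }
}

On the input id * id + id it performs the same sequence of reductions as the trace in Section 3.10.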

Example 3.20: Figure 3.11 shows the ACTION and GOTO functions of an LR-parsing table for the expression grammar

E → E + T | T

T → T * F | F

F → ( E ) | id

STATE	ACTION						GOTO
	id	+	*	(	)	$	E	T	F
0	s5			s4			1	2	3
1		s6				accept
2		r2	s7		r2	r2
3		r4	r4		r4	r4
4	s5			s4			8	2	3
5		r6	r6		r6	r6
6	s5			s4				9	3
7	s5			s4					10
8		s6			s11
9		r1	s7		r1	r1
10		r3	r3		r3	r3
11		r5	r5		r5	r5

Figure 3.11 Parsing table for expression grammar


3.9 LR (0) ITEM

An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse. States represent sets of "items". An LR(0) item (item for short) of a grammar G is a production of G with a dot at some position of the body. Thus, production A → XYZ yields the four items:

A → •XYZ

A → X•YZ

A → XY•Z

A → XYZ•

The production A → ε generates only one item, A → •.

An item indicates how much of a production we have seen at a given point in the parsing process.

For example, the item A → •XYZ indicates that we hope to see a string derivable from XYZ next on the input. Item

A → X•YZ indicates that we have just seen on the input a string derivable from X and that we hope next to see a string derivable from YZ.

Item A → XYZ• indicates that we have seen the body XYZ and that it may be time to reduce XYZ to A.

One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the basis for constructing a deterministic finite automaton that is used to make parsing decisions. Such an automaton is called an LR(0) automaton.

To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and production S' → S. The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input. That is, acceptance occurs only when the parser is about to reduce by S' → S.

Closure of Item Sets

If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the two rules:

1. Initially, add every item in I to CLOSURE(I). 2. If A → α•Bβ is in CLOSURE(I) and B → γ is a production, then add the item B → •γ to

CLOSURE(I), if it is not already there. Apply this rule until no more new items can be added to CLOSURE(I).

Intuitively, A → α•Bβ in CLOSURE(I) indicates that, at some point in the parsing process, we think we might next see a substring derivable from Bβ as input. The substring derivable from Bβ will have a prefix derivable from B by applying one of the B-productions. We therefore add items for all the B-productions; that is, if B → γ is a production, we also include B → •γ in CLOSURE(I).

A convenient way to implement the function closure is to keep a boolean array added, indexed by the nonterminals of G, such that added[B] is set to true if and when we add the nonkernel items B → •γ for each B-production B → γ.

We can divide all the sets of items of interest into two classes. They are:

1. Kernel items: the initial item S' → •S, and all items whose dots are not at the left end.

2. Nonkernel items: all items with their dots at the left end, except for S' → •S.


SetOfItems CLOSURE(I )

{

J = I;

repeat

for ( each item A → α•Bβ in J )

for ( each production B → γ of G ) if ( B → •γ is not in J )

add B → •γ to J; until no more items are added to J on one round;

return J;

}

Figure 3.32: Computation of CLOSURE

Example 3.21: Consider the augmented expression grammar:

E' → E

E → E + T | T

T → T * F | F

F → ( E ) | id

If I is the set of one item {[E' •E]}, then CLOSURE(I) contains the set of items I0 in Figure.

E' → •E

E → •E + T

E → •T

T → •T * F

T → •F

F → •(E)

F → •id

3.10 CONSTRUCTION OF SLR PARSING TABLE

The SLR method is a good starting point for constructing LR parsing tables; an LR parser that uses an SLR parsing table is called an SLR parser. The other two methods augment the SLR method with lookahead information. The SLR method begins with LR(0) items and LR(0) automata.

Given a grammar G, we augment it to produce G', with a new start symbol S'. From G', we construct C, the canonical collection of sets of items for G', together with the GOTO function.

Algorithm 3.5: Constructing an SLR-parsing table.

INPUT: An augmented grammar G'.

OUTPUT: The SLR-parsing table functions ACTION and GOTO for G'.

METHOD:

1. Construct C = {I0, I1, . . . , In}, the collection of sets of LR(0) items for G'.

2. State i is constructed from Ii. The parsing actions for state i are determined as follows:

(a) If [A → α•aβ] is in Ii, and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a terminal.


(b) If [A → α•] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.

(c) If [S' → S•] is in Ii, then set ACTION[i, $] to "accept".

If any conflicting actions result from the above rules, we say the grammar is not SLR(1). The algorithm fails to produce a parser in this case.

3. The goto transitions for state i are constructed for all nonterminals A using the rule: If GOTO(Ii, A) = Ij, then GOTO[i, A] = j.

4. All entries not defined by rules (2) and (3) are made "error".

5. The initial state of the parser is the one constructed from the set of items containing [S' → •S].

An LR parser using the SLR(1) table for G is called the SLR(1) parser for G, and a grammar having an SLR(1) parsing table is said to be SLR(1). We usually omit the "(1)" after the "SLR," since we shall not deal here with parsers having more than one symbol of lookahead.

Example 3.22: Let us construct the SLR table for the augmented expression grammar. The canonical collection of sets of LR(0) items for the following grammar.

E → E + T | T

T → T * F | F

F → ( E ) | id

Step1: Augment E' production

E' → E (Accept - r0)

E → E + T (r1)

E → T (r2)

T → T * F (r3)

T → F (r4)

F → ( E ) (r5)

F → id (r6)

Step2: Closure of E'

The set of items I0 :

E' → •E

E → •E + T

E → •T

T → •T * F

T → •F

F → •(E)

F → •id

Step3: GOTO operation of every symbol on I0 items:

Goto(I0, E): I1
E' → E•
E → E• + T

Goto(I0, T): I2
E → T•
T → T• * F

Goto(I0, F): I3
T → F•

Goto(I0, ( ): I4
F → (•E)
E → •E + T
E → •T
T → •T * F
T → •F
F → •(E)
F → •id

Goto(I0, id): I5
F → id•

Goto(I1, +): I6
E → E + •T
T → •T * F
T → •F
F → •(E)
F → •id

Goto(I2, *): I7
T → T * •F
F → •(E)
F → •id

Goto(I4, E): I8
F → (E•)
E → E• + T

Goto(I6, T): I9
E → E + T•
T → T• * F

Goto(I7, F): I10
T → T * F•

Goto(I8, )): I11
F → (E)•


Step4: Construction of DFA

Figure 3.12: LR(0) automaton – DFA with every state as final and I0 as initial.

Step5: Construction of FOLLOW SET for nonterminals

FOLLOW (E') ={$} because E' is a start symbol.

FOLLOW (E):

E' → E : E ends the start production, so everything in FOLLOW(E') is in FOLLOW(E); add $

E → E + T : E is followed by +, so add +

F → ( E ) : E is followed by ), so add )

FOLLOW (E) = {+, ), $}

FOLLOW (T):

E → T : everything in FOLLOW(E) is in FOLLOW(T); add $

E → E + T ⇒ T + T : T is followed by +, so add +

T → T * F : T is followed by *, so add *

F → ( E ) ⇒ ( T ) : T is followed by ), so add )

FOLLOW (T) = {+, *, ), $}


FOLLOW (F):

E → T, T → F : everything in FOLLOW(T) is in FOLLOW(F); add $

E → E + T ⇒ F + T : F is followed by +, so add +

T → T * F ⇒ F * F : F is followed by *, so add *

F → ( E ) ⇒ ( T ) ⇒ ( F ) : F is followed by ), so add )

FOLLOW (F) = {+, *, ), $}

Step6: Construction

The SLR-parsing table is constructed using Algorithm 3.5 (Constructing an SLR-parsing table).

Step7: Table filling

First consider the set of items I0:

The item F → •(E) gives rise to the entry ACTION[0, (] = shift 4, since Goto(I0, () = I4. The item F → •id gives rise to the entry ACTION[0, id] = shift 5, since Goto(I0, id) = I5.

Other items in I0 yield no actions.

Now consider I1: E' → E• and E → E• + T.

The first item yields ACTION[1, $] = accept, and the second yields ACTION[1, +] = shift 6.

Next consider I2: E → T• and T → T• * F.

Since FOLLOW(E) = {$, +, )}, the first item makes

ACTION[2, $] = ACTION[2, +] = ACTION[2, )] = reduce E → T

The second item makes ACTION[2, *] = shift 7. And so on.

STATE	ACTION						GOTO
	id	+	*	(	)	$	E	T	F
0	s5			s4			1	2	3
1		s6				accept
2		r2	s7		r2	r2
3		r4	r4		r4	r4
4	s5			s4			8	2	3
5		r6	r6		r6	r6
6	s5			s4				9	3
7	s5			s4					10
8		s6			s11
9		r1	s7		r1	r1
10		r3	r3		r3	r3
11		r5	r5		r5	r5


Step8: Input parsing

LINE	STACK		SYMBOLS		INPUT		ACTION
(1)	0		$		id * id + id $	shift
(2)	0 5		$ id		* id + id $	reduce by F → id
(3)	0 3		$ F		* id + id $	reduce by T → F
(4)	0 2		$ T		* id + id $	shift
(5)	0 2 7		$ T *		id + id $	shift
(6)	0 2 7 5		$ T * id	+ id $		reduce by F → id
(7)	0 2 7 10	$ T * F		+ id $		reduce by T → T * F
(8)	0 2		$ T		+ id $		reduce by E → T
(9)	0 1		$ E		+ id $		shift
(10)	0 1 6		$ E +		id $		shift
(11)	0 1 6 5		$ E + id	$		reduce by F → id
(12)	0 1 6 3		$ E + F		$		reduce by T → F
(13)	0 1 6 9		$ E + T		$		reduce by E → E + T
(14)	0 1		$ E		$		accept

At line (1), the stack holds the start state 0 of the automaton; the corresponding symbol is the bottom-of-stack marker $. The next input symbol is id and state 0 has a transition on id to state 5. We therefore shift. At line (2), state 5 (symbol id) has been pushed onto the stack. There is no transition from state 5 on input *, so we reduce. From item [F → id•] in state 5, the reduction is by production F → id.

With symbols, a reduction is implemented by popping the body of the production from the stack (on line (2), the body is id) and pushing the head of the production (in this case, F). With states, we pop state 5 for symbol id, which brings state 0 to the top, and look for a transition on F, the head of the production.

3.11 INTRODUCTION TO LALR PARSER

The LALR (lookahead-LR) technique is often used in practice, because the tables obtained by the LALR parser are considerably smaller than the canonical LR parser tables.

For a comparison of parser size, the SLR and LALR tables for a grammar always have the same number of states, and this number is typically several hundred states for a language like C. The canonical LR table would typically have several thousand states for the same-size language. Thus, it is much easier and more economical to construct SLR and LALR tables than the canonical LR tables.

Algorithm 3.6: An easy, but space-consuming, LALR table construction.

INPUT: An augmented grammar G'.

OUTPUT: The LALR parsing-table functions ACTION and GOTO for G'.

METHOD:

1. Construct C = (I0, I1,…, In), the collection of sets of LR(1) items. 2. For each core present among the set of LR(1) items, find all sets having that core, and

replace these sets by their union. 3. Let C' = {J0, J1,…,Jm} be the resulting sets of LR(1) items. The parsing actions for state i are

constructed from Ji. If there is a parsing action conflict, the algorithm fails to produce a parser, and the grammar is said not to be LALR(1).


4. The GOTO table is constructed as follows. If J is the union of one or more sets of LR(1) items, that is, J = I1 ∪ I2 ∪ … ∪ Ik, then the cores of GOTO(I1, X), GOTO(I2, X), …, GOTO(Ik, X) are the same, since I1, I2, …, Ik all have the same core. Let K be the union of all sets of items having the same core as GOTO(I1, X). Then GOTO(J, X) = K.

Algorithm 3.7: Construction of the sets of LR(1) items.

INPUT: An augmented grammar G'.

OUTPUT: The sets of LR(1) items that are the set of items valid for one or more viable prefixes of G'.

METHOD: The procedures CLOSURE and GOTO and the main routine items for constructing the sets of items are given below.

SetOfItems CLOSURE(I)

{

repeat

for ( each item [A → α•Bβ, a] in I )

for ( each production B → γ in G' )

for ( each terminal b in FIRST(βa) )

add [B → •γ, b] to set I;

until no more items are added to I;

return I;

}

SetOfItems GOTO(I, X) {

initialize J to be the empty set;

for ( each item [A → α•Xβ, a] in I )

add item [A → αX•β, a] to set J;

return CLOSURE(J);

}

void items(G')

{

initialize C to CLOSURE({[S' → •S, $]});

repeat

for ( each set of items I in C )

for ( each grammar symbol X )

if ( GOTO(I, X ) is not empty and not in C )

add GOTO(I, X ) to C;

until no new sets of items are added to C;

}

Example 3.23: Consider the following augmented grammar.

S' → S

S → C C

C → c C | d

Construct parsing table for LALR(1) parser.


Construction of the sets of LR(1) items:

I0:
S' → •S, $
S → •CC, $
C → •cC, c/d
C → •d, c/d

I1: GOTO(I0, S)
S' → S•, $

I2: GOTO(I0, C)
S → C•C, $
C → •cC, $
C → •d, $

I3: GOTO(I0, c)
C → c•C, c/d
C → •cC, c/d
C → •d, c/d

I4: GOTO(I0, d)
C → d•, c/d

I5: GOTO(I2, C)
S → CC•, $

I6: GOTO(I2, c)
C → c•C, $
C → •cC, $
C → •d, $

I7: GOTO(I2, d)
C → d•, $

I8: GOTO(I3, C)
C → cC•, c/d

I9: GOTO(I6, C)
C → cC•, $

Figure 3.13: The GOTO graph for the above grammar


There are three pairs of sets of items that can be merged. I3 and I6 are replaced by their union:

I36: GOTO(I0, c)
C → c•C, c/d/$
C → •cC, c/d/$
C → •d, c/d/$

I4 and I7 are replaced by their union:

I47: GOTO(I0, d)
C → d•, c/d/$

I8 and I9 are replaced by their union:

I89: GOTO(I36, C)
C → cC•, c/d/$

The LALR ACTION and GOTO functions for the condensed sets of items are shown in table 3.4

STATE	ACTION			GOTO
	c	d	$	S	C
0	s36	s47		1	2
1			accept
2	s36	s47			5
36	s36	s47			89
47	r3	r3	r3
5			r1
89	r2	r2	r2

Parsing the input string “ccdd”

STACK			INPUT	ACTION			GOTO		PARSING ACTION
$ 0			ccdd$	ACTION[0, c] = s36			shift
$ 0 c 36		cdd$	ACTION[36, c] = s36			shift
$ 0 c 36 c 36		dd$	ACTION[36, d] = s47			shift
$ 0 c 36 c 36 d 47	d$	ACTION[47, d] = r3	GOTO[36, C] = 89	reduce by C → d
$ 0 c 36 c 36 C 89	d$	ACTION[89, d] = r2	GOTO[36, C] = 89	reduce by C → cC
$ 0 c 36 C 89		d$	ACTION[89, d] = r2	GOTO[0, C] = 2		reduce by C → cC
$ 0 C 2			d$	ACTION[2, d] = s47			shift
$ 0 C 2 d 47		$	ACTION[47, $] = r3	GOTO[2, C] = 5		reduce by C → d
$ 0 C 2 C 5		$	ACTION[5, $] = r1	GOTO[0, S] = 1		reduce by S → CC
$ 0 S 1			$	ACTION[1, $] = accept			accept


3.12 ERROR HANDLING AND RECOVERY IN SYNTAX ANALYZER

Syntax Error Handling

If a compiler had to process only correct programs, its design and implementation would be simplified greatly. However, a compiler is expected to assist the programmer in locating and tracking down errors that inevitably creep into programs, despite the programmer's best efforts.

Few languages have been designed with error handling in mind, even though errors are so commonplace.

Most programming language specifications do not describe how a compiler should respond to errors; error handling is left to the compiler designer.

Planning the error handling right from the start can both simplify the structure of a compiler and improve its handling of errors.

Common programming errors can occur at many different levels.

Lexical errors include misspellings of identifiers, keywords, or operators - e.g., the use of an identifier elipsesize instead of ellipsesize – and missing quotes around text intended as a string.

Syntactic errors include misplaced semicolons or extra or missing braces, that is, an extra or missing "{" or "}". As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error.

Semantic errors include type mismatches between operators and operands. An example is a return statement in a Java method with result type void.

Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==.

Syntactic errors are detected very efficiently by syntax analyzers. Several parsing methods, such as the LL and LR methods, detect an error as soon as possible.

Another reason for emphasizing error recovery during parsing is that many errors appear syntactic, whatever their cause, and are exposed when parsing cannot continue.

A few semantic errors, such as type mismatches, can also be detected efficiently; however, accurate detection of semantic and logical errors at compile time is a difficult task in general.

The error handler in a parser has goals that are simple to state but challenging to realize:

Report the presence of errors clearly and accurately.

Recover from each error quickly enough to detect subsequent errors.

Add minimal overhead to the processing of correct programs.

A common strategy is to print the offending line with a pointer to the position at which an error is detected.

Error-Recovery Strategies

Once an error is detected, the parser should recover from it. The simplest approach is for the parser to quit with an informative error message when it detects the first error.

Additional errors are often uncovered if the parser can restore itself to a state where processing of the input can continue.

1. Panic-Mode Recovery

With this method, on discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as semicolon or }, whose role in the source program is clear and unambiguous.


While panic-mode correction often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity.
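As a rough illustration (not part of the text's algorithms), a hand-written parser can implement panic-mode recovery simply by discarding tokens until a synchronizing token appears. The token codes and the toy token stream below are hypothetical placeholders:

/* Minimal sketch of panic-mode error recovery.
 * Token codes and the stub token stream are illustrative only. */
#include <stdio.h>

enum Token { TOK_ID, TOK_PLUS, TOK_SEMI, TOK_RBRACE, TOK_EOF };

/* Stub lexer returning a fixed stream: id + + id ; id EOF (the extra '+' is the "error"). */
static enum Token stream[] = { TOK_ID, TOK_PLUS, TOK_PLUS, TOK_ID, TOK_SEMI, TOK_ID, TOK_EOF };
static int pos = 0;
static enum Token nextToken(void) { return stream[pos++]; }

/* On discovering an error, discard tokens until ';', '}' or end of input is found. */
static void panicModeRecover(void)
{
    enum Token t;
    do {
        t = nextToken();
    } while (t != TOK_SEMI && t != TOK_RBRACE && t != TOK_EOF);
}

int main(void)
{
    panicModeRecover();                                   /* pretend an error was just detected */
    printf("resynchronized at token index %d\n", pos);    /* parsing would resume from here     */
    return 0;
}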

2. Phrase-Level Recovery

On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. The choice of the local correction is left to the compiler designer.

Phrase-level replacement has been used in several error-repairing compilers, as it can correct any input string. Its major drawback is the difficulty it has in coping with situations in which the actual error has occurred before the point of detection (must avoid infinite loops).

3. Error Productions

By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs.

A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.

4. Global Correction

Ideally, a compiler should make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction.

Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y, such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.

3.13 YACC

YACC is an acronym for "Yet Another Compiler Compiler". It was originally developed in the early 1970s by Stephen C. Johnson at AT&T and written in the B programming language, but soon rewritten in C. It appeared as part of Version 3 Unix, and a full description of Yacc was published in 1975.

Yacc is a computer program for the Unix operating system. It is a LALR parser generator, generating a parser, the part of a compiler that tries to make syntactic sense of the source code, specifically a LALR parser, based on an analytic grammar written in a notation similar to BNF.

Yacc itself used to be available as the default parser generator on most Unix systems. The input to Yacc is a grammar with snippets of C code (called "actions") attached to its rules. Its output is a shift-reduce parser in C that executes the C snippets associated with each rule as soon as the rule is recognized. Typical actions involve the construction of parse trees.


A Yacc source program has three parts:

declarations

%%

translation rules

%%

supporting C routines

Yacc specification of a simple desk calculator

%{
#include <ctype.h>
%}

%token DIGIT

%%

line   : expr '\n'         { printf("%d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       | DIGIT
       ;

%%

yylex()
{
    int c;
    c = getchar();
    if (isdigit(c))
    {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}

Figure 3.14: Creating an input/output translator with Yacc. The Yacc specification translate.y is processed by the Yacc compiler to produce y.tab.c; y.tab.c is then compiled by the C compiler into a.out; finally, a.out reads an input stream and produces the translated output.


3.14 DESIGN OF A SYNTAX ANALYZER FOR A SAMPLE LANGUAGE

Figure 3.15: Design of Syntax Analyzer with Lex and Yacc

YACC (Yet Another Compiler Compiler).

• Automatically generates a parser for a context-free grammar (an LALR parser)

– Allows syntax-directed translation by writing grammar productions and semantic actions

– LALR(1) is more powerful than LL(1).

• Works with Lex: YACC calls yylex() to get the next token.

– YACC and lex must agree on the values for each token.

• Like Lex, YACC pre-dates C++; some constructs need workarounds when used with C++.

• YACC file format:

declarations /* specify tokens, and non-terminals */

%%

translation rules /* specify grammar here */

%%

supporting C-routines

• Command “yacc yaccfile” produces y.tab.c, which contains a routine yyparse().

• yyparse() calls yylex() to get tokens.

• yyparse() returns 0 if the program is grammatically correct, non-zero otherwise

• YACC automatically builds a parser for the grammar (LALR parser).

• May have shift/reduce and reduce/reduce conflicts when the grammar is not LALR.

• In this case, you will need to modify the grammar to make it LALR in order for yacc to work properly.

• YACC tries to resolve conflicts automatically

• Default conflict resolution:

• shift/reduce --> shift

• reduce/reduce --> first production in the state

(Figure 3.15, tool chain: the grammar rules are fed to yacc, producing y.tab.c; the lexical rules are fed to Lex, producing lex.yy.c; both generated files are compiled by cc into a.out, which reads the source input and produces the parsed output.)


Program to recognize a valid variable (identifier), which starts with a letter followed by any number of letters or digits.

LEX

%{

#include"y.tab.h"

extern yylval;

%}

%%

[0-9]+ {yylval=atoi(yytext); return DIGIT;}

[a-zA-Z]+ {return LETTER;}

[\t] ;

\n return 0;

. {return yytext[0];}

%%

YACC

%{

#include<stdio.h>

%}

%token LETTER DIGIT

%%

variable: LETTER|LETTER rest

;

rest: LETTER rest

|DIGIT rest

|LETTER

|DIGIT

;

%%

main()

{

yyparse();

printf("The string is a valid variable\n");

}

int yyerror(char *s)

{

printf("this is not a valid variable\n");

exit(0);

}

OUTPUT

$lex p4b.l

$yacc –d p4b.y

$cc lex.yy.c y.tab.c –ll

$./a.out

input34

The string is a valid variable

$./a.out

89file

This is not a valid variable


UNIT IV SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT

4.1 SYNTAX DIRECTED DEFINITIONS

A syntax-directed definition (SDD) is a generalization of a context-free grammar in which each grammar symbol has an associated set of attributes, partitioned into two subsets called the synthesized and inherited attributes of that grammar symbol.

An attribute can represent a string, a number, a type, a memory location, or whatever. The value of an attribute at a parse-tree node is defined by a semantic rule associated with the production used at that node.

The value of a synthesized attribute at a node is computed from the values of attributes at the children of that node in the parse tree; the value of an inherited attribute is computed from the values of attributes at the siblings and parent of that node.

Example 4.1: The syntax-directed definition in the table below is for a desk calculator program. This definition associates an integer-valued synthesized attribute called val with each of the nonterminals E, T, and F. For each E-, T-, and F-production, the semantic rule computes the value of attribute val for the nonterminal on the left side from the values of val for the nonterminals on the right side.

Production        Semantic rule
L → E n           Print(E.val)
E → E1 + T        E.val := E1.val + T.val
E → T             E.val := T.val
T → T1 * F        T.val := T1.val * F.val
T → F             T.val := F.val
F → (E)           F.val := E.val
F → digit         F.val := digit.lexval

Example 4.2: Figure 4.1 shows an annotated parse tree for the input string 3 * 5 + 4 n.

Figure 4.1: Annotated parse tree for 3 * 5 + 4 n
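To see how the annotations are computed (a brief walk-through consistent with the rules above): the leaf digit 3 gives F.val = 3 and, by T → F, T.val = 3; the leaf digit 5 gives F.val = 5, so T → T1 * F yields T.val = 3 * 5 = 15 and E → T yields E.val = 15; the leaf digit 4 gives F.val = 4 and T.val = 4, so E → E1 + T yields E.val = 15 + 4 = 19; finally L → E n prints 19.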


4.2 CONSTRUCTION OF SYNTAX TREE

Syntax-directed definitions are very useful for the construction of syntax trees. Each node in a syntax tree represents a construct; the children of the node represent the meaningful components of the construct. A syntax-tree node representing an expression E1 + E2 has label + and two children representing the subexpressions E1 and E2.

The nodes of a syntax tree are implemented by objects with a suitable number of fields. Each object will have an op field that is the label of the node.

The objects will have additional fields as follows:

If the node is a leaf, an additional field holds the lexical value for the leaf. A constructor function Leaf( op, val) creates a leaf object. Alternatively, if nodes are viewed as records, then Leaf returns a pointer to a new record for a leaf.

If the node is an interior node, there are as many additional fields as the node has children in the syntax tree. A constructor function Node takes two or more arguments: Node(op, c1, c2, . . . , ck) creates an object with first field op and k additional fields for the k children c1, . . . , ck.
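A minimal C sketch of such node objects is given below. The field names, the fixed child count passed to Node, and the token codes in main are illustrative assumptions, not definitions from the text:

/* Sketch of syntax-tree nodes with Leaf and Node constructor functions.
 * Field names and token codes are illustrative only. */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct SyntaxNode {
    int op;                        /* label of the node, e.g. '+' or a token code */
    int val;                       /* lexical value, used only by leaves          */
    int nkids;                     /* number of children (0 for a leaf)           */
    struct SyntaxNode **kids;      /* children of an interior node                */
} SyntaxNode;

/* Leaf(op, val): a node with no children holding a lexical value. */
SyntaxNode *Leaf(int op, int val) {
    SyntaxNode *n = malloc(sizeof *n);
    n->op = op; n->val = val; n->nkids = 0; n->kids = NULL;
    return n;
}

/* Node(op, k, c1, ..., ck): an interior node with k children. */
SyntaxNode *Node(int op, int k, ...) {
    va_list ap;
    SyntaxNode *n = malloc(sizeof *n);
    n->op = op; n->val = 0; n->nkids = k;
    n->kids = malloc(k * sizeof *n->kids);
    va_start(ap, k);
    for (int i = 0; i < k; i++)
        n->kids[i] = va_arg(ap, SyntaxNode *);
    va_end(ap);
    return n;
}

int main(void) {
    /* Build the tree for a - 4 + c (mirrors steps 1-5 of the example below). */
    enum { ID = 256, NUM = 257 };             /* made-up token codes */
    SyntaxNode *p1 = Leaf(ID, 'a');
    SyntaxNode *p2 = Leaf(NUM, 4);
    SyntaxNode *p3 = Node('-', 2, p1, p2);
    SyntaxNode *p4 = Leaf(ID, 'c');
    SyntaxNode *p5 = Node('+', 2, p3, p4);
    printf("root label '%c' with %d children\n", p5->op, p5->nkids);
    return 0;
}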

Example 4.3: The S-attributed definition in the following table constructs syntax trees for a simple expression grammar involving only the binary operators + and -. As usual, these operators are at the same precedence level and are jointly left associative. All nonterminals have one synthesized attribute, node, which represents a node of the syntax tree.

Every time the first production E → E1 + T is used, its rule creates a node with '+' for op and two children, E1.node and T.node, for the subexpressions. The second production has a similar rule.

S. No.   PRODUCTION       SEMANTIC RULES
(1)      E → E1 + T       E.node = new Node('+', E1.node, T.node)
(2)      E → E1 - T       E.node = new Node('-', E1.node, T.node)
(3)      E → T            E.node = T.node
(4)      T → (E)          T.node = E.node
(5)      T → id           T.node = new Leaf(id, id.entry)
(6)      T → num          T.node = new Leaf(num, num.val)

Example 4.3 (continued): The following sequence of constructor calls builds the syntax tree for a - 4 + c shown in Figure 4.2; the attributes for the grammar symbols E, T, id, and num are as discussed above.

1) p1 = new Leaf(id, entry-a);
2) p2 = new Leaf(num, 4);
3) p3 = new Node('-', p1, p2);
4) p4 = new Leaf(id, entry-c);
5) p5 = new Node('+', p3, p4);


Figure 4.2: Syntax tree for a - 4 + c

Example 4.4: In C, the type int [2][3] can be read as, "array of 2 arrays of 3 integers." The corresponding type expression array(2, array(3, integer)) is represented by the tree.

The operator array takes two parameters, a number and a type. If types are represented by trees, then this operator returns a tree node labeled array with two children for a number and a type.

Figure 4.3: Type expression for int [2] [3]

4.3 BOTTOM-UP EVALUATION OF S-ATTRIBUTE DEFINITIONS

An attribute grammar is a formal way to define attributes for the productions of a formal grammar, associating these attributes to values. The evaluation occurs in the nodes of the abstract syntax tree, when the language is processed by some parser or compiler.

The attributes are divided into two groups: synthesized attributes and inherited attributes. The synthesized attributes are the result of the attribute evaluation rules, and may also use the values of the inherited attributes. The inherited attributes are passed down from parent nodes.

In some approaches, synthesized attributes are used to pass semantic information up the parse tree, while inherited attributes help pass semantic information down and across it.

Syntax-directed definitions with only synthesized attributes are called S-attributed definitions. They are commonly used with LR parsers.

The implementation is done by using a stack to hold information about subtrees that have been parsed.



A translator for an arbitrary syntax-directed definition can be difficult to build. However, there are large classes of useful syntax-directed definitions for which it is easy to construct translators.

Only synthesized attributes appear in the syntax-directed definition in the following table for constructing the syntax tree for an expression.

S. No.   PRODUCTION       SEMANTIC RULES
(1)      E → E1 + T       E.node = new Node('+', E1.node, T.node)
(2)      E → E1 - T       E.node = new Node('-', E1.node, T.node)
(3)      E → T            E.node = T.node
(4)      T → (E)          T.node = E.node
(5)      T → id           T.node = new Leaf(id, id.entry)
(6)      T → num          T.node = new Leaf(num, num.val)

This approach can be applied to construct syntax trees during bottom-up parsing. The translation of expressions during top-down parsing often uses inherited attributes.

Synthesized Attributes on the Parser Stack

A translator for an S-attributed definition can often be implemented with the help of an LR-parser generator.

From an S-attributed definition, the parser generator can construct a translator that evaluates attributes as it parses the input.

A bottom-up parser uses a stack to hold information about subtrees that have been parsed. We can use extra fields in the parser stack to hold the values of synthesized attributes.

State    Value
 ...      ...
  X       X.x
  Y       Y.y
  Z       Z.z
 ...      ...

The above table shows an example of a parser stack with space for one attribute value. The stack is implemented by a pair of arrays, state and value. Each state entry is a pointer (or index) into the LR(1) parsing table. If the ith state symbol is A, then value[i] will hold the value of the attribute associated with the parse tree node corresponding to this A.

The current top of the stack is indicated by the pointer top. We assume that synthesized attributes are evaluated just before each reduction. Suppose the semantic rule A.a := f (X.x, Y.y, Z.z) is associated with the production A XYZ. Before XYZ is reduced to A, the value of the attribute Z.z is in value[top], that of Y.y in value[top-1], and that of X.x in value[top-2].

If a symbol has no attribute, then the corresponding entry in the value array is undefined. After the reduction, top is decremented by 2, the state covering A is put in state[top] (i.e., where X was), and the value of the synthesized attribute A.a is put in value [top].
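A rough C sketch of this bookkeeping for a reduction by a hypothetical production A → XYZ is shown below; the stack size, the placeholder semantic function f, and the stubbed goto_table are all assumptions, not part of the text's tables:

/* Sketch of synthesized-attribute evaluation on the parser's value stack
 * for a reduction by A -> X Y Z.  Sizes, f() and goto_table() are placeholders. */
#include <stdio.h>

#define STACK_SIZE 100

int state[STACK_SIZE];   /* parser states (indices into the LR parsing table) */
int value[STACK_SIZE];   /* synthesized attribute values, parallel to state[] */
int top = -1;

/* Hypothetical semantic rule A.a := f(X.x, Y.y, Z.z). */
static int f(int x, int y, int z) { return x + y + z; }

/* goto_table(s, A) would consult the GOTO part of the LR table; stubbed here. */
static int goto_table(int s, char A) { (void)A; return s + 1; }

void reduce_A_XYZ(void)
{
    int a = f(value[top - 2], value[top - 1], value[top]);   /* evaluate A.a before popping   */
    top = top - 2;                                           /* pop X, Y, Z and push A        */
    state[top] = goto_table(state[top - 1], 'A');            /* state covering A (where X was) */
    value[top] = a;                                          /* store the synthesized value   */
}

int main(void)
{
    /* Push a dummy configuration: X, Y, Z on top with attribute values 1, 2, 3. */
    state[++top] = 0;  value[top] = 0;
    state[++top] = 1;  value[top] = 1;   /* X.x = 1 */
    state[++top] = 2;  value[top] = 2;   /* Y.y = 2 */
    state[++top] = 3;  value[top] = 3;   /* Z.z = 3 */
    reduce_A_XYZ();
    printf("A.a = %d\n", value[top]);    /* prints 6 with the placeholder f */
    return 0;
}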

Example 4.4: Consider the syntax-directed definition of the desk calculator in the following table.

Production        Semantic rule
L → E n           Print(E.val)
E → E1 + T        E.val := E1.val + T.val
E → T             E.val := T.val
T → T1 * F        T.val := T1.val * F.val
T → F             T.val := F.val
F → (E)           F.val := E.val
F → digit         F.val := digit.lexval

Figure 4.4: Annotated parse tree for 3 * 5 + 4 n

Implementation of a desk calculator with an LR parser is given in the table.

Production        Semantic rule
L → E $           Print(value[top])
E → E1 + T        value[ntop] := value[top-2] + value[top]
E → T             E.val := T.val
T → T1 * F        value[ntop] := value[top-2] * value[top]
T → F             T.val := F.val
F → (E)           value[ntop] := value[top-1]
F → digit         F.val := digit.lexval

The value of ntop is set to top − r + 1, where r is the number of grammar symbols on the right side of the production. After each code fragment is executed, top is set to ntop.

The synthesized attributes in the annotated parse tree can be evaluated by an LR parser during a bottom-up parse of the input line 3*5+4.

The parse of the expression 3*5+4$ with the stack is shown in the table.

Input           State     Value      Production used
3 * 5 + 4 $     -         -
* 5 + 4 $       3         3
* 5 + 4 $       F         3          F → digit
* 5 + 4 $       T         3          T → F
5 + 4 $         T *       3 *
+ 4 $           T * 5     3 * 5
+ 4 $           T * F     3 * 5      F → digit
+ 4 $           T         15         T → T * F
+ 4 $           E         15         E → T
4 $             E +       15 +
$               E + 4     15 + 4
$               E + F     15 + 4     F → digit
$               E + T     15 + 4     T → F
$               E         19         E → E + T
                E $       19
                L         19         L → E $

4.4 DESIGN OF PREDICTIVE TRANSLATION

The following algorithm generalizes the construction of predictive parsers to implement a translation scheme based on a grammar suitable for top-down parsing.

Algorithm 4.2: Construction of a predictive syntax-directed translator.

Input: A syntax-directed translation scheme with an underlying grammar suitable for predictive parsing.

Output: Code for a syntax-directed translator.

Method: The technique is a modification of the predictive-parser construction.

1. For each nonterminal A, construct a function that has a formal parameter for each inherited attribute of A and that returns the values of the synthesized attributes of A.

2. The code for nonterminal A decides what production to use based on the current input symbol.

3. The code associated with each production does the following. We consider the tokens, nonterminals, and actions on the right side of the production from left to right.

(i). For token X with synthesized attribute x, save the value of x in the variable declared for X.x. Then generate a call to match token X and advance the input.

(ii). For nonterminal B, generate an assignment c := B(b1, b2, …, bk) with a function call on the right side, where b1, b2, …, bk are the variables for the inherited attributes of B and c is the variable for the synthesized attribute of B.

(iii). For an action, copy the code into the parser, replacing each reference to an attribute by the variable for that attribute.

Example 4.5: The following translation scheme builds a syntax tree for an expression; a predictive translator chooses which production to apply by looking at the current input symbol.

E → E1 + T    { E.nptr := mknode('+', E1.nptr, T.nptr) }
E → E1 - T    { E.nptr := mknode('-', E1.nptr, T.nptr) }
E → T         { E.nptr := T.nptr }
E → R         { E.nptr := R.nptr }
R → ε
T → id        { T.nptr := mkleaf(id, id.entry) }
T → num       { T.nptr := mkleaf(num, num.entry) }

Combine two of the E-productions to make the translator smaller. The new productions use token op to represent + and -.


E → E1 op T   { E.nptr := mknode(op, E1.nptr, T.nptr) }
E → T         { E.nptr := T.nptr }
E → R         { E.nptr := R.nptr }
R → ε
T → id        { T.nptr := mkleaf(id, id.entry) }
T → num       { T.nptr := mkleaf(num, num.entry) }
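A minimal C sketch of the translator Algorithm 4.2 would produce for this scheme is given below. It assumes (as the algorithm requires for predictive parsing) that the left recursion above has been eliminated in the usual way, giving E → T R, R → op T R | ε, T → id | num; the token codes, the stub lexer, and the node constructors are illustrative placeholders:

/* Predictive (recursive-descent) translator sketch for E -> T R, R -> op T R | empty,
 * T -> id | num.  Token codes, lexer and constructors are assumptions. */
#include <stdio.h>
#include <stdlib.h>

enum { ID, NUM, PLUS, MINUS, END };

typedef struct Node { int op, val; struct Node *left, *right; } Node;

static int lookahead;    /* current input token       */
static int tokenval;     /* its attribute value       */

/* Stub lexer: fixed token stream for a - 4 + c. */
static int toks[] = { ID, MINUS, NUM, PLUS, ID, END };
static int vals[] = { 'a',  0,    4,   0,   'c', 0  };
static int pos = 0;
static int nextToken(void) { tokenval = vals[pos]; return toks[pos++]; }

static Node *mkleaf(int op, int val) {
    Node *n = calloc(1, sizeof *n); n->op = op; n->val = val; return n;
}
static Node *mknode(int op, Node *l, Node *r) {
    Node *n = calloc(1, sizeof *n); n->op = op; n->left = l; n->right = r; return n;
}
static void match(int t) {
    if (lookahead == t) lookahead = nextToken();
    else { fprintf(stderr, "syntax error\n"); exit(1); }
}

/* One function per nonterminal.  T -> id | num returns its synthesized node. */
static Node *T(void) {
    if (lookahead != ID && lookahead != NUM) { fprintf(stderr, "syntax error\n"); exit(1); }
    Node *n = mkleaf(lookahead, tokenval);
    match(lookahead);
    return n;
}

/* R -> op T R | empty.  The tree built so far is passed as the inherited attribute i. */
static Node *R(Node *i) {
    if (lookahead == PLUS || lookahead == MINUS) {
        int op = lookahead;
        match(op);
        return R(mknode(op, i, T()));     /* extend the tree and continue */
    }
    return i;                             /* empty production: pass i up unchanged */
}

/* E -> T R */
static Node *E(void) { return R(T()); }

int main(void) {
    lookahead = nextToken();
    Node *root = E();
    printf("root is a %s node\n", root->op == PLUS ? "'+'" : "'-'");  /* '+' for a - 4 + c */
    return 0;
}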

4.5 TYPE SYSTEMS

The design of a type checker for a language is based on information about the syntactic constructs in the language; the type system associates a type expression with each language construct.

If both operands of the arithmetic operators of addition, subtraction, and multiplication are of type integer, then the result is of type integer.

The result of the unary & operator is a pointer to the object referred to by the operand. If the operand has type T, the result has type "pointer to T".

Basic types are the atomic types with no internal structure as far as the programmer is concerned.

The basic types are boolean, character, integer, and real.

Arrays, records, sets, pointers, and functions can also be treated as constructed types.

Type Expressions

A type expression is either a basic type or is formed by applying an operator called a type constructor to a type expression. The sets of basic types and constructors depend on the language to be checked.

The following are some of type expressions:

1. A basic type is a type expression. Typical basic types for a language include boolean, char, integer, float, and void (the absence of a value). type_error is a special basic type.

2. Since type expressions may be named, a type name is a type expression.

3. A type constructor applied to type expressions is a type expression. Constructors include:

a) Arrays: If T is a type expression, then array(I, T) is a type expression denoting the type of an array with elements of type T and index set I, which is often a range of integers. For example: int a[25];

b) Products: If T1 and T2 are type expressions, then their Cartesian product T1 x T2 is a type expression. x associates to the left and that it has higher precedence. Products are introduced for completeness; they can be used to represent a list or tuple of types (e.g., for function parameters).

c) Records: A record is a data structure with named fields. A type expression can be formed by applying the record type constructor to the field names and their types.

d) Pointers: If T is a type expression, then pointer (T) is a type expression denoting the type "pointer to an object of type T". For example: int a; int *p=&a;

e) Functions: Mathematically, a function maps elements of one set (the domain) to another set (the range), written F: D → R. A type expression can be formed by using the type constructor → for function types. We write s → t for "function from type s to type t".

4. Type expressions may contain variables whose values are themselves type expressions.


Example 4.6: The array type int [2][3] can be read as "array of 2 arrays of 3 integers each" and written as a type expression array(2, array(3, integer)). This type is represented by the tree in Figure 4.5. The operator array takes two parameters, a number and a type.

Figure 4.5: Type expression for int [2] [3]

Type Systems

A type system is a collection of rules for assigning type expressions to the various parts of a program. A type checker implements a type system. Type systems are specified in a syntax-directed manner.

Different type systems may be used by different compilers or processors of the same language. For example, in Pascal, the type of an array includes the index set of the array, so a function with an array argument can only be applied to arrays with that index set.

Static and Dynamic Checking of Types

Checking done by a compiler is said to be static, while checking done when the target program runs is termed dynamic.

Any check can be done dynamically, if the target code carries the type of an element along with the value of that element.

A sound type system eliminates the need for dynamic checking for type errors because it allows us to determine statically that these errors cannot occur when the target program runs.

In a sound type system, type errors cannot occur when the target code runs.

A language is strongly typed if its compiler can guarantee that the programs it accepts will execute without type errors. For example, int array[255]; declares an array whose elements are all of type integer.

Error Recovery

Since type checking has the potential for catching errors in programs. it is important for a type checker to do something reasonable when an error is discovered.

At the very least, the compiler must report the nature and location of the error.

It is desirable for the type checker to recover from errors, so it can check the rest of the input.

Since error handling affects the type-checking rules, it has to be designed into the type system right from the start; the rules must be prepared to cope with errors, including missing information.

4.6 SPECIFICATION OF A SIMPLE TYPE CHECKER

This section specifies a simple type checker for a language in which the type of each identifier must be declared before the identifier is used. The type checker is a translation scheme that synthesizes the type of each expression from the types of its subexpressions. The type checker can handle arrays, pointers, statements, and functions.

Specification of a simple type checker includes the following:



A Simple Language

Type Checking of Expressions

Type Checking of Statements

Type Checking of Functions

A Simple Language

The following grammar generates programs, represented by the nonterminal P, consisting of a sequence of declarations D followed by a single expression E.

P → D ; E

D → D ; D | id : T

T → char | integer | array [ num ] of T | ↑ T

A translation scheme for above rules:

P → D ; E
D → D ; D
D → id : T                { addtype(id.entry, T.type) }
T → char                  { T.type := char }
T → integer               { T.type := integer }
T → ↑ T1                  { T.type := pointer(T1.type) }
T → array [ num ] of T1   { T.type := array(1..num.val, T1.type) }
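For instance (a brief walk-through consistent with the scheme above), the declaration x : array [10] of ↑ integer is processed bottom-up: T → integer gives the type integer; T → ↑ T1 gives pointer(integer); T → array [ num ] of T1 gives array(1..10, pointer(integer)); finally D → id : T calls addtype to record that x has type array(1..10, pointer(integer)).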

Type Checking of Expressions

The synthesized attribute type for E gives the type expression assigned by the type system to the expression generated by E. The following semantic rules say that constants represented by the tokens literal and num have type char and integer, respectively:

Production       Semantic rule
E → literal      { E.type := char }
E → num          { E.type := integer }

A function lookup(e) is used to fetch the type saved in the symbol-table entry pointed to by e. When an identifier appears in an expression, its declared type is fetched and assigned to the attribute type;

E → id           { E.type := lookup(id.entry) }

The expression formed by applying the mod operator to two subexpressions of type integer has type integer; otherwise, its type is type_error. The rule is

E → E1 mod E2    { E.type := if E1.type = integer and E2.type = integer
                             then integer
                             else type_error }

In an array reference E1 [E2], the index expression E2 must have type integer, in which case the result is the element type t obtained from the type array(s, t) of E1; we make no use of the index set s of the array.

E → E1 [ E2 ]    { E.type := if E2.type = integer and E1.type = array(s, t)
                             then t
                             else type_error }


Within expressions, the postfix operator ↑ yields the object pointed to by its operand. The type of E↑ is the type of the object pointed to by the pointer E:

E → E1 ↑         { E.type := if E1.type = pointer(t) then t
                             else type_error }

Type Checking of Statements

Since language constructs like statements typically do not have values, the special basic type void can be assigned to them. If an error is detected within a statement, then the type type_error is assigned.

The assignment statement, conditional statement, and while statement are considered for type checking. Sequences of statements are separated by semicolons.

S → id := E          { S.type := if id.type = E.type then void
                                 else type_error }

S → if E then S1     { S.type := if E.type = boolean then S1.type
                                 else type_error }

S → while E do S1    { S.type := if E.type = boolean then S1.type
                                 else type_error }

S → S1 ; S2          { S.type := if S1.type = void and S2.type = void then void
                                 else type_error }

Type Checking of Functions

The application of a function to an argument can be captured by the production

E → E ( E )

in which an expression is the application of one expression to another. The rules for associating type expressions with nonterminal T can be augmented by the following production and action to permit function types in declarations.

T → T1 '→' T2    { T.type := T1.type → T2.type }

Quotes around the arrow used as a function constructor distinguish it from the arrow used as the metasymbol in a production.

The rule for checking the type of a function application is

E → E1 ( E2 )    { E.type := if E2.type = s and E1.type = s → t
                             then t
                             else type_error }
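For example, under this rule, if E1 has type integer → real and E2 has type integer, then E1(E2) has type real; if E2 instead had type char, the result would be type_error.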

4.7 EQUIVALENCE OF TYPE EXPRESSIONS

If two type expressions are equal then return a certain type else return type_error.

It is important to have a precise definition to say that two type expressions are equivalent.

The key issue is whether a name in a type expression stands for itself or whether it is an abbreviation for another type expression.

For efficiency, compilers use representations that allow type equivalence to be determined quickly.

The notion of type equivalence implemented by a specific compiler can often be explained using the concepts of structural and name equivalence.


In C, type names are introduced by typedef and struct declarations.

Structural Equivalence of Type Expressions

Since type expressions are built from basic types and constructors, a natural notion of equivalence between two type expressions is structural equivalence; i.e., two expressions are either the same basic type, or are formed by applying the same constructor to structurally equivalent types. That is, two type expressions are structurally equivalent if and only if they are identical.

For example, the type expression integer is equivalent only to integer because they are the same basic type.

Similarly, pointer (integer) is equivalent only to pointer (integer) because the two are formed by applying the same constructor pointer to equivalent types.

The algorithm recursively compares the structure of type expressions without checking for cycles, so it can be applied to a tree or a dag representation. It assumes that the only type constructors are for arrays, products, pointers, and functions.

The constructed types array(n1, t1) and array(n2, t2) are equivalent iff n1 = n2 and t1 is equivalent to t2.

Algorithm sequiv(s, t)

    if s and t are the same basic type then
        return true
    else if s = array(s1, s2) and t = array(t1, t2) then
        return sequiv(s1, t1) and sequiv(s2, t2)
    else if s = s1 x s2 and t = t1 x t2 then
        return sequiv(s1, t1) and sequiv(s2, t2)
    else if s = pointer(s1) and t = pointer(t1) then
        return sequiv(s1, t1)
    else if s = s1 → s2 and t = t1 → t2 then
        return sequiv(s1, t1) and sequiv(s2, t2)
    else
        return false
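A C sketch of the same structural check over a tagged type representation is given below; the Type struct, constructor tags, and the mk helper are illustrative assumptions rather than the text's own representation:

/* Sketch: structural equivalence over a tagged type-expression representation. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { T_BASIC, T_ARRAY, T_PRODUCT, T_POINTER, T_FUNCTION } Tag;

typedef struct Type {
    Tag tag;
    int basic;              /* which basic type (T_BASIC only)          */
    int size;               /* number of elements (T_ARRAY only)        */
    struct Type *left;      /* element / domain / pointed-to type       */
    struct Type *right;     /* second component / range (if applicable) */
} Type;

static bool sequiv(const Type *s, const Type *t)
{
    if (s->tag != t->tag) return false;
    switch (s->tag) {
    case T_BASIC:    return s->basic == t->basic;
    case T_ARRAY:    return s->size == t->size && sequiv(s->left, t->left);
    case T_POINTER:  return sequiv(s->left, t->left);
    case T_PRODUCT:
    case T_FUNCTION: return sequiv(s->left, t->left) && sequiv(s->right, t->right);
    }
    return false;
}

static Type *mk(Tag tag, int basic, int size, Type *l, Type *r)
{
    Type *p = calloc(1, sizeof *p);
    p->tag = tag; p->basic = basic; p->size = size; p->left = l; p->right = r;
    return p;
}

int main(void)
{
    enum { INTEGER = 1 };
    /* array(2, array(3, integer)) built twice: structurally equivalent. */
    Type *a = mk(T_ARRAY, 0, 2, mk(T_ARRAY, 0, 3, mk(T_BASIC, INTEGER, 0, 0, 0), 0), 0);
    Type *b = mk(T_ARRAY, 0, 2, mk(T_ARRAY, 0, 3, mk(T_BASIC, INTEGER, 0, 0, 0), 0), 0);
    printf("%s\n", sequiv(a, b) ? "equivalent" : "not equivalent");
    return 0;
}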

Example 4.7: The encoding of type expressions in this example is from a C Compiler for fast checking of type equivalence.

BASIC TYPE          ENCODING
boolean             0000
char                0001
integer             0010
real                0011

TYPE CONSTRUCTOR    ENCODING
pointer             01
array               10
freturns            11

TYPE EXPRESSION                     ENCODING
char                                000000 0001
freturns (char)                     000011 0001


pointer (freturns (char)) 000111 0001

array (pointer (freturns (char))) 100111 0001
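The word layout this encoding suggests is: the low four bits hold the basic type, and each constructor applied adds two bits to the left of the constructors already present. A small C sketch under that assumption (the layout itself is inferred from the sample encodings above, not stated by the text):

/* Sketch of the bit-string encoding of type expressions from Example 4.7. */
#include <stdio.h>

enum Basic  { BOOLEAN = 0x0, CHAR = 0x1, INTEGER = 0x2, REAL = 0x3 };
enum Constr { POINTER = 0x1, ARRAY = 0x2, FRETURNS = 0x3 };     /* 01, 10, 11 */

/* Apply a constructor to an already-encoded type; ncons is the number of
 * constructors already present in the encoding. */
static unsigned apply(unsigned constr, unsigned enc, int ncons)
{
    return (constr << (4 + 2 * ncons)) | enc;
}

int main(void)
{
    unsigned t = CHAR;                 /* char                                        */
    t = apply(FRETURNS, t, 0);         /* freturns(char)                -> 000011 0001 */
    t = apply(POINTER,  t, 1);         /* pointer(freturns(char))       -> 000111 0001 */
    t = apply(ARRAY,    t, 2);         /* array(pointer(freturns(char)))-> 100111 0001 */
    printf("encoding = 0x%03X\n", t);  /* prints 0x271, i.e. 10 0111 0001 in binary    */
    return 0;
}

Two type expressions are then structurally equivalent exactly when their bit strings are equal, which is why this representation allows fast equivalence checks.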

Names for Type Expressions

In some languages, types can be given names (Data type name). For example, in the Pascal program fragment.

type link = ↑cell;

var next : link;
    last : link;
    p : ↑cell;
    q, r : ↑cell;

The identifier link is declared to be a name for the type ↑cell. Whether the variables next, last, p, q, and r have identical types depends on the interpretation: a name may stand for itself or for the type expression it denotes.

A type graph is constructed to check name equivalence.

Every time a type constructor or basic type is seen, a new node is created.

Every time a new type name is seen, a leaf is created.

Two type expressions are equivalent if they are represented by the same node in the type graph.

Example 4.8: Consider Pascal program fragment

type link = ↑cell; np = ↑cell; nqr = ↑cell;

var next : link;

last : link;

p : np;

q : nqr;

r : nqr;

The identifier link is declared to be a name for the type ↑cell, and the new type names np and nqr have also been introduced. Since next and last are declared with the same type name, they are treated as having equivalent types. Similarly, q and r are treated as having equivalent types because the same implicit type name is associated with them. However, p, q, and next do not have equivalent types, since they all have types with different names.

Figure 4.6: Association of variables and nodes in the type graph.

Note that the type name cell has three parents, all labeled pointer. An equal sign appears between the type name link and the node in the type graph to which it refers.

(In the type graph of Figure 4.6, next and last are associated with the node link = pointer(cell); p is associated with a second pointer(cell) node; q and r share a third pointer(cell) node; all three pointer nodes have the single leaf cell as their child.)


Example: Check for equivalence of type expressions for the following C code:

typedef struct

{

int data[100];

int count;

} Stack;

typedef struct

{

int data[100];

int count;

} Set;

Stack x, y;

Set r, s;

Name equivalence: The most straightforward: two types are equal if, and only if, they have the same name. x and y would be of the same type and r and s would be of the same type, but the type of x or y would not be equivalent to the type of r or s.

x = y; valid

r = s; valid

Structural equivalence: two types are equal if, and only if, they have the same "structure".

x = r; valid

Thus Stack and Set are type equivalent under structural equivalence, but not under name equivalence.


4.8 TYPE CONVERSIONS

Consider expressions like x + i, where x is of type float and i is of type integer. Since the representation of integers and floating-point numbers is different within a computer and different machine instructions are used for operations on integers and floats, the compiler may need to convert one of the operands of + to ensure that both operands are of the same type when the addition occurs.

Suppose that integers are converted to floats when necessary, using a unary operator (float). For example, the integer 2 is converted to a float in the code for the expression 2 * 3.14:

t1 = (float) 2

t2 = t1 * 3.14

The attribute E.type takes the value integer or float.

The rule associated with E → E1 + E2 builds on the pseudocode:

if (E1.type = integer and E2.type = integer) E.type = integer;

else if (E1.type = float and E2. type = integer) E.type = float;

else if (E1.type = integer and E2. type = float) E.type = float;

else if (E1.type = float and E2. type = float) E.type = float;

Type conversion rules vary from language to language. The rules for Java in Figure 4.7 distinguish between widening conversions, which are intended to preserve information, and narrowing conversions, which can lose information.

(a) Widening conversions (b) Narrowing conversions

Figure 4.7: Conversions between primitive types in Java. The widening hierarchy in (a) is byte → short → int → long → float → double, with char widening to int; the narrowing conversions in (b) run in the opposite direction (for example, double to float or int to short).

Coercions

Conversion from one type to another is said to be implicit if it is done automatically by the compiler. Implicit type conversions, also called coercions, are limited in many languages to widening conversions. Conversion is said to be explicit if the programmer must write something to cause the conversion. Explicit conversions are also called casts.

The semantic action for checking E El + E2 uses two functions:

1. max(t1, t2) takes two types t1 and t2 and returns the maximum (or least upper bound) of the two types in the widening hierarchy. It declares an error if either t1 or t2 is not in the hierarchy; e.g., if either type is an array or a pointer type.

2. widen(a, t, w) generates type conversions if needed to widen an address a of type t into a value of type w. It returns a itself if t and w are the same type. Otherwise, it



generates an instruction to do the conversion and place the result in a temporary t, which is returned as the result.

Pseudocode for widen, assuming that the only types are integer and float.

Addr widen(Addr a, Type t, Type w)

{

if ( t = w ) return a;

else if ( t = integer and w = float )

{

temp = new Temp();

gen(temp '=' '(float)' a);

return temp;

}

else error;

}

Introducing type conversions into expression evaluation

E → E1 + E2    { E.type = max(E1.type, E2.type);

a1 = widen(E1. addr, E1 .type, E.type);

a2 = widen(E2. addr, E2 .type, E.type);

E.addr = new Temp();

gen(E. addr '=' a1 '+' a2); }

Example 4.9. Consider expressions formed by applying an arithmetic operator op to constants and identifiers, as in the grammar. Suppose there are two types - real and integer, with integers converted to reals when necessary. Attribute type of nonterminal E can be either integer or real, and the type-checking rules are shown below; the function lookup(e) returns the type saved in the symbol-table entry pointed to by e.

PRODUCTION       SEMANTIC RULE
E → num          E.type = integer
E → num . num    E.type = real
E → id           E.type = lookup(id.entry)
E → E1 op E2     E.type = if (E1.type = integer and E2.type = integer) then integer
                          else if (E1.type = integer and E2.type = real) then real
                          else if (E1.type = real and E2.type = integer) then real
                          else if (E1.type = real and E2.type = real) then real
                          else type_error


4.9 RUN-TIME ENVIRONMENT: SOURCE LANGUAGE ISSUES

Run-Time Environment

Run Time Environment establishes relationships between names and data objects.

The allocation and de-allocation of data objects are managed by the run-time environment.

Each execution of a procedure is referred to as an activation of the procedure.

If the procedure is recursive, several of its activations may be alive at the same time. Each call of a procedure leads to an activation that may manipulate data objects allocated for its use.

The representation of a data object at run time is determined by its type.

Often, elementary data types, such as characters, integers, and reals can be represented by equivalent data objects in the target machine.

However, aggregates, such as arrays, strings, and structures, are usually represented by collections of primitive objects.

Source Language Issues

1. Procedure
2. Activation Trees
3. Control Stack
4. The Scope of a Declaration
5. Bindings of Names

Procedure

A procedure definition is a declaration that associates an identifier with a statement. The identifier is the procedure name and the statement is the procedure body.

A procedure may return a value to the caller.

A complete program will also be treated as a procedure.

When a procedure name appears within an executable statement, we say that the procedure is called at that point.

The basic idea is that a procedure call executes the procedure body.

Some of the identifiers appearing in a procedure definition are special, and are called formal parameters of the procedure.

Actual parameters may be passed to a called procedure.

Procedures can contain local and global variables.

Activation Trees

We make the following assumptions about the flow of control among procedures during the execution of a program:

1. Control flows sequentially; that is, the execution of a program consists of a sequence of steps, with control being at some specific point in the program at each step.

2. Each execution of a procedure starts at the beginning of the procedure body and eventually returns control to the point immediately following the place where the procedure was called. This means the flow of control between procedures can be depicted using trees.

Each execution of a procedure body is referred to as an activation of the procedure. The lifetime of an activation of a procedure p is the sequence of steps between the first and last steps in the execution of the procedure body.

If a and b are procedure activations, then their lifetimes are either non-overlapping or nested.


A procedure is recursive if a new activation can begin before an earlier activation of the same procedure has ended.

The lifetime of the activation quicksort(1, 9) is the sequence of steps executed between printing enter quicksort(1, 9) and printing leave quicksort(1, 9).

The following are the rules to construct an activation tree:

1. Each node represents an activation of a procedure.
2. The root node represents the activation of the main program.
3. The node for a is the parent of the node for b if and only if control flows from activation a to b.
4. The node for a is to the left of the node for b if and only if the lifetime of a occurs before the lifetime of b.

enter main()

enter readarray()

leave readarray()

enter quicksort(1,9)

enter partition(1, 9)

leave partition(1, 9)

enter quicksort(1, 3)

. . .

leave quicksort(1,3)

enter quicksort(5, 9)

. . .

leave quicksort (5,9)

leave quicksort(1, 9)

leave main()

Figure 4.8: An activation tree corresponding to the output of activation of quicksort

Control Stack

The flow of control in a program corresponds to a depth-first traversal of the activation tree that starts at the root, visits a node before its children, and recursively visits children at each node in a left-to-right order.

We can use a stack, called a control stack to keep track of live procedure activations; the idea is to push the node for activation onto the control stack as the activation begins and to pop the node when the activation ends. Then the contents of the control stack are related to paths to the root of the activation tree. When node n is at the top of the control stack, the stack contains the nodes along the path from n to the root.


Example 4.10: Figure 4.9 shows nodes from the activation tree of Figure 4.8 that have been reached when control enters the activation represented by q(2, 3). Activations with labels r, p(1, 9), p(1, 3), and q(1, 3) have executed to completion, so the figure contains dashed lines to their nodes. The solid lines mark the path from q(2, 3) to the root.

Figure 4.9: The control stack contains nodes along a path to the root.

The Scope of a Declaration

A declaration in a language is a syntactic construct that associates information with a name. Declarations may be explicit, as in the Pascal fragment var i : integer; or they may be implicit. For example, in a Fortran program any variable name starting with one of the letters I through N is assumed to denote an integer, unless otherwise declared.

The scope rules of a language determine which declaration of a name applies when the name appears in the text of a program.

The portion of the program to which a declaration applies is called the scope of that declaration. An occurrence of a name in a procedure is said to be local to the procedure if it is in the scope of a declaration within the procedure; otherwise, the occurrence is said to be nonlocal.

At compile time, the symbol table can be used to find the declaration that applies to an occurrence of a name.

Qualifiers such as static, global, volatile, and final may also appear in declarations.

Bindings of Names

Even if each name is declared once in a program, the same name may denote different data objects at run time. The informal term "data object" corresponds to a storage location that can hold values.

In programming language semantics, the term environment refers to a function that maps a name to a storage location, and the term state refers to a function that maps a storage location to the value held there, as in Figure 4.10.

Figure 4.10: Two-stage mapping from names to values (name --environment--> storage location --state--> value)

Environments and states are different; an assignment changes the state, but not the environment. For example, suppose that storage address 100, associated with variable pi, holds 0. After the assignment pi := 3. 14, the same storage address is associated with pi, but the value held there is 3.14.



When an environment associates storage location s with a name x, we say that x is bound to s; the association itself is referred to as a binding of x. The term storage "location" is to be taken figuratively. If x is not of a basic type, the storage s for x may be a collection of memory words.

Static notion                 Dynamic counterpart
definition of a procedure     activations of the procedure
declaration of a name         bindings of the name
scope of a declaration        lifetime of a binding

4.10 STORAGE ORGANIZATION

The executing target program runs in its own logical address space in which each program value has a location. The management and organization of this logical address space is shared between the compiler, operating system, and target machine. The operating system maps the logical addresses into physical addresses, which are usually spread throughout memory.

The run-time representation of an object program in the logical address space consists of data and program areas as shown in Figure 4.11. A compiler for a language like C++ on an operating system like Linux might subdivide memory in this way.

The run time storage is subdivided to hold code and data as follows:

The generated target code

Data objects

Control stack (which keeps track of information about procedure activations)

Figure 4.11: Typical subdivision of run-time memory into code and data areas. (From the low address 0…0 to the high address F…F, the areas are: Code, Static data, Heap, free memory, and Stack.)

The size of the generated target code is fixed at compile time, so the compiler can place the executable target code in a statically determined area Code, usually in the low end of memory.

The size of some program data objects, such as global constants, and data generated by the compiler, such as information to support garbage collection, may be known at compile time, and these data objects can be placed in another statically determined area called Static. One reason for



statically allocating as many data objects as possible is that the addresses of these objects can be compiled into the target code. In early versions of Fortran, all data objects could be allocated statically.

To maximize the utilization of space at run time, the other two areas, Stack and Heap, are at the opposite ends of the remainder of the address space. These areas are dynamic; their size can change as the program executes. These areas grow towards each other as needed. The stack is used to store data structures called activation records that get generated during procedure calls.

Activation Records

Procedure calls and returns are usually managed by a run-time stack called the control stack. Each live activation has an activation record (sometimes called a frame) on the control stack. The contents of activation records vary with the language being implemented.

Figure 4.12: A general activation record

The following are the contents in an activation record

1. Temporary values, such as those arising from the evaluation of expressions, in cases where those temporaries cannot be held in registers.

2. Local data belonging to the procedure whose activation record this is.

3. A saved machine status, with information about the state of the machine just before the call to the procedure. This information typically includes the return address and the contents of registers that were used by the calling procedure and that must be restored when the return occurs.

4. An "access link" may be needed to locate data needed by the called procedure but found elsewhere, e.g., in another activation record.

5. A control link, pointing to the activation record of the caller.

6. Space for the return value of the called function, if any. Again, not all called procedures return a value, and if one does, we may prefer to place that value in a register for efficiency.

7. The actual parameters used by the calling procedure. Commonly, these values are not placed in the activation record but rather in registers.

(Figure 4.12 shows these fields laid out in a single record: actual parameters, returned values, control link, access link, saved machine status, local data, and temporaries.)
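Purely as an illustration (activation records are laid out by the compiler on the run-time stack, not declared in the source program), the fields above could be pictured as a C struct; every name and size here is hypothetical:

/* Hypothetical picture of an activation record as a C struct.
 * Real frames are laid out by the compiler; names, types and sizes are illustrative. */
typedef struct ActivationRecord {
    int   actual_params[4];                 /* actual parameters (often passed in registers) */
    int   returned_value;                   /* space for the return value, if any            */
    struct ActivationRecord *control_link;  /* points to the caller's activation record      */
    struct ActivationRecord *access_link;   /* for non-local data in enclosing procedures    */
    void *saved_machine_status;             /* return address and saved registers            */
    int   local_data[8];                    /* locals of this activation                     */
    int   temporaries[8];                   /* temporaries that do not fit in registers      */
} ActivationRecord;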


4.11 STORAGE ALLOCATION

There are basically three storage-allocation strategies, one used in each of the three data areas of the organization:

1. Static allocation lays out storage for all data objects at compile time.
2. Stack allocation manages the run-time storage as a stack.
3. Heap allocation allocates and de-allocates storage as needed at run time from a data area known as a heap.

1. Static Allocation

In static allocation, names are bound to storage as the program is compiled, so there is no need for a run-time support package.

Since the bindings do not change at run time, every time a procedure is activated, its names are bound to the same storage locations.

The above property allows the values of local names to be retained across activations of a procedure. That is, when control returns to a procedure, the values of the locals are the same as they were when control left the last time.

From the type of a name, the compiler determines the amount of storage to set aside for that name.

The address of this storage consists of an offset from an end of the activation record for the procedure.

The compiler must eventually decide where the activation records go, relative to the target code and to one another.

The following are the limitations for static memory allocation.

1. The size of a data object and constraints on its position in memory must be known at compile time.

2. Recursive procedures are restricted, because all activations of a procedure use the same bindings for local names.

3. Dynamic allocation is not allowed, so data structures cannot be created dynamically.

2. Stack Allocation

1. Stack allocation is based on the idea of a control stack.
2. A stack is a Last In First Out (LIFO) storage device where new storage is allocated and deallocated at only one "end", called the top of the stack.
3. Storage is organized as a stack, and activation records are pushed and popped as activations begin and end, respectively.
4. Storage for the locals in each call of a procedure is contained in the activation record for that call. Thus locals are bound to fresh storage in each activation, because a new activation record is pushed onto the stack when a call is made.
5. Furthermore, the values of locals are deleted when the activation ends; that is, the values are lost because the storage for locals disappears when the activation record is popped.
6. At run time, an activation record can be allocated and de-allocated by incrementing and decrementing the top of the stack, respectively.

a. Calling sequence


The layout and allocation of data to memory locations in the run-time environment are key issues in storage management. These issues are tricky because the same name in a program text can refer to multiple locations at run time.

The two adjectives static and dynamic distinguish between compile time and run time, respectively. We say that a storage-allocation decision is static, if it can be made by the compiler looking only at the text of the program, not at what the program does when it executes.

Conversely, a decision is dynamic if it can be decided only while the program is running. Many compilers use some combination of the following two strategies for dynamic storage allocation:

1. Stack storage. Names local to a procedure are allocated space on a stack. The stack supports the normal call/return policy for procedures.

2. Heap storage. Data that may outlive the call to the procedure that created it is usually allocated on a "heap" of reusable storage.

Figure 4.13: Division of tasks between caller and callee

The code for the callee can access its temporaries and local data using offsets from top-sp. The call sequence is:

1. The caller evaluates the actual parameters.
2. The caller stores a return address and the old value of top-sp into the callee's activation record. The caller then increments top-sp; that is, top-sp is moved past the caller's local data and temporaries and the callee's parameter and status fields.

3. The callee saves register values and other status information.
4. The callee initializes its local data and begins execution.

A possible return sequence is:

1. The callee places a return value next to the activation record of the caller.


2. Using the information in the status field, the callee restores top-sp and other registers and branches to a return address in the caller's code.

3. Although top-sp has been decremented, the caller can copy the returned value into its own activation record and use it to evaluate an expression.

b. variable-length data

1. Variable-length data are not stored in the activation record. Only a pointer to the beginning of each data appears in the activation record.

2. The relative addresses of these pointers are known at compile time.

c. dangling references

1. A dangling reference occurs when there is a reference to storage that has been deallocated.

2. It is a logical error to use dangling references, since the value of deallocated storage is undefined according to the semantics of most languages.

3. Heap allocation

1. The deallocation of activation records need not occur in a last-in first-out fashion, so storage cannot be organized as a stack.

2. Heap allocation parcels out pieces of contiguous storage, as needed for activation records or other objects. Pieces may be deallocated in any order. So over time the heap will consist of alternate areas that are free and in use.

3. The heap is an alternative to the stack.

4.12 PARAMETER PASSING

All programming languages have a notion of a procedure, but they can differ in how these procedures get their arguments. The actual parameters (the parameters used in the call of a procedure) are associated with the formal parameters (those used in the procedure definition).

Call-by-value

In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable). The value is placed in the location belonging to the corresponding formal parameter of the called procedure. This method is used in C and Java.

The actual parameters are evaluated and their r-values are passed to the called procedure. Call-by-value can be implemented as follows:

o A formal parameter is treated just like a local name, so the storage for the formals is in the activation record of the called procedure.

o The caller evaluates the actual parameters and places their r-values in the storage for the formals.

Call-by-reference

In call-by-reference, the address of the actual parameter is passed to the callee as the value of the corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented by following this pointer to the location indicated by the caller. Changes to the formal parameter thus appear as changes to the actual parameter.

When parameters are passed by reference (also known as call-by-address or call-by-location), the caller passes to the called procedure a pointer to the storage address of each actual parameter.

o If an actual parameter is a name or an expression having an l-value, then that l-value itself is passed.


o However, if the actual parameter is an expression, like a + b or 2, that has no l-value, then the expression is evaluated in a new location, and the address of that location is passed.
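A minimal C illustration of the difference between the two mechanisms (C itself passes everything by value, so call-by-reference is simulated here by passing an address explicitly):

#include <stdio.h>

/* Call-by-value: the callee receives a copy of the r-value, so the
   caller's variable is unchanged. */
void inc_by_value(int n) { n = n + 1; }

/* Call-by-reference, simulated in C: the caller passes the address
   (l-value) of the actual parameter and the callee follows the pointer,
   so changes show up in the caller's variable. */
void inc_by_reference(int *n) { *n = *n + 1; }

int main(void)
{
    int x = 10;
    inc_by_value(x);        /* x is still 10 */
    inc_by_reference(&x);   /* x becomes 11  */
    printf("%d\n", x);
    return 0;
}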

Copy restore

A hybrid between call-by-value and call-by-reference is copy-restore linkage (copy-in copy-out, or value-result).

Before control flows to the called procedure, the actual parameters are evaluated. The r-values of the actuals are passed to the called procedure as in call-by-value.

When control returns, the current r-values of the formal parameters are copied back into the l-values of the actuals.

Call-by-name

A mechanism call-by-name was used in the early programming language Algol 60. It requires that the callee execute as if the actual parameter were substituted literally for the formal parameter in the code of the callee, as if the formal parameter were a macro standing for the actual parameter.

Call-by-name is traditionally defined by the copy-rule of Algol.

1. The procedure is treated as if it were a macro; that is, its body is substituted for the call in the caller, with the actual parameters literally substituted for the formals. Such a literal substitution is called macro-expansion or in-line expansion.

2. The local names of the called procedure are kept distinct from the names of the calling procedure, each local of the called procedure being systematically renamed into a distinct new name before the macro-expansion is done.

3. The actual parameters are surrounded by parentheses if necessary to preserve their integrity.
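As a rough analogue, a C preprocessor macro also substitutes the text of the actual parameter literally for the formal, so the following sketch (illustrative only; unlike Algol 60 it does not rename locals) mimics the effect of the copy-rule:

#include <stdio.h>

/* The C preprocessor performs literal textual substitution, so the
   "formal" X is replaced by the text of the actual argument at every use
   and is therefore re-evaluated each time around the loop. */
#define SUM_TEN_TIMES(X, total)        \
    do {                               \
        for (int k = 0; k < 10; k++)   \
            (total) += (X);            \
    } while (0)

int main(void)
{
    int i = 0, total = 0;
    SUM_TEN_TIMES(i++, total);  /* the text "i++" is substituted literally */
    printf("i = %d, total = %d\n", i, total);  /* i = 10, total = 45 */
    return 0;
}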

4.13 SYMBOL TABLES

Symbol tables are data structures that are used by compilers to hold information about source-program constructs. The information is collected incrementally by the analysis phases of a compiler and used by the synthesis phases to generate the target code. Entries in the symbol table contain information about an identifier such as its character string (or lexeme) , its type, its position in storage, and any other relevant information.

Figure 4.14: Interaction between the symbol table and the various phases of the compiler

The symbol table, which stores information about the entire source program, is used by all phases of the compiler.

(Figure 4.14 shows the symbol table being read and updated by the lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer, and code generator.)


An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.

These attributes may provide information about the storage allocated for a name, its type, its scope.

In the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned are maintained in the symbol table.

The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.

A symbol table can be implemented in one of the following ways:
o Linear (sorted or unsorted) list
o Binary search tree
o Hash table

Among the above all, symbol tables are mostly implemented as hash tables, where the source code symbol itself is treated as a key for the hash function and the return value is the information about the symbol.

A symbol table may serve the following purposes depending upon the language in hand:
o To store the names of all entities in a structured form at one place.
o To verify if a variable has been declared.
o To implement type checking, by verifying assignments and expressions.
o To determine the scope of a name (scope resolution).

Symbol-Table Entries

A compiler uses a symbol table to keep track of scope and binding information about names. The symbol table is searched every time a name is encountered in the source text. Changes to the table occur if a new name or new Information about an existing name is discovered. A linear list is the simplest to implement, but its performance is poor. Hashing schemes provide better performance.

The symbol table grows dynamically as names are encountered during compilation, even though the program text itself is fixed.

Each entry in the symbol table is for the declaration of a name.

The format of entries need not be uniform.

Each entry can be implemented as a record consisting of a sequence of consecutive words of memory.

To keep symbol-table records uniform, it may be convenient for some of the information about a name to be kept outside the table entry, with only a pointer to this information stored in the record.

The following information about identifiers is stored in the symbol table:
o The name.
o The data type.
o The block level.
o Its scope (local, global).
o Pointer / address.
o Its offset from the base pointer.
o Function name, parameters, and variables.

Characters in a Name

There is a distinction between the token id for an identifier or name, the lexeme consisting of the character string forming the name, and the attributes of the name.


Strings of characters may be unwieldy to work with, so compilers often use some fixed-length representation of the name rather than the lexeme.

The lexeme is needed when a symbol-table entry is set up for the first time, and when we look up a lexeme found in the input to determine whether it is a name that has already appeared.

A common representation of a name is a pointer to a symbol-table entry for it.

If there is a modest upper bound on the length of a name, then the characters in the name can be stored in the symbol-table entry, as in Figure 4.15.

Figure 4.15: Symbol table names in fixed-size space within a record

If there is no limit on the length of a name, or if the limit is rarely reached, the indirect scheme of Figure 4.16 can be used.

Figure 4.16: Symbol table names in a separate array

Storage Allocation Information

Information about the storage locations that will be bound to names at run time is kept in the symbol table.

Static and dynamic allocation can be done.

Storage is allocated for code, data, stack, and heap.

COMMON blocks in Fortran are loaded separately.

The List Data Structure for Symbol Tables

The compiler plans out the activation record for each procedure.

The simplest and easiest to implement data structure for a symbol table is a linear list of records as shown in figure 4.17.


We use a single array, or equivalently several arrays, to store names and their associated information.

If the symbol table contains n names, then to find the data about a name we search n/2 names on average, so the cost of an inquiry is proportional to n.

Figure 4.17. A linear list of records.

Hash Tables for Symbol Tables

Variations of the searching technique known as hashing have been implemented in many compilers.

Open hashing is the simplest variant of this searching technique.

Even this scheme gives us the capability of performing e inquiries on n names in time proportional to n(n + e)/m, for any constant m of our choosing.

This method is generally more efficient than linear lists and is the method of choice for symbol tables in most situations.

The basic hashing scheme is illustrated in Figure 4.18. There are two parts to the data structure:
1. A hash table consisting of a fixed array of m pointers to table entries.
2. Table entries organized into m separate linked lists, called buckets (some buckets may be empty). Each record in the symbol table appears on exactly one of these lists.



Figure 4.18: A hash table of size 210.
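A minimal sketch of such an open-hashing symbol table in C; the bucket count, field names, and hash function are illustrative and not taken from the text.

#include <stdlib.h>
#include <string.h>

#define M 211   /* number of buckets; a prime is a common choice */

struct entry {
    char         *lexeme;   /* the name */
    int           type;     /* attribute: data-type code */
    int           offset;   /* attribute: storage offset */
    struct entry *next;     /* next entry in the same bucket */
};

static struct entry *bucket[M];   /* the fixed array of m pointers */

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 65599u + (unsigned char)*s++;
    return h % M;
}

struct entry *lookup(const char *lexeme)   /* find a name, or return NULL */
{
    for (struct entry *e = bucket[hash(lexeme)]; e != NULL; e = e->next)
        if (strcmp(e->lexeme, lexeme) == 0)
            return e;
    return NULL;
}

struct entry *insert(const char *lexeme, int type, int offset)
{
    unsigned h = hash(lexeme);
    struct entry *e = malloc(sizeof *e);
    e->lexeme = malloc(strlen(lexeme) + 1);
    strcpy(e->lexeme, lexeme);
    e->type   = type;
    e->offset = offset;
    e->next   = bucket[h];   /* link the new record at the front of its bucket */
    bucket[h] = e;
    return e;
}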

Representing Scope Information

A simple approach is to maintain a separate symbol table for each scope. In effect, the symbol table for a procedure or scope is the compile-time equivalent of an activation record. A linked list is a convenient way to represent scope information.

Figure 4.19: The most recent entry for a is near the front.

4.14 DYNAMIC STORAGE ALLOCATION

The techniques needed to implement dynamic storage allocation depend mainly on how storage is deallocated. If deallocation is implicit, then the run-time support package is responsible for determining when a storage block is no longer needed. There is less for the compiler to do if deallocation is done explicitly by the programmer.

Explicit Allocation of Fixed-Sized Blocks

The simplest form of dynamic allocation involves blocks of a fixed size. By linking the blocks into a list, as in Figure 4.20, allocation and deallocation can be done quickly with little or no storage overhead.

Figure 4.20: A deallocated block is added to the list of available blocks.


Suppose that blocks are to be drawn from a contiguous area of storage. Initialization of the area is done by using a portion of each block for a link to the next block. A pointer available points to the first block. Allocation consists of taking a block off the list and deallocation consists of putting the block back on the list.
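A sketch of this fixed-size block scheme in C; the block size, block count, and function names are illustrative.

#include <stddef.h>

#define BLOCK_SIZE 32      /* illustrative block size (at least sizeof(void *)) */
#define NBLOCKS    1024

static unsigned char area[NBLOCKS * BLOCK_SIZE];  /* contiguous storage */
static void *available;                           /* head of the free list */

void init_blocks(void)     /* thread a link through a portion of each block */
{
    for (int i = 0; i < NBLOCKS - 1; i++)
        *(void **)(area + i * BLOCK_SIZE) = area + (i + 1) * BLOCK_SIZE;
    *(void **)(area + (NBLOCKS - 1) * BLOCK_SIZE) = NULL;
    available = area;
}

void *alloc_block(void)    /* allocation: take a block off the list */
{
    void *p = available;
    if (p != NULL)
        available = *(void **)p;
    return p;
}

void free_block(void *p)   /* deallocation: put the block back on the list */
{
    *(void **)p = available;
    available = p;
}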

Explicit Allocation of Variable-Sized Blocks

When blocks are allocated and deallocated, storage can become fragmented; that is, the heap may consist of alternate blocks that are free and in use, as in Figure 4.21.

Figure 4.21: Free and used blocks in a heap.

The situation shown in Figure 4.21 can occur, for example, if a program allocates five blocks and then de-allocates the second and fourth. Fragmentation is of no consequence if blocks are of fixed size, but if they are of variable size, a situation like Figure 4.21 is a problem, because we could not allocate a block larger than any one of the free blocks, even though the total free space is available. First fit, worst fit, and best fit are some methods for allocating variable-sized blocks.

Implicit Deallocation

Implicit deallocation requires cooperation between the user program and the run-time package, because the latter needs to know when a storage block is no longer in use. This cooperation is implemented by fixing the format of storage blocks; the format of a storage block is as shown in Figure 4.22.

Figure 4.22: The format of a block.

Reference counts: We keep track of the number of blocks that point directly to the present block. If this count ever drops to 0, then the block can be deallocated because it cannot be referred to. In other words, the block has become garbage that can be collected. Maintaining reference counts can be costly in time.

Marking techniques: An alternative approach is to temporarily suspend execution of the user program and use the frozen pointers to determine which blocks are in use.

4.15 STORAGE ALLOCATION IN FORTRAN

FORTRAN was designed to permit static storage allocation. However, there are some issues, such as the treatment of COMMON and EQUIVALENCE declarations, that are fairly special to Fortran.


A Fortran compiler can create a number of data areas, i.e., blocks of storage in which the values of objects can be stored.

There is one data area for each procedure and one data area for each named COMMON block and for blank COMMON, if used.

The symbol table must record for each name the data area in which it belongs and its offset in that data area, that is, its position relative to the beginning of the area.

The compiler must eventually decide where the data areas go relative to the executable code and to one another, but this choice is arbitrary, since the data areas are independent.

DATA in COMMON Areas

A record is created for each COMMON block, with fields for the first and last names of the current procedure that are declared to be in that COMMON block.

A declaration is: COMMON /BLOCK1/ NAME1, NAME2

The compiler must do the following:

1. In the table for COMMON block names, create a record for BLOCK1, if one does not already exist.

2. In the symbol-table entries for NAME1 and NAME2, set a pointer to the symbol-table entry for BLOCK1, indicating that these are in COMMON and members of BLOCK1.

3. a) If the record has just now been created for BLOCK1, set a pointer in that record to the symbol-table entry for NAME1, indicating the first name in this COMMON block. Then, link the symbol-table entry for NAME1 to that for NAME2, using a field of the symbol table reserved for linking members of the same COMMON block. Finally, set a pointer in the record for BLOCK1 to the symbol-table entry for NAME2, indicating the last found member of that block.

b) If, however, this is not the first declaration of BLOCK1, simply link NAME1 and NAME2 to the end of the list of names for BLOCK1, using the pointer to the end of the list that appears in the record for BLOCK1.

After a procedure has been processed, we call the equivalence algorithm. A bit in the symbol-table entry for a name is set to indicate that the name has been equivalenced to something. A memory map is then created for each COMMON block by scanning the list of names for that block.

EQUIVALENCE statements

The first algorithms for processing equivalence statements appeared in assemblers rather than compilers. Since these algorithms can be a bit complex, especially when interactions between COMMON and EQUIVALENCE statements are considered, let us treat first a situation typical of an assembly language, where the only EQUIVALENCE statements are of the form

EQUIVALENCE A, B+offset

where A and B are the names of locations. This statement makes A denote the location that is offset memory units beyond the location for B.

A sequence of EQUIVALENCE statements groups names into equivalence sets whose positions relative to one another are all defined by the EQUIVALENCE statements, for example:

EQUIVALENCE A, B+100

EQUIVALENCE C, D-40


UNIT V CODE OPTIMIZATION AND CODE GENERATION

5.1 PRINCIPAL SOURCES OF OPTIMIZATION

A compiler optimization must preserve the semantics of the original program.

Except in very special circumstances, once a programmer chooses and implements a

particular algorithm, the compiler cannot understand enough about the program to replace it

with a substantially different and more efficient algorithm.

A compiler knows only how to apply relatively low-level semantic transformations, using

general facts such as algebraic identities like i + 0 = i.

5.1.1 Causes of Redundancy

There are many redundant operations in a typical program. Sometimes the redundancy is

available at the source level.

For instance, a programmer may find it more direct and convenient to recalculate some

result, leaving it to the compiler to recognize that only one such calculation is necessary.

But more often, the redundancy is a side effect of having written the program in a high-level

language.

As a program is compiled, each of these high level data structure accesses expands into a

number of low-level pointer arithmetic operations, such as the computation of the

location of the (i, j)th element of a matrix A.

Accesses to the same data structure often share many common low-level operations.

Programmers are not aware of these low-level operations and cannot eliminate the

redundancies themselves.

5.1.2 A Running Example: Quicksort

Consider a fragment of a sorting program called quicksort to illustrate several important

code improving transformations. The C program for quicksort is given below

void quicksort(int m, int n)

/* recursively sorts a[m] through a[n] */

{

int i, j;

int v, x;

if (n <= m) return;

/* fragment begins here */

i=m-1; j=n; v=a[n];

while(1) {

do i=i+1; while (a[i] < v);

do j = j-1; while (a[j] > v);

if (i >= j) break;

x=a[i]; a[i]=a[j]; a[j]=x; /* swap a[i], a[j] */

}

x=a[i]; a[i]=a[n]; a[n]=x; /* swap a[i], a[n] */

/* fragment ends here */

quicksort (m, j); quicksort (i+1, n) ;

}

Figure 5.1: C code for quicksort


Intermediate code for the marked fragment of the program in Figure 5.1 is shown in Figure

5.2. In this example we assume that integers occupy four bytes. The assignment x = a[i] is

translated into the two three address statements t6=4*i and x=a[t6] as shown in steps (14) and (15)

of Figure. 5.2. Similarly, a[j] = x becomes t10=4*j and a[t10]=x in steps (20) and (21).

Figure 5.2: Three-address code for fragment in Figure.5.1

Figure 5.3: Flow graph for the quicksort fragment of Figure 5.1

Figure 5.3 is the flow graph for the program in Figure 5.2. Block B1 is the entry node. All

conditional and unconditional jumps to statements in Figure 5.2 have been replaced in Figure 5.3

by jumps to the block of which the statements are leaders. In Figure 5.3, there are three loops.


Blocks B2 and B3 are loops by themselves. Blocks B2, B3, B4, and B5 together form a loop, with B2

the only entry point.

5.1.3 Semantics-Preserving Transformations

There are a number of ways in which a compiler can improve a program without changing

the function it computes. Common subexpression elimination, copy propagation, dead-code

elimination, and constant folding are common examples of such function-preserving (or semantics

preserving) transformations.

(a) Before (b)After

Figure 5.4: Local common-subexpression elimination

Some of these duplicate calculations cannot be avoided by the programmer because they lie

below the level of detail accessible within the source language. For example, block B5 shown in

Figure 5.4(a) recalculates 4 * i and 4 *j, although none of these calculations were requested

explicitly by the programmer.

5.1.4 Global Common Subexpressions

An occurrence of an expression E is called a common subexpression if E was previously

computed and the values of the variables in E have not changed since the previous computation.

We avoid re-computing E if we can use its previously computed value; that is, the variable x to

which the previous computation of E was assigned has not changed in the interim.

The assignments to t7 and t10 in Figure 5.4(a) compute the common subexpressions 4 * i

and 4 * j, respectively. These steps have been eliminated in Figure 5.4(b), which uses t6 instead of

t7 and t8 instead of t10.

Figure 5.5 shows the result of eliminating both global and local common subexpressions

from blocks B5 and B6 in the flow graph of Figure 5.3. We first discuss the transformation of B5

and then mention some subtleties involving arrays.

After local common subexpressions are eliminated, B5 still evaluates 4*i and 4 * j, as shown

in Figure 5.4(b). Both are common subexpressions; in particular, the three statements

t8=4*j

t9=a[t8]

a[t8]=x

in B5 can be replaced by

t9=a[t4]

a[t4]=x

using t4 computed in block B3. In Figure 5.5, observe that as control passes from the evaluation of

4 * j in B3 to B5, there is no change to j and no change to t4, so t4 can be used if 4 * j is needed.


Another common subexpression comes to light in B5 after t4 replaces t8. The new

expression a[t4] corresponds to the value of a[j] at the source level. Not only does j retain its value

as control leaves B3 and then enters B5, but a[j], a value computed into a temporary t5, does too,

because there are no assignments to elements of the array a in the interim. The statements

t9=a[t4]

a[t6]=t9

in B5 therefore can be replaced by

a[t6]=t5

Analogously, the value assigned to x in block B5 of Figure 5.4(b) is seen to be the same as

the value assigned to t3 in block B2. Block B5 in Figure 5.5 is the result of eliminating common

subexpressions corresponding to the values of the source level expressions a[i] and a[j] from B5 in

Figure 5.4(b). A similar series of transformations has been done to B6 in Figure 5.5.

The expression a[t1] in blocks B1 and B6 of Figure 5.5 is not considered a common subexpression, although t1 can be used in both places. After control leaves B1 and before it reaches B6, it can go through B5, where there are assignments to a. Hence, a[t1] may not have the same value on reaching B6 as it did on leaving B1, and it is not safe to treat a[t1] as a common subexpression.

Figure 5.5: B5 and B6 after common-subexpression elimination

5.1.5 Copy Propagation

Block B5 in Figure 5.5 can be further improved by eliminating x, using two new

transformations. One concerns assignments of the form u = v called copy statements, or copies for


short. Copies would have arisen much sooner, because the normal algorithm for eliminating

common subexpressions introduces them, as do several other algorithms.

(a) (b)

Figure 5.6: Copies introduced during common subexpression elimination

In order to eliminate the common subexpression from the statement c = d+e in Figure

5.6(a), we must use a new variable t to hold the value of d + e. The value of variable t, instead of

that of the expression d + e, is assigned to c in Figure 5.6(b). Since control may reach c = d+e

either after the assignment to a or after the assignment to b, it would be incorrect to replace c = d+e

by either c = a or by c = b.

The idea behind the copy-propagation transformation is to use v for u, wherever possible

after the copy statement u = v. For example, the assignment x = t3 in block B5 of Figure 5.5 is a

copy. Copy propagation applied to B5 yields the code in Figure 5.7. This change may not appear to

be an improvement, but, it gives us the opportunity to eliminate the assignment to x.

Figure 5.7: Basic block B5 after copy propagation

5.1.6 Dead-Code Elimination

A variable is live at a point in a program if its value can be used subsequently; otherwise, it

is dead at that point. A related idea is dead (or useless) code - statements that compute values that

never get used. While the programmer is unlikely to introduce any dead code intentionally, it may

appear as the result of previous transformations.

Deducing at compile time that the value of an expression is a constant and using the

constant instead is known as constant folding.

One advantage of copy propagation is that it often turns the copy statement into dead code.

For example, copy propagation followed by dead-code elimination removes the assignment to x and transforms the code in Figure 5.7 into a further improvement of block B5 in Figure 5.5.

5.1.7 Code Motion

Loops are a very important place for optimizations, especially the inner loops where

programs tend to spend the bulk of their time. The running time of a program may be improved if

we decrease the number of instructions in an inner loop, even if we increase the amount of code

outside that loop.


An important modification that decreases the amount of code in a loop is code motion. This

transformation takes an expression that yields the same result independent of the number of times a

loop is executed (a loop-invariant computation) and evaluates the expression before the loop.

Evaluation of limit - 2 is a loop-invariant computation in the following while statement :

while (i <= limit-2) /* statement does not change limit */

Code motion will result in the equivalent code

t = limit-2

while ( i <= t ) /* statement does not change limit or t */

Now, the computation of limit - 2 is performed once, before we enter the loop. Previously, there

would be n + 1 calculations of limit - 2 if we iterated the body of the loop n times.

5.1.8 Induction Variables and Reduction in Strength

Another important optimization is to find induction variables in loops and optimize their

computation. A variable x is said to be an "induction variable" if there is a positive or negative

constant c such that each time x is assigned, its value increases by c. For instance, i and t2 are

induction variables in the loop containing B2 of Figure 5.5. Induction variables can be computed

with a single increment (addition or subtraction) per loop iteration. The transformation of replacing

an expensive operation, such as multiplication, by a cheaper one,

such as addition, is known as strength reduction. But induction variables not only allow us

sometimes to perform a strength reduction; often it is possible to eliminate all but one of a group of

induction variables whose values remain in lock step as we go around the loop.

Figure 5.8: Strength reduction applied to 4 * j in block B3


When processing loops, it is useful to work "inside-out" ; that is, we shall start with the

inner loops and proceed to progressively larger, surrounding loops. Thus, we shall see how this

optimization applies to our quicksort example by beginning with one of the innermost loops: B3 by

itself. Note that the values of j and t4 remain in lock step; every time the value of j decreases by 1,

the value of t4 decreases by 4, because 4 * j is assigned to t4. These variables, j and t4, thus form a

good example of a pair of induction variables.

When there are two or more induction variables in a loop, it may be possible to get rid of all

but one. For the inner loop of B3 in Figure 5.5, we cannot get rid of either j or t4 completely; t4 is

used in B3 and j is used in B4. However, we can illustrate reduction in strength and a part of the

process of induction-variable elimination. Eventually, j will be eliminated when the outer loop

consisting of blocks B2, B3, B4 and B5 is considered.

Figure 5.9: Flow graph after induction-variable elimination

After reduction in strength is applied to the inner loops around B2 and B3, the only use of i and j is

to determine the outcome of the test in block B4. We know that the values of i and t2 satisfy the

relationship t2 = 4 * i, while those of j and t4 satisfy the relationship t4 = 4* j. Thus, the test t2 >=

t4 can substitute for i >= j. Once this replacement is made, i in block B2 and j in block B3 become

dead variables, and the assignments to them in these blocks become dead code that can be

eliminated. The resulting flow graph is shown in Figure. 5.9.

Note:

1. Code motion, induction variable elimination and strength reduction are loop optimization

techniques.

2. Common subexpression elimination, copy propagation, dead-code elimination and constant

folding are function preserving transformations.


5.2 DIRECTED ACYCLIC GRAPHS (DAG)

Like the syntax tree for an expression, a DAG has leaves corresponding to atomic operands

and interior nodes corresponding to operators. The difference is that a node N in a DAG has more

than one parent if N represents a common subexpression; in a syntax tree, the tree for the common

subexpression would be replicated as many times as the subexpression appears in the original

expression. Thus, a DAG not only represents expressions more succinctly, it gives the compiler

important clues regarding the generation of efficient code to evaluate the expressions.

Example: The DAG for the expression a + a * (b - c) + (b - c) * d, constructed by a sequence of steps

The leaf for “a” has two parents, because a appears twice in the expression. More

interestingly, the two occurrences of the common subexpression b-c are represented by one node,

the node labeled “-“. That node has two parents, representing its two uses in the subexpressions

a*(b-c) and (b-c)*d. Even though b and c appear twice in the complete expression, their nodes each

have one parent, since both uses are in the common subexpression b-c.

Figure 5.10: DAG for the expression a + a * (b - c) + (b - c) * d

Table 5.1: Syntax-directed definition to produce syntax trees or DAG's

S. No.  PRODUCTION     SEMANTIC RULES

1) E → E1 + T    E.node = new Node('+', E1.node, T.node)

2) E → E1 - T    E.node = new Node('-', E1.node, T.node)

3) E → T         E.node = T.node

4) T → ( E )     T.node = E.node

5) T → id        T.node = new Leaf(id, id.entry)

6) T → num       T.node = new Leaf(num, num.val)

The syntax-directed definition (SDD) of Table 5.1 can construct either syntax trees or

DAG's. It was used to construct syntax trees in Example 5.10, where functions Leaf and Node

created a fresh node each time they were called. It will construct a DAG if, before creating a new

node, these functions first check whether an identical node already exists. If a previously created

identical node exists, the existing node is returned. For instance, before constructing a new node,

Node(op, left, right) we check whether there is already a node with label op, and children left and

right, in that order. If so, Node returns the existing node; otherwise, it creates a new node.
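A sketch in C of Leaf and Node with this check; it uses a simple linear search over the nodes created so far (a real compiler would typically hash), and the structure and field names are illustrative rather than the text's own.

#include <stdlib.h>

struct dnode {
    int           op;            /* operator, or the token (id/num) for leaves */
    void         *entry;         /* symbol-table entry or value for leaves     */
    struct dnode *left, *right;  /* NULL for leaves                            */
};

static struct dnode *made[1024];   /* all nodes created so far */
static int nmade;

static struct dnode *find_or_make(int op, void *entry,
                                  struct dnode *l, struct dnode *r)
{
    for (int i = 0; i < nmade; i++)        /* is there an identical node? */
        if (made[i]->op == op && made[i]->entry == entry &&
            made[i]->left == l && made[i]->right == r)
            return made[i];                /* yes: return the existing node */
    struct dnode *n = malloc(sizeof *n);   /* no: create a new node */
    n->op = op; n->entry = entry; n->left = l; n->right = r;
    return made[nmade++] = n;
}

struct dnode *Leaf(int token, void *entry)
{
    return find_or_make(token, entry, NULL, NULL);
}

struct dnode *Node(int op, struct dnode *left, struct dnode *right)
{
    return find_or_make(op, NULL, left, right);
}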


1) p1 = Leaf(id, entry-a)

2) p2 = Leaf(id, entry-a) = p1

3) p3 = Leaf(id, entry-b)

4) p4 = Leaf(id, entry-c)

5) p5 = Node('-', p3, p4)

6) p6 = Node('*', p1, p5)

7) p7 = Node('+', p1, p6)

8) p8 = Leaf(id, entry-b) = p3

9) p9 = Leaf(id, entry-c) = p4

10) p10 = Node('-', p3, p4) = p5

11) p11 = Leaf(id, entry-d)

12) p12 = Node('*', p5, p11)

13) p13 = Node('+', p7, pl2)

Assume that entry-a points to the symbol-table entry for a, and similarly for the other identifiers.

When the call to Leaf (id, entry-a) is repeated at step 2, the node created by the previous call is

returned, so p2 = pl. Similarly, the nodes returned at steps 8 and 9 are the same as those returned at

steps 3 and 4 (i.e., p8 = p3 and p9 = p4). Hence the node returned at step 10 must be the same as that

returned at step 5; i.e., p10 = p5.

5.3 OPTIMIZATION OF BASIC BLOCKS

We can often obtain a substantial improvement in the running time of code merely by

performing local optimization within each basic block by itself.

5.3.1 The DAG Representation of Basic Blocks

Many important techniques for local optimization begin by transforming a basic block into

a DAG (directed acyclic graph). The idea extends naturally to the collection of expressions that are

created within one basic block. We construct a DAG for a basic block as follows:

1. There is a node in the DAG for each of the initial values of the variables appearing in

the basic block.

2. There is a node N associated with each statement s within the block. The children of N

are those nodes corresponding to statements that are the last definitions, prior to s, of the

operands used by s.

3. Node N is labeled by the operator applied at s, and also attached to N is the list of

variables for which it is the last definition within the block.

4. Certain nodes are designated output nodes. These are the nodes whose variables are live

on exit from the block; that is, their values may be used later, in another block of the

flow graph. Calculation of these "live variables" is a matter for global flow analysis.

The DAG representation of a basic block lets us perform several code improving transformations

on the code represented by the block.

a) We can eliminate local common subexpressions, that is, instructions that compute a

value that has already been computed.

b) We can eliminate dead code, that is, instructions that compute a value that is never used.

c) We can reorder statements that do not depend on one another; such reordering may

reduce the time a temporary value needs to be preserved in a register.

d) We can apply algebraic laws to reorder operands of three-address instructions, and

sometimes thereby simplify the computation.


Example: Construct DAG from the basic block.

1 t1 = 4*i

2 t2 = a[t1]

3 t3 = 4*i

4 t4 = b[t3]

5 t5 = t2*t4

6 t6 = prod + t5

7 t7 = i+1

8 i = t7

9 if i<=20 goto 1

Figure 5.11: Step by step construction of DAG

(Figure 5.11 builds the DAG one statement at a time. In the final DAG, the node 4 * i0 is labeled t1 and t3; a[t1] is labeled t2; b[t3] is labeled t4; t2 * t4 is labeled t5; prod0 + t5 is labeled t6 and prod; i0 + 1 is labeled t7 and i; and a final node compares the new i with 20.)


5.3.2 Finding Local Common Subexpressions

Common subexpressions can be detected by noticing, as a new node M is about to be added,

whether there is an existing node N with the same children, in the same order, and with the same

operator. If so, N computes the same value as M and may be used in its place.

Example 5.10 : A DAG for the block

a = b + c

b = a – d

c = b + c

d = a – d

is shown in Figure 5.11. When we construct the node for the third statement c = b + c, we

know that the use of b in b + c refers to the node of Figure 5.11 labeled -, because that is the most

recent definition of b. Thus, we do not confuse the values computed at statements one and three.

Figure 5.11: DAG for basic block

However, the node corresponding to the fourth statement d = a - d has the operator - and the

nodes with attached variables a and d0 as children. Since the operator and the children are the same

as those for the node corresponding to statement two, we do not create this node, but add d to the

list of definitions for the node labeled -.

If b is not live on exit from the block, then we do not need to compute that variable, and can

use d to receive the value represented by the node labeled -.

a = b + c

d = a – d

c = d + c

However, if both b and d are live on exit, then a fourth statement must be used to copy the

value from one to the other.

Example 5.11 : When we look for common subexpressions, we really are looking for expressions

that are guaranteed to compute the same value, no matter how that value is computed. Thus, the

DAG method will miss the fact that the expression computed by the first and fourth statements in

the sequence

a = b + c

b = b – d

c = c + d

e = b + c


is the same, namely b0 + c0. That is, even though b and c both change between the first and

last statements, their sum remains the same, because b + c = (b - d) + (c + d). The DAG for this

sequence is shown in Fig. 5.12, but does not exhibit any common subexpressions.

Figure 5.12: DAG for basic block

5.3.3 Dead Code Elimination

The operation on DAG's that corresponds to dead-code elimination can be implemented as

follows. We delete from a DAG any root (node with no ancestors) that has no live variables

attached. Repeated application of this transformation will remove all nodes from the DAG that

correspond to dead code.
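A sketch of this transformation in C, assuming each DAG node records its parent count and whether a variable live on exit is attached (the node layout and names are illustrative, not the text's own):

struct dag_node {
    int nparents;              /* number of DAG nodes that use this node      */
    int has_live_var;          /* 1 if a variable live on exit is attached    */
    int removed;
    struct dag_node *child[2]; /* operand nodes; may be NULL                  */
};

void eliminate_dead_roots(struct dag_node *nodes[], int n)
{
    int changed = 1;
    while (changed) {                       /* repeat until nothing changes */
        changed = 0;
        for (int i = 0; i < n; i++) {
            struct dag_node *nd = nodes[i];
            if (nd->removed || nd->nparents > 0 || nd->has_live_var)
                continue;                   /* not a dead root */
            nd->removed = 1;                /* delete this root ...          */
            for (int c = 0; c < 2; c++)
                if (nd->child[c] != NULL)
                    nd->child[c]->nparents--;  /* ... its children may become roots */
            changed = 1;
        }
    }
}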

Example 5.12: If, in Fig. 5.11, a and b are live but c and e are not, we can immediately remove the

root labeled e. Then, the node labeled c becomes a root and can be removed. The roots labeled a

and b remain, since they each have live variables attached.

Figure 5.13: DAG after Dead Code Elimination

5.3.4 The Use of Algebraic Identities

Algebraic identities represent another important class of optimizations on basic blocks. For

example, we may apply arithmetic identities, such as to eliminate computations from a basic block.

Another class of algebraic optimizations includes local reduction in strength,

that is, replacing a more expensive operator by a cheaper one as in:

A third class of related optimizations is constant folding. Here we evaluate constant

expressions at compile time and replace the constant expressions by their value. Thus the

expression 2 * 3.14 would be replaced by 6.28. Many constant expressions arise in practice because

of the frequent use of symbolic constants in programs.
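A minimal C sketch of constant folding applied to one three-address instruction x = y op z; the quadruple representation and field names are illustrative.

struct quad {
    char op;                 /* '+', '-', '*', or '=' for a plain copy */
    int  y_is_const, z_is_const;
    int  y, z;               /* operand values when they are constants */
};

void fold_constants(struct quad *q)
{
    int v;
    if (!q->y_is_const || !q->z_is_const)
        return;              /* at least one operand is not a constant */
    switch (q->op) {
    case '+': v = q->y + q->z; break;
    case '-': v = q->y - q->z; break;
    case '*': v = q->y * q->z; break;
    default:  return;
    }
    q->op = '=';             /* rewrite the instruction as x = constant */
    q->y = v;
    q->y_is_const = 1;
    q->z_is_const = 0;
}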


5.3.5 Representation of Array References

Consider for instance the sequence of three-address statements:
x = a[i]
a[j] = y
z = a[i]
The above code appears to allow the third instruction z = a[i] to be replaced by the simpler z = x. However, this replacement is not safe, because the assignment a[j] = y may change a[i] when j equals i.

The proper way to represent array accesses in a DAG is as follows.

1. An assignment from an array, like x = a [i], is represented by creating a node with operator

=[ ] and two children representing the initial value of the array, a0 in this case, and the index

i. Variable x becomes a label of this new node.

2. An assignment to an array, like a[j] = y, is represented by a new node with operator [ ] = and three children representing a0, j and y. There is no variable labeling this node. What is different is that the creation of this node kills all currently constructed nodes whose value

depends on a0. A node that has been killed cannot receive any more labels; that is, it cannot

become a common subexpression.

Example 5.11: The DAG for the basic block
x = a[i]
a[j] = y
z = a[i]
is shown in Figure 5.12.

The node N for x is created first, but when the node labeled [ ] = is created, N is killed.

Thus, when the node for z is created, it cannot be identified with N, and a new node with the same

operands a0 and i0 must be created instead.

Figure 5.12: The DAG for a sequence of array assignments

Example 5.12 : Sometimes, a node must be killed even though none of its children have an array

like a0 in Example 5.11 as attached variable. Likewise, a node can kill if it has a descendant that is

an array, even though none of its children are array nodes. For instance, consider the three-address

code

b = 12 + a

x = b[i]

b[j] = y



What is happening here is that, for efficiency reasons, b has been defined to be a position in

an array a. For example, if the elements of a are four bytes long, then b represents the fourth

element of a. If j and i represent the same value, then b [i] and b[j] represent the same location.

Therefore it is important to have the third instruction, b[j] = y, kill the node with x as its attached

variable.

Figure 5.13: A node that kills a use of an array need not have that array as a child

However, as we see in Fig. 5.13, both the killed node and the node that does the killing have

a0 as a grandchild, not as a child.

5.3.6 Pointer Assignments and Procedure Calls

When we assign indirectly through a pointer, as in the assignments

x = *p
*q = y
we do not know what p or q point to. In effect, x = *p is a use of every variable whatsoever, and *q

= y is a possible assignment to every variable. As a consequence, the operator =* must take all

nodes that are currently associated

with identifiers as arguments, which is relevant for dead-code elimination. More importantly, the

*= operator kills all other nodes so far constructed in the DAG.

There are global pointer analyses one could perform that might limit the set of variables a

pointer could reference at a given place in the code. Even local analysis could restrict the scope of a

pointer. For instance, in the sequence

p = &x
*p = y
we know that x, and no other variable, is given the value of y, so we don't need to kill any node but

the node to which x was attached.

Procedure calls behave much like assignments through pointers. In the absence of global data-flow

information, we must assume that a procedure uses and changes any data to which it has access.

Thus, if variable x is in the scope

of a procedure P, a call to P both uses the node with attached variable x and kills that node.

5.3.7 Reassembling Basic Blocks from DAG's

After we perform whatever optimizations are possible while constructing the DAG or by

manipulating the DAG once constructed, we may reconstitute the three-address code for the basic

block from which we built the DAG. For each node that has one or more attached variables, we



construct a three-address statement that computes the value of one of those variables. We prefer to

compute the result into a variable that is live on exit from the block. However, if we do not have

global live-variable information to work from, we need to assume that every variable of the

program (but not temporaries that are generated by the compiler to process expressions) is live on

exit from the block.

If the node has more than one live variable attached, then we have to introduce copy

statements to give the correct value to each of those variables. Sometimes, global optimization can

eliminate those copies, if we can arrange to use one of two variables in place of the other.

Example 8.15: Consider again Example 5.10. If b is not live on exit from the block, then the three

statements

a = b + c

d = a – d

c = d + c

suffice to reconstruct the basic block. The third instruction, c = d + c, must use d as an operand

rather than b, because the optimized block never computes b.

If both b and d are live on exit, or if we are not sure whether or not they are live on exit,

then we need to compute b as well as d. We can do so with the sequence

a = b + c

d = a – d

b = d

c = d + c

This basic block is still more efficient than the original. Although the number of instructions is the

same, we have replaced a subtraction by a copy, which tends to be less expensive on most

machines. Further, it may be that by doing a global analysis, we can eliminate the use of this

computation of b outside the block by replacing it by uses of d. In that case, we can come back to

this basic block and eliminate b = d later. Intuitively, we can eliminate this copy if wherever this

value of b is used, d is still holding the same value. That situation may or may not be true,

depending on how the program recomputes d.

5.4 GLOBAL DATA FLOW ANALYSIS

Global data flow analysis collects information about the entire program and distributes

this information to each block in the flow graph. Data-flow information can be collected by setting

up and solving systems of equations that relate information at various points in a program. These equations are termed data-flow equations. A typical data-flow equation has the form

Out[S] = gen[S] ∪ (in[S] – kill[S])

Where

gen[S] = Definitions within B that reach the end of B.

in[S] = Definitions that reach B's entry.
kill[S] = Definitions that never reach the end of B due to redefinitions of variables in B.

Out[S] = Definitions that reach B’s exit.
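A minimal C sketch of solving these equations iteratively at the level of basic blocks, with each set held as a bit vector (one unsigned word here, so at most 32 definitions); the flow-graph representation is illustrative and not taken from the text.

#include <string.h>

#define NBLOCKS 8            /* illustrative flow-graph size */

static unsigned gen[NBLOCKS], kill[NBLOCKS];   /* filled in per block    */
static unsigned in[NBLOCKS],  out[NBLOCKS];    /* computed by the solver */
static int npred[NBLOCKS];                     /* number of predecessors */
static int pred[NBLOCKS][NBLOCKS];             /* predecessor lists      */

void reaching_definitions(void)
{
    memset(in, 0, sizeof in);
    for (int b = 0; b < NBLOCKS; b++)
        out[b] = gen[b];                       /* initial estimate       */

    int changed = 1;
    while (changed) {                          /* iterate to a fixed point */
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned newin = 0;
            for (int p = 0; p < npred[b]; p++)
                newin |= out[pred[b][p]];      /* union over predecessors  */
            unsigned newout = gen[b] | (newin & ~kill[b]);  /* gen ∪ (in - kill) */
            if (newin != in[b] || newout != out[b]) {
                in[b]  = newin;
                out[b] = newout;
                changed = 1;
            }
        }
    }
}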

Paths and points

A definition point is a point in a program at which definition is carried out for a variable.

A reference point is a point in a program at which a reference to a data item is made.

An evaluation point is a point in a program at which expression is evaluated completely.


The number of points in a basic block is calculated as follows:

A point between two adjacent statements in the block.

A point before the first statement of the block

A point after the last statement of the block

Example 8.16: Find the number of points in the basic block

a = b + c

b = e + u

c = 8 * b

Number of points between two adjacent statements in the block = 2

Number of points before the first statement of the block = 1

Number of points after the last statement of the block = 1

Total number of points = 2 + 1 + 1 = 4 points

A path from p1 to pn is a sequence of points p1, p2, . . . , pn such that for each i between 1 and

n - 1 , either

1. pi is the point immediately preceding a statement and pi+1 is the point immediately

following that statement in the same block, or

2. pi is the end of some block and pi+1 is the beginning of a successor block.

Reaching Definitions

A definition d of a variable x: A definition d of a variable x is a statement that assigns a value to x. Other kinds of statements (such as a procedure call or an assignment through a pointer) that may define a value for x are called ambiguous definitions.

Use of variable x: The use of variable x means the value of x is referenced in expression

evaluation.

Reachability: Definition d of a variable x reaches a point p if there is a path from the point

immediately following d to p, such that d is not "killed" along that path.

Killing a variable: Definition d of a variable x is killed when there is a redefinition for the

variable x.

Live variable: A variable x is live at some point p if there is a path from p to exit, along

which the value of x is used before it is redefined. Otherwise the variable is said to be dead

at that point.

For example:
x = 3      (definition point for variable x)
y = x + 5  (reference point for variable x)
z = x + y  (evaluation point: the expression x + y is evaluated completely)


Example 8.17: Find the reachability of variable x.

Figure 5.15: Reaching definitions

Data flow analysis of structured Programs

Flow graphs for control-flow constructs such as if-else and do-while statements have a

useful property: there is a single beginning point at which control enters and a single end point that

control leaves from when execution of the statement is over.

Figure 5.16: Structured control constructs

Conservative Estimation of Data-Flow Information

Optimizations applied to the code must be safe. i.e., the data flow facts computed should

definitely be true.

Two main reasons that cause results of analysis to be conservative:

1. Control flow: The data-flow equations are generated under the assumption that all paths are executable, but in practice only one branch of an if-then-else is taken on any given execution.

2. Pointers and aliasing: The value of a pointer may not be known in advance to the programmer.

The definitions reaching the beginning and end of statements are computed for statements with the syntax given below:

S → id = E | S ; S | if E then S else S | do S while E

E → id + id | id


(Figure 5.15 shows a flow graph of blocks B1 through B7 in which x is defined in B1 by x = 5. It illustrates that: variable x is live in B1, B3, and B4; x is used in B4; x reaches B4 via B3, since it is not killed in B3; x is killed at B6 by a redefinition; variable t can reach from B3 to B6 via B4 or B5; along the path B3-B4-B6, t is killed, while along the path B3-B5-B6, t is used.)


Figure 5.17: Data-flow equations for reaching definitions

Representation of sets

The set of definitions for gen[S] and kill[S] can be represented by bit vectors.

The bit vector has a 1 in position i if the definition numbered i is present in the set; the position can be taken as the index of the statement.

The bit vector representation allows set operations to be implemented efficiently.

Consider the code

j= j-1 /* d5 */

if e1 then

a = u2 /* d6 */

else

i = u3 /* d7 */

Figure 5.17: Set representation and bit vector representation for gen[] and kill[].

(Figure 5.17 shows, for the statement above, the gen and kill sets of its sub-statements together with their bit-vector encodings: for example, gen = {d6, d7} is written 0000011, kill = {d5} is 0000100, gen = {d7} is 0000001, and kill = {d1, d4} is 1001000, with bit position i standing for definition di.)

The data-flow equations of Figure 5.17, case by case, are:

(a) S → d: a = b + c
gen[S] = {d}
kill[S] = Da – {d}
out[S] = gen[S] ∪ (in[S] – kill[S])

(b) S → S1 ; S2
gen[S] = gen[S2] ∪ (gen[S1] – kill[S2])
kill[S] = kill[S2] ∪ (kill[S1] – gen[S2])
in[S1] = in[S]
in[S2] = out[S1]
out[S] = out[S2]

(c) S → if E then S1 else S2
gen[S] = gen[S1] ∪ gen[S2]
kill[S] = kill[S1] ∩ kill[S2]
in[S1] = in[S]
in[S2] = in[S]
out[S] = out[S1] ∪ out[S2]

(d) S → do S1 while E
gen[S] = gen[S1]
kill[S] = kill[S1]
in[S1] = in[S] ∪ gen[S1]
out[S] = out[S1]

Here Da denotes the set of all definitions of the variable a in the program.


5.5 EFFICIENT DATA FLOW ALGORITHMS

The speed of data-flow analysis can be increased by the following two approaches:

1. Depth-First Ordering in iterative Algorithms:

2. Structure-based Data-Flow Analysis.

The first is an application of depth-first ordering to reduce the number of passes that the iterative algorithm takes, and the second uses intervals or the T1 and T2 transformations to

generalize the syntax-directed approach.

Depth-First Ordering in iterative Algorithms

For reaching definitions, available expressions, or live variables, any event of significance at a

node will be propagated to that node along an acyclic path.

Iterative algorithms can exploit this acyclic nature by visiting the nodes in a suitable order.

If a definition d is in in[B] then there is some acyclic path from the block containing d to B

such that d is in the in's and out's all along that path.

If an expression x+y is not available at the entrance to block B, then there is some acyclic

path that demonstrates that fact; either the path is from the initial node and includes no

statement that kills or generates x+y, or the path is from a block that kills x+y and along the

path there is no subsequent generation of x+y.

For live variables, if x is live on exit from block B, then there is an acyclic path from B to a use of x, along which there are no definitions of x.

If a use of x is reached from the end of block B along a path with a cycle, we can eliminate

that cycle to find a shorter path along which the use of x is still reached from B.

Procedure

1. First visit the root node of the tree, e.g., node (1).

2. From each node, visit an unvisited child, going as deep as possible.

3. After reaching the maximum depth, visit the missed nodes by backtracking to their parent nodes.

Figure 5.18: Depth first traversal for the given tree.

The order of visiting the edges in the above tree is:

1 3 4 6 7 8 10 8 9 8 7 6 4 5 4 3 1 2 1

Steps:

After node 4, either 5 or 6 may be visited next; here 6 is visited first.

After visiting node 10, backtrack to 8 to visit 9.

The definition d from Out[1] will reach In[3] and Out[3] will reach In[4] and so on.
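A sketch in C of computing such a depth-first ordering (reverse postorder) of the flow-graph nodes, which is the order in which the iterative algorithm then visits the blocks; the graph representation is illustrative.

#define NNODES 10            /* illustrative flow-graph size */

static int nsucc[NNODES];           /* number of successors of each node */
static int succ[NNODES][NNODES];    /* successor lists                   */
static int visited[NNODES];
static int dfo[NNODES];      /* dfo[0] is visited first by the iterative pass */
static int next_slot;        /* filled from the back toward the front         */

static void dfs(int n)
{
    visited[n] = 1;
    for (int i = 0; i < nsucc[n]; i++)
        if (!visited[succ[n][i]])
            dfs(succ[n][i]);
    dfo[--next_slot] = n;    /* number n after all its unvisited successors */
}

void depth_first_order(int entry)   /* reverse postorder from the entry node */
{
    for (int i = 0; i < NNODES; i++)
        visited[i] = 0;
    next_slot = NNODES;
    dfs(entry);
}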



Structure-based Data-Flow Analysis

We can implement data-flow algorithms that visit nodes no more times than the interval

depth of the flow graph. The ideas presented here apply to syntax-directed data-flow algorithms for

all sorts of structured control statements.

This approach also handles regions whose blocks have multiple exits.

Gen R,B denotes the set of definitions generated within region R that reach the end of basic block B.

Kill R,B denotes the set of definitions that are killed within region R on paths to the end of basic block B.

The transfer function Trans R,B(S), applied to a set S of definitions reaching the header of R, is the set

of definitions that reach the end of block B by traveling along paths wholly within R.

The definitions reaching the end of block B fall into two classes.

1. Those that are generated within R and propagate to the end of B independent of S.

2. Those that are not generated in R, but that also are not killed along some path from

the header of R to the end of B, and therefore are in Trans R, B (S) if and only if they

are in S.

Thus, we may write trans in the form:

Trans R, B (S) = Gen R,B ⋃ (S – Kill R,B)

Case 1:

Basis: when no transformation has yet been applied, each region is a single basic block B, and the transfer

function of that region is the same as the transfer function of block B.

Gen B, B = Gen[B]

kill B, B = Kill[B]

Case 2:

The region R is formed when R1 consumes R2. There are no edges from R2 to R1.

The header of R is the header of R1, and R2 does not affect the transfer functions of the blocks in R1:

Gen R, B = Gen R1, B

kill R, B = kill R1, B for all B in R1.

Figure 5.19: Region building by T2

For B in R2, a definition can reach the end of B if any of the following conditions hold:

1. The definition is generated within R2.

2. The definition is generated within R1, reaches the end of some predecessor of the header

of R2, and is not killed going from the header of R2 to B.

3. The definition is in the set S available at the header of R1, not killed going to some

predecessor of the header of R2, and not killed going from the header of R2 to B.
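A rough sketch (not from the text) of how the Gen and Kill transfer functions might be combined when T2 builds R from R1 and R2; the dictionary layout is hypothetical, and the kill update follows the reasoning complementary to the three conditions above.

def t2_combine(genR1, killR1, genR2, killR2, preds_of_header2):
    # genR1[p]/killR1[p]: Gen/Kill of region R1 for block p; likewise for R2.
    # preds_of_header2: blocks of R1 that are predecessors of R2's header (assumed non-empty).
    genR, killR = dict(genR1), dict(killR1)          # blocks of R1 keep their functions
    g_in = set().union(*(genR1[p] for p in preds_of_header2))
    k_in = set.intersection(*(killR1[p] for p in preds_of_header2))
    for b in genR2:                                  # blocks of R2
        genR[b]  = genR2[b]  | (g_in - killR2[b])    # conditions 1 and 2 above
        killR[b] = killR2[b] | k_in                  # a definition in S is blocked only if it is
                                                     # killed on every R1 path or within R2
    return genR, killR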



5.6 ISSUES IN DESIGN OF A CODE GENERATOR

The most important criterion for a code generator is that it produce correct code. The

following issues arise during the code generation phase:

1 Input to the Code Generator

2 The Target Program

3 Memory Management

4 Instruction Selection

5 Register Allocation

6 Evaluation Order

1 Input to the Code Generator

The input to the code generator is the intermediate representation (IR) of the source

program produced by the front end, along with information in the symbol table that is used to

determine the run-time addresses of the data objects denoted by the names in the IR.

The choices for the intermediate representation include the following:

Three-address representations such as quadruples, triples, indirect triples;

Virtual machine representations such as bytecodes and stack-machine code;

Linear representations such as postfix notation.

Graphical representations such as syntax trees and DAG's.
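For instance (an illustrative sketch, not taken from the text), the statement a = b + c * d could appear in quadruple form as follows, where t1 is a hypothetical compiler-generated temporary.

# Quadruples (op, arg1, arg2, result) for:  a = b + c * d
quads = [
    ("*", "c", "d", "t1"),   # t1 = c * d
    ("+", "b", "t1", "a"),   # a  = b + t1
]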

The front end has scanned, parsed, and translated the source program into a relatively low-

level IR, so that the values of the names appearing in the IR can be represented by quantities that

the target machine can directly manipulate, such as integers and floating-point numbers.

We assume that all syntactic and static semantic errors have been detected, that the necessary type checking

has taken place, and that type-conversion operators have been inserted wherever necessary. The

code generator can therefore proceed on the assumption that its input is error-free.

2 The Target Program

The output of the code generator is the target program, which will run on one of the following kinds of target machines.

The instruction-set architecture of the target machine has a significant impact on the

difficulty of constructing a good code generator that produces high-quality machine code.

The most common target-machine architectures are RISC (reduced instruction set

computer), CISC (complex instruction set computer), and stack based.

A RISC machine typically has many registers, three-address instructions, simple addressing

modes, and a relatively simple instruction-set architecture. In contrast, a CISC machine

typically has few registers, two-address instructions, a variety of addressing modes, several

register classes, variable-length instructions, and instructions with side effects.

In a stack-based machine, operations are done by pushing operands onto a stack and then

performing the operations on the operands at the top of the stack. To achieve high

performance the top of the stack is kept in registers.

The JVM is a software interpreter for Java bytecodes, an intermediate language produced

by Java compilers. The interpreter provides software compatibility across multiple

platforms, a major factor in the success of Java. To improve performance beyond plain

interpretation, just-in-time (JIT) Java compilers, which translate bytecodes into machine code just before execution, have been created.

The output of the code generator may be:

Absolute machine language program: It can be placed in a fixed memory location and

immediately executed.


Relocatable machine-language program: It allows subprograms (object modules) to be

compiled separately. A set of relocatable object modules can be linked together and loaded

for execution by a linking loader. The compiler must provide explicit relocation information

to the loader if automatic relocation is not possible.

Assembly language program: The process of code generation is somewhat easier, but the

assembly code must then be converted into machine-executable code with the help of an assembler.

3 Memory Management

Names in the source program are mapped to addresses of data objects in run-time memory

by both the front end and code generator.

Memory management uses the symbol table to obtain information about names.

The amount of memory required by each declared identifier is calculated, and storage space is

reserved for it in run-time memory.

Labels in three-address code are converted into equivalent memory addresses.

For instance, if a reference to "goto j" is encountered in the three-address code, an appropriate

jump instruction can be generated by computing the memory address for label j.

Some instruction addresses can be determined only at run time, that is, after the program has been

loaded.
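A minimal sketch (not from the text, with hypothetical structures) of resolving the label of a "goto j" into an instruction address in two passes:

def resolve_labels(instrs):
    # instrs: list of tuples; ('label', name) marks a position,
    # ('goto', name) jumps to it. Other tuples pass through unchanged.
    addr, out = {}, []
    for ins in instrs:                        # pass 1: record label addresses
        if ins[0] == "label":
            addr[ins[1]] = len(out)           # address of the next real instruction
        else:
            out.append(ins)
    for i, ins in enumerate(out):             # pass 2: patch goto targets
        if ins[0] == "goto":
            out[i] = ("goto", addr[ins[1]])
    return out

# Example: resolve_labels([("goto", "j"), ("add", "a", "b"), ("label", "j"), ("ret",)])
# returns [("goto", 2), ("add", "a", "b"), ("ret",)]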

4 Instruction Selection

The code generator must map the IR program into a code sequence that can be executed by the

target machine. The complexity of performing this mapping is determined by factors such as:

The level of the intermediate representation (IR).

The nature of the instruction-set architecture.

The desired quality of the generated code.

If the IR is high level, the code generator may translate each IR statement into a sequence of

machine instructions using code templates. Such statement-by-statement code generation, however,

often produces poor code that needs further optimization. If the IR reflects some of the low-level

details of the underlying machine, then the code generator can use this information to generate

more efficient code sequences.

The uniformity and completeness of the instruction set are important factors. The selection

of instruction depends on the instruction set of the target machine. Instruction speeds and machine

idioms are other important factors in selection of instruction.

If we do not care about the efficiency of the target program, instruction selection is

straightforward. For each type of three-address statement, we can design a code skeleton that

defines the target code to be generated for that construct.

For example, every three-address statement of the form x = y + z, where x, y, and z are

statically allocated, can be translated into the code sequence

LD R0, y // R0 = y (load y into register R0)

ADD R0, R0, z // R0 = R0 + z (add z to R0)

ST x, R0 // x = R0 (store R0 into x)

This strategy often produces redundant loads and stores. For example, the sequence of three-

address statements

a = b + c

d = a + e

would be translated into the following code

LD R0, b // R0 = b

ADD R0, R0, c // R0 = R0 + c

ST a, R0 // a = R0


LD R0, a // R0 = a

ADD R0, R0, e // R0 = R0 + e

ST d, R0 // d = R0

Here, the fourth statement is redundant since it loads a value that has just been stored, and so is the

third if a is not subsequently used.

The quality of the generated code is usually determined by its speed and size. On most

machines, a given IR program can be implemented by many different code sequences, with

significant cost differences between the different implementations. A naive translation of the

intermediate code may therefore lead to correct but unacceptably inefficient target code.
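The statement-by-statement strategy described above can be pictured with a small sketch (not from the text); the quadruple format and register choice are hypothetical, and no attempt is made to remember what is already in R0, which is exactly why redundant loads and stores appear.

def naive_codegen(quads):
    # Translate each (op, arg1, arg2, result) quadruple with the same template.
    ops = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
    code = []
    for op, a, b, res in quads:
        code.append(f"LD R0, {a}")                 # load the first operand
        code.append(f"{ops[op]} R0, R0, {b}")      # apply the operator
        code.append(f"ST {res}, R0")               # store the result
    return code

# naive_codegen([("+", "b", "c", "a"), ("+", "a", "e", "d")]) reproduces the
# six-instruction sequence above, including the redundant LD R0, a.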

5 Register Allocation

A key problem in code generation is deciding what values to hold in what registers.

Registers are the fastest computational unit on the target machine, but we usually do not have

enough of them to hold all values. Values not held in registers need to reside in memory.

Instructions involving register operands are invariably shorter and faster than those involving

operands in memory, so efficient utilization of registers is particularly important.

The use of registers is often subdivided into two subproblems:

1. Register allocation: During register allocation, select the appropriate set of variables that

will reside in registers at each point in the program.

2. Register assignment: During register assignment, pick the specific register in which each

corresponding variable will reside.

Finding an optimal assignment of registers to variables is difficult, even with single-register

machines. Mathematically, the problem is NP-complete. The problem is further complicated

because the hardware and/or the operating system of the target machine may require that certain

register-usage conventions be observed. Certain machines require register-pairs for some operands

and results.

Consider the following three-address code sequence:

t = a + b

t = t * c

t = t / d

An optimal machine-code sequence using only one register, R0, is:

LD R0, a

ADD R0, b

MUL R0, c

DIV R0, d

ST R0, t

6 Evaluation Order

The evaluation order is an important factor in generating an efficient target code. Some

computation orders require fewer registers to hold intermediate results than others. However,

picking a best order in the general case is a difficult NP-complete problem. We can avoid the

problem by generating code for the three-address statements in the order in which they have been

produced by the intermediate code generator.


5.7 A SIMPLE CODE GENERATOR ALGORITHM

A code generator generates target code for a sequence of three-address instructions. One of

the primary issues during code generation is deciding how to use registers to best advantage. Good

target code makes effective use of the available registers and keeps loads and stores to a minimum.

There are four principal uses of registers:

The operands of an operation must be in registers in order to perform the operation.

Registers make good temporaries - used only within a single basic block.

Registers are used to hold (global) values that are computed in one basic block and used

in other blocks

Registers are often used to help with run-time storage management

The machine instructions are of the form

LD reg, mem

ST mem, reg

OP reg, reg, reg

Register and Address Descriptors

For each available register, a register descriptor keeps track of the variable names

whose current value is in that register. Initially, all register descriptors are empty. As

code generation progresses, each register will hold the value of zero or more names.

For each program variable, an address descriptor keeps track of the location or

locations where the current value of that variable can be found. The location might be a

register, a memory address, a stack location, or some set of more than one of these. The

information can be stored in the symbol-table entry for that variable name.
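A minimal sketch (not from the text) of how the two descriptors might be kept, using Python dictionaries of sets; the register and variable names are hypothetical.

# Register descriptor: register -> set of names whose current value it holds.
reg_desc = {"R1": set(), "R2": set(), "R3": set()}

# Address descriptor: name -> set of locations (registers and/or its memory cell)
# where the current value of that name can be found.
addr_desc = {v: {v} for v in ("a", "b", "c", "d")}       # program variables start in memory
addr_desc.update({t: set() for t in ("t", "u", "v")})    # temporaries start nowhere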

Function GetReg

An essential part of the algorithm is a function getReg(I), which selects registers for

each memory location associated with the three-address instruction I.

Function getReg has access to the register and address descriptors for all the variables of

the basic block, and may also have access to certain useful data-flow information such

as the variables that are live on exit from the block.

For a three-address instruction such as x = y + z, a possible improvement to the

algorithm is to generate code for both x = y + z and x = z + y whenever + is a

commutative operator, and pick the better code sequence.

Machine Instructions for Operations

For a three-address instruction with an operation (+, -, *, /, ...) such as x = y + z, do the following:

1. Use getReg(x = y + z) to select registers for x, y, and z. Call these Rx, Ry, and Rz.

2. If y is not in Ry (according to the register descriptor for Ry), then issue an instruction LD

Ry, y', where y' is one of the memory locations for y (by the address descriptor for y).

3. Similarly, if z is not in Rz, issue an instruction LD Rz, z', where z' is a location for z.

4. Issue the instruction ADD Rx, Ry, Rz.
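A minimal sketch (not from the text) of steps 1-4; getReg, the descriptor dictionaries and the emitted code list are hypothetical stand-ins, and the descriptor updates described below are omitted here.

def gen_operation(op, x, y, z, getReg, reg_desc, addr_desc, code):
    # Emit target code for the three-address instruction x = y op z.
    Rx, Ry, Rz = getReg(op, x, y, z)                 # step 1: choose registers
    if y not in reg_desc[Ry]:                        # step 2: load y if it is not in Ry
        code.append(f"LD {Ry}, {y}")                 # (loads from y's memory cell for simplicity)
    if z not in reg_desc[Rz]:                        # step 3: load z if it is not in Rz
        code.append(f"LD {Rz}, {z}")
    mnemonic = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}[op]
    code.append(f"{mnemonic} {Rx}, {Ry}, {Rz}")      # step 4: the operation itself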

Machine Instructions for Copy Statements

Consider an important special case: a three-address copy statement of the form x = y.

We assume that getReg will always choose the same register for both x and y. If y is not

already in that register Ry, then generate the machine instruction LD Ry, y.

If y was already in Ry, we do nothing.


Managing Register and Address Descriptors

As the code-generation algorithm issues load, store, and other machine instructions, it needs

to update the register and address descriptors. The rules are as follows:

1. For the instruction LD R, x

a. Change the register descriptor for register R so it holds only x.

b. Change the address descriptor for x by adding register R as an additional location.

2. For the instruction ST x, R, change the address descriptor for x to include its own

memory location.

3. For an operation OP Rx, Ry, Rz, such as ADD Rx, Ry, Rz, implementing a three-address

instruction x = y + z

a. Change the register descriptor for Rx so that it holds only x.

b. Change the address descriptor for x so that its only location is Rx. Note that the

memory location for x is not now in the address descriptor for x.

c. Remove Rx from the address descriptor of any variable other than x.

4. When we process a copy statement x = y, after generating the load for y into register Ry,

if needed, and after managing descriptors as for all load statements (per rule 1):

a. Add x to the register descriptor for Ry.

b. Change the address descriptor for x so that its only location is Ry.
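A minimal sketch (not from the text) of these four update rules, applied to the dictionary-of-sets descriptors sketched earlier; the variable's own name stands in for its memory location.

def update_for_load(R, x, reg_desc, addr_desc):       # rule 1: LD R, x
    reg_desc[R] = {x}                                 # R now holds only x
    addr_desc[x].add(R)                               # R is an additional location for x

def update_for_store(x, addr_desc):                   # rule 2: ST x, R
    addr_desc[x].add(x)                               # x's own memory location is now current

def update_for_op(Rx, x, reg_desc, addr_desc):        # rule 3: OP Rx, Ry, Rz for x = y op z
    reg_desc[Rx] = {x}
    addr_desc[x] = {Rx}                               # x's only location is Rx (memory is stale)
    for name, locs in addr_desc.items():
        if name != x:
            locs.discard(Rx)                          # Rx no longer holds any other variable

def update_for_copy(Ry, x, reg_desc, addr_desc):      # rule 4: x = y, with y already in Ry
    reg_desc[Ry].add(x)
    addr_desc[x] = {Ry}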

Example 5.16 : Let us translate the basic block consisting of the three-address statements

t = a – b

u = a – c

v = t + u

a = d

d = v + u

where t, u, and v are temporaries local to the block, while a, b, c, and d are

variables that are live on exit from the block. When a register's value is no longer needed, that

register is reused. A summary of the machine-code instructions generated is given below.
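The original figure is not reproduced here; the following is a plausible instruction sequence for this block, a sketch assuming three registers R1, R2 and R3 and the register choices the algorithm would naturally make.

LD R1, a        // R1 = a
LD R2, b        // R2 = b
SUB R2, R1, R2  // R2 = t = a - b
LD R3, c        // R3 = c
SUB R1, R1, R3  // R1 = u = a - c
ADD R3, R2, R1  // R3 = v = t + u
LD R2, d        // R2 = d (for a = d; t is no longer needed)
ADD R1, R3, R1  // R1 = v + u (the new value of d)
ST a, R2        // store a, which is live on exit
ST d, R1        // store d, which is live on exit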


SUMMARY

THE PRINCIPAL SOURCES OF OPTIMIZATION

Semantics-Preserving Transformations - preserve the original program meaning

Global Common Subexpressions

Copy Propagation

Dead-Code Elimination

Code Motion / Movement

Induction Variables and Reduction in Strength

OPTIMIZATION OF BASIC BLOCKS

The DAG Representation of Basic Blocks

Finding Local Common Subexpressions

Dead Code Elimination

The Use of Algebraic Identities

Representation of Array References

Pointer Assignments and Procedure Calls

Reassembling Basic Blocks from DAG's

GLOBAL DATA FLOW ANALYSIS

Paths and points

Reaching Definitions

Data flow analysis of structured Programs

Conservative Estimation of Data-Flow Information

Representation of sets

EFFICIENT DATA FLOW ALGORITHMS

Depth-First Ordering in iterative Algorithms

Structure-based Data-Flow Analysis

ISSUES IN THE DESIGN OF A CODE GENERATOR

Input to the Code Generator

The Target Program

Memory Management

Instruction Selection

Register Allocation

Evaluation Order

PEEPHOLE OPTIMIZATION

Eliminating Redundant Loads and Stores

Eliminating Unreachable Code

Flow-of-Control Optimizations

Algebraic Simplification and Reduction in Strength

Use of Machine Idioms

DATA-FLOW ANALYSIS

The Data-Flow Abstraction

The Data-Flow Analysis Schema

Data-Flow Schemas on Basic Blocks

Reaching Definitions

Live-Variable Analysis

Available Expressions

LOOP OPTIMIZATION

Code Motion: while(i<max-1) {sum=sum+a[i]} => n=max-1; while(i<n) {sum=sum+a[i]}

Induction Variables and Strength Reduction: keep a single induction variable per loop (e.g. i++ or j=j+2) and replace * by +

Loop invariant method

Loop unrolling

Loop fusion

COMPILE TIME EVALUATION

Constant folding: computation on constants is done at compile time. E.g. in Clength=2*(22/7)*r the constant subexpression 2*(22/7) is evaluated by the compiler.

Constant propagation: Value of variable is replaced and computed at compile time.

E.g. pi=3.14; r=6; Area=pi*r*r;, then, Area is computed as 3.14*6*6.

Variable propagation: one variable is replaced by another at compile time.

E.g. x=pi; Area=x*r*r;, then, Area is computed as pi*r*r.


DR. PAULS ENGINEERING COLLEGE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III / VI

Subject Code : CS6660

Subject Name : COMPILER DESIGN

Degree & Branch : B.E C.S.E.

UNIT I INTRODUCTION TO COMPILERS

1. What is a compiler? A compiler is a program that reads a program written in one language –the source language and translates it into an equivalent program in another language-the target language. The compiler reports to its user the presence of errors in the source program.

2. What are the two parts of a compilation? Explain briefly. Analysis and Synthesis are the two parts of compilation. The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The synthesis part constructs the desired target program from the intermediate representation.

3. List the subparts or phases of analysis part. Analysis consists of three phases:

Linear Analysis. Hierarchical Analysis. Semantic Analysis.

4. Depict diagrammatically how a language is processed. Skeletal source program → Preprocessor → Source program → Compiler → Target assembly program → Assembler → Relocatable machine code → Loader/link editor (together with library and relocatable object files) → Absolute machine code

5. What is linear analysis? Linear analysis is one in which the stream of characters making up the source program is read from left to right and grouped into tokens that are sequences of characters having a collective meaning. Also called lexical analysis or scanning.

6. List the various phases of a compiler. The following are the various phases of a compiler:

Lexical Analyzer Syntax Analyzer Semantic Analyzer Intermediate code generator Code optimizer Code generator


7. What are the classifications of a compiler? Compilers are classified as: single-pass, multi-pass, load-and-go, debugging and optimizing compilers.

8. What is a symbol table? A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. Whenever an identifier is detected by a lexical analyzer, it is entered into the symbol table. The attributes of an identifier cannot be determined by the lexical analyzer.

9. Mention some of the cousins of a compiler. Cousins of the compiler are:

Preprocessors Assemblers Loaders and Link-Editors

10. List the phases that constitute the front end of a compiler. The front end consists of those phases or parts of phases that depend primarily on the source language and are largely independent of the target machine. These include

Lexical and Syntactic analysis The creation of symbol table Semantic analysis Generation of intermediate code

A certain amount of code optimization can be done by the front end as well. Also includes error handling that goes along with each of these phases.

11. Mention the back-end phases of a compiler. The back end of compiler includes those portions that depend on the target machine and generally those portions do not depend on the source language, just the intermediate language. These include

Code optimization Code generation, along with error handling and symbol- table operations.

12. Define compiler-compiler. Systems to help with the compiler-writing process are often referred to as compiler-compilers, compiler-generators or translator-writing systems. Largely they are oriented around a particular model of languages, and they are suitable for generating compilers for languages of a similar model.


13. List the various compiler construction tools. The following is a list of some compiler construction tools:

Scanner generators [Lexical Analysis]

Parser generators [Syntax Analysis]

Syntax-directed translation engines [Intermediate Code]

Data-flow analysis engines [Code Optimization]

Code-generator generators [Code Generation]

Compiler-construction toolkits [For all phases]

14. List out language processors

(i) Compiler

(ii) Interpreter

(iii) Hybrid Compiler

(iv) Language processing system (Preprocessors, Assemblers, Linkers and Loader )

15. List out some programming language basics.

To design an efficient compiler we should know some language basics. Important concepts from popular programming languages like C, C++, C#, and Java are listed below.

Some of the Programming Language basics which are used in most of the languages are listed below. They are:

The Static/Dynamic Distinction

Environments and States

Static Scope and Block Structure

Explicit Access Control

Dynamic Scope

Parameter Passing Mechanisms

Aliasing


UNIT II LEXICAL ANALYSIS

1. Write the Needs / Roles / Functions of lexical analyzer

It produces stream of tokens.

It eliminates comments and whitespace.

It keeps track of line numbers.

It reports the error encountered while generating tokens.

It stores information about identifiers, keywords, constants and so on into symbol table.

2. Differentiate tokens, patterns, lexeme.
Token - a sequence of characters that has a collective meaning.
Pattern - there is a set of strings in the input for which the same token is produced as output; this set of strings is described by a rule called a pattern associated with the token.
Lexeme - a sequence of characters in the source program that is matched by the pattern for a token.

2. List the operations on languages. Union - L U M ={s | s is in L or s is in M} Concatenation – LM ={st | s is in L and t is in M} Kleene Closure – L* (zero or more concatenations of L) Positive Closure – L+ ( one or more concatenations of L)

3. Write a regular expression for an identifier. An identifier is defined as a letter followed by zero or more letters or digits. The regular expression for an identifier is given as letter (letter | digit)*

4. Mention the various notational shorthands for representing regular expressions.
One or more instances (+), zero or one instance (?), and character classes, where [abc] stands for the regular expression a | b | c.

5. What is the function of a hierarchical analysis? Hierarchical analysis is one in which the tokens are grouped hierarchically into nested collections with collective meaning. Also termed as Parsing.

6. What does a semantic analysis do? Semantic analysis is one in which certain checks are performed to ensure that components of a program fit together meaningfully. Mainly performs type checking.

7. List the various error recovery strategies for a lexical analysis.


Possible error recovery actions are: deleting an extraneous character, inserting a missing character, replacing an incorrect character by a correct character, and transposing two adjacent characters.

10. Define nullable(n), firstpos(n), lastpos(n) and followpos(p)

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null" or the empty string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n.

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n.

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there is some string x = a1a2 …an in L((r)#) such that for some i, there is a way to explain the membership of x in L((r)#) by matching ai to position p of the syntax tree and ai+1 to position q.


12. Write the algorithm for Converting a Regular Expression Directly to a DFA

Algorithm: Construction of a DFA from a regular expression r.

INPUT: A regular expression r.

OUTPUT: A DFA D that recognizes L(r).

METHOD:

1. Construct a syntax tree T from the augmented regular expression (r)#.
2. Compute nullable, firstpos, lastpos, and followpos for T.
3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D, by the procedure below. The states of D are sets of positions in T. Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions. The start state of D is firstpos(n0), where node n0 is the root of T. The accepting states are those containing the position for the endmarker symbol #.

initialize Dstates to contain only the unmarked state firstpos(n0),
where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ) {
        let U be the union of followpos(p) for all p in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}

13. Write the Structure of Lex Programs

A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions


14. Construct a DFA and firstpos and lastpos for nodes for the regular expression r = (a|b)*abb


UNIT III SYNTAX ANALYSIS

1. Define parser. Hierarchical analysis is one in which the tokens are grouped hierarchically into nested collections with collective meaning. Also termed as Parsing. 2. Mention the basic issues in parsing.

There are two important issues in parsing.

3. Why lexical and syntax analyzers are separated out?

Reasons for separating the analysis phase into lexical and syntax analyzers: Simpler design. Compiler efficiency is improved. Compiler portability is enhanced.

4. Define a context free grammar. A context free grammar G is a collection of the following: a set of non terminals V, a set of terminals T, a start symbol S and a set of production rules P. G can be represented as G = (V,T,S,P). Production rules are given in the following form: Non terminal → (V U T)*

5. Briefly explain the concept of derivation. Derivation from S means generation of string w from S. For constructing derivation two things are important.

i) Choice of non terminal from several others. ii) Choice of rule from production rules for corresponding non terminal.

Instead of choosing an arbitrary non terminal, one can choose i) leftmost derivation - the leftmost non terminal in a sentential form, or ii) rightmost derivation - the rightmost non terminal in a sentential form.

6. Define ambiguous grammar. A grammar G is said to be ambiguous if it generates more than one parse tree for some sentence of the language L(G), i.e. if some sentence has more than one leftmost (or more than one rightmost) derivation.

7. What is a operator precedence parser? A grammar is said to be operator precedence if it possess the following properties: 1. No production on the right side is ε. 2. There should not be any production rule possessing two adjacent non terminals at the right hand side.

8. List the properties of LR parser. 1. LR parsers can be constructed to recognize most of the programming

languages for which the context free grammar can be written. 2. The class of grammar that can be parsed by LR parser is a superset of

class of grammars that can be parsed using predictive parsers.


3. LR parsers work using non backtracking shift reduce technique yet it is efficient one.

9. Mention the types of LR parser.
- simple LR parser (SLR)
- canonical LR parser
- lookahead LR parser (LALR)

10. What are the problems with top down parsing? The following are the problems associated with top down parsing: left recursion, backtracking, and ambiguity.

11. Write the algorithm for FIRST and FOLLOW.

FIRST
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non terminal and X → Y1 Y2 ... Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), ..., FIRST(Yi-1).

FOLLOW
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).

12. List the advantages and disadvantages of operator precedence parsing. Advantages: This type of parsing is simple to implement. Disadvantages: 1. An operator like minus has two different precedences (unary and binary), hence it is hard to handle tokens like the minus sign. 2. This kind of parsing is applicable to only a small class of grammars.

13. What is dangling else problem? The dangling-else ambiguity arises in the grammar shown below, in which an else can be matched with more than one then: stmt → if expr then stmt | if expr then stmt else stmt | other

14. Write short notes on YACC. YACC is an automatic tool for generating the parser program. YACC stands for Yet Another Compiler Compiler which is basically the utility available from UNIX. Basically YACC is LALR parser generator. It can report conflict or ambiguities in the form of error messages.

15. What is meant by handle pruning? A rightmost derivation in reverse can be obtained by handle pruning. If w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-sentential form of some as yet unknown rightmost derivation


S = γ0 => γ1 => ... => γn-1 => γn = w

16. Define LR(0) items.

An LR(0) item of a grammar G is a production of G with a dot at some position of the right side. Thus, production A → XYZ yields the four items A→.XYZ A→X.YZ A→XY.Z A→XYZ.

17. What is meant by viable prefixes? The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes. An equivalent definition of a viable prefix is that it is a prefix of a right sentential form that does not continue past the right end of the rightmost handle of that sentential form.

18. Define handle. A handle of a string is a substring that matches the right side of a production, and whose reduction to the nonterminal on the left side of the production represents one step along the reverse of a rightmost derivation. A handle of a right-sentential form γ is a production A → β and a position of γ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ. That is, if S =>* αAw => αβw, then A → β in the position following α is a handle of αβw.

19. What are kernel & non-kernel items? Kernel items, which include the initial item S' → .S and all items whose dots are not at the left end. Non-kernel items, which have their dots at the left end.

20. What is phrase level error recovery? Phrase level error recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines. These routines may change, insert, or delete symbols on the input and issue appropriate error messages. They may also pop from the stack.


UNIT IV SYNTAX DIRECTED TRANSLATION & RUN TIME ENVIRONMENT

1. What are the benefits of intermediate code generation? A compiler for a different machine can be created by attaching a different back end to the existing front end. A compiler for a different source language can be created by providing a different front end for the corresponding source language to the existing back end. A machine independent code optimizer can be applied to the intermediate code in order to optimize the code generation.
2. What are the various types of intermediate code representation? There are mainly three types of intermediate code representations: syntax trees, postfix notation and three address code.

3. Define backpatching.

Backpatching is the activity of filling up unspecified information of labels using appropriate semantic actions during the code generation process. In the semantic actions the functions used are mklist(i), merge_list(p1,p2) and backpatch(p,i).

4. Mention the functions that are used in backpatching.
mklist(i) - creates a new list containing only i, an index into the array of quadruples, and returns a pointer to it.
merge_list(p1,p2) - concatenates the two lists pointed to by p1 and p2 and returns the pointer to the concatenated list.
backpatch(p,i) - inserts i as the target label for each of the statements on the list pointed to by p.

5. What is the intermediate code representation for the expression a or b and not c? The intermediate code representation for the expression a or b and not c is the three address sequence t1 := not c t2 := b and t1 t3 := a or t2

6. What are the various methods of implementing three address statements? The three address statements can be implemented using the following methods.
Quadruples - a record structure with four fields: operator (OP), arg1, arg2 and result.
Triples - the result field is omitted and a statement is referred to by its position, so temporary names need not be entered into the symbol table.
Indirect triples - a listing of pointers to triples is used instead of listing the triples themselves.
7. Give the syntax-directed definition for if-else statement.

1. S → if E then S1
E.true := new_label()
E.false := S.next
S1.next := S.next
S.code := E.code || gen_code(E.true ':') || S1.code

2. S → if E then S1 else S2
E.true := new_label()
E.false := new_label()
S1.next := S.next
S2.next := S.next
S.code := E.code || gen_code(E.true ':') || S1.code || gen_code('go to', S.next) || gen_code(E.false ':') || S2.code


UNIT V CODE OPTIMIZATION AND CODE GENERATION

1. Mention the properties that a code generator should possess.
The code generator should produce correct and high-quality target code; in other words, the code generated should be such that it makes effective use of the resources of the target machine.

2. List the terminologies used in basic blocks.

Define and use – the three address statement a:=b+c is said to define a and to use b and c.

Live and dead – the name in the basic block is said to be live at a given point if its value is used after that point in the program. And the name in the basic block is said to be dead at a given point if its value is never used after that point in the program.

3. What is a flow graph? A flow graph is a directed graph in which the flow control information is added to the basic blocks.

There is an edge from block B1 to block B2 if B2 immediately follows B1 in the given sequence. We can say that B1 is a predecessor of B2.

4. What is a DAG? Mention its applications. Directed acyclic graph (DAG) is a useful data structure for implementing transformations on basic blocks. DAG is used in
- determining the common sub-expressions.
- determining which identifiers have their values used in the block.
- determining which statements compute values that could be used outside the block.
- simplifying the list of quadruples by eliminating the common sub-expressions and not performing the assignment of the form x := y unless and until it is a must.

5. Define peephole optimization. Peephole optimization is a simple and effective technique for locally improving target code. This technique is applied to improve the performance of the target program by examining the short sequence of target instructions and replacing these instructions by shorter or faster sequence.

6. List the characteristics of peephole optimization.
- Redundant instruction (loads and stores) elimination
- Unreachable code elimination
- Flow-of-control optimizations
- Algebraic simplification and reduction in strength
- Use of machine idioms

7. How do you calculate the cost of an instruction?

The cost of an instruction can be computed as one plus the costs associated with the source and destination addressing modes (their added cost). For example:
MOV R0, R1 costs 1
MOV R1, M costs 2
SUB 5(R0), *10(R1) costs 3

8. What is a basic block? A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching. Eg. t1:=a*5 t2:=t1+7 t3:=t2-5 t4:=t1+t3 t5:=t2+b

9. Mention the issues to be considered while applying the techniques for code optimization.
- The meaning of the source program must not be changed.
- The improvement over the program efficiency must be achieved without changing the algorithm of the program.

10. What are the basic goals of code movement? To reduce the size of the code i.e. to obtain the space complexity. To reduce the frequency of execution of code i.e. to obtain the time complexity.

11. What do you mean by machine dependent and machine independent optimization?
Machine dependent optimization exploits the characteristics of the target machine, the instruction set used and the addressing modes used for the instructions, to produce efficient target code.
Machine independent optimization is applied on the code written in programming languages; it relies on appropriate programming structure and usage of efficient arithmetic properties in order to reduce the execution time.

12. What are the different data flow properties?
- Reaching definitions
- Live variables
- Available expressions

13. List the different storage allocation strategies. The strategies are:
- Static allocation
- Stack allocation
- Heap allocation

14. What are the contents of activation record?
The activation record is a block of memory used for managing the information needed by a single execution of a procedure. Various fields of the activation record are: return value, actual parameters, control link, access link, saved machine status, local variables and temporaries.


15. What is dynamic scoping? In dynamic scoping a use of a non-local variable refers to the non-local data declared in the most recently called and still active procedure. Therefore, each time, new bindings are set up for the local names of the called procedure. In dynamic scoping symbol tables may be required at run time. 16. Define symbol table. A symbol table is a data structure used by the compiler to keep track of semantics of the variables. It stores information about scope and binding information about names. 17. What is code motion?

Code motion is an optimization technique in which amount of code in a loop is decreased. This transformation is applicable to the expression that yields the same result independent of the number of times the loop is executed. Such an expression is placed before the loop.

18. What are the properties of optimizing compiler? The source code should be such that it should produce a minimum amount of target code. There should not be any unreachable code. Dead code should be completely removed from the source language. The optimizing compilers should apply the following code improving transformations on the source language.

i) common subexpression elimination ii) dead code elimination iii) code movement iv) strength reduction

19. What are the various ways to pass a parameter in a function?
- Call by value
- Call by reference
- Call by copy-restore
- Call by name

20. Suggest a suitable approach for computing hash function. Using hash function we should obtain exact locations of name in symbol table. The hash function should result in uniform distribution of names in symbol table. The hash function should be such that there will be minimum number of collisions. Collision is such a situation where hash function results in same location for storing the names.

