Course Revision: Lexical Analysis, Our Compiler Structure, Tokens, Lexemes & Patterns...

LESSON 32

Upload: donald-rice

Post on 17-Jan-2018


TRANSCRIPT

Page 1: Course Revision.. Contents  Lexical Analysis  Our Compiler Structure  Tokens, Lexemes & Patterns  RE’s & Regular Definitions  Transition Diagrams

LESSON 32

Page 2:

Course Revision..

Page 3:

Contents

Lexical Analysis: Our Compiler Structure; Tokens, Lexemes & Patterns; RE's & Regular Definitions; Transition Diagrams; FA, DFA & NFA

Syntax Analysis: Parsers; CFG & Derivations; Ambiguity; Recursive Descent Parsers; LL(1) Parsing

Page 4:

Contents..

LR Parsers (Simple, Canonical & LALR)

Syntax Directed Translation: Inherited and Synthesized Attributes; Dependency Graphs; S-Attributed & L-Attributed Definitions; Implementing S-Attributed & L-Attributed SDD's

Intermediate Code Generation: Intermediate Representation; Translate Expressions; Check Types & Declarations; Boolean Values & Jumping Code; Control Flow; Implement Records

Page 5:

Regular Expressions

A regular expression is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.

The regular expressions over an alphabet consist of the alphabet symbols and expressions built from them using union, concatenation, and the Kleene star (*).

Each regular expression r denotes a language L(r), which is defined recursively from the languages denoted by r's sub-expressions.
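As a minimal sketch, Python's re module can illustrate a regular expression acting as a search pattern; the pattern (a | b)* abb from the NFA figure in this revision is a convenient example (the helper name matches is ours, not from the slides):

```python
import re

# The pattern (a|b)*abb: any string of a's and b's that ends in abb.
pattern = re.compile(r"(a|b)*abb")

def matches(s: str) -> bool:
    """Return True if the whole string s is in L((a|b)*abb)."""
    return pattern.fullmatch(s) is not None

print(matches("abb"))      # True: (a|b)* matches the empty string
print(matches("aababb"))   # True: any a/b prefix followed by abb
print(matches("abab"))     # False: does not end in abb
```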

Page 6:

Regular Definitions

Complex collections of languages, such as the patterns that describe the tokens of a programming language, are often specified by a regular definition: a sequence of statements, each of which defines one variable to stand for some regular expression.

The regular expression for one variable can use previously defined variables.

Extended Regular-Expression Notation

A number of additional operators may appear as shorthands in regular expressions to make patterns easier to express. Examples include the + operator (one or more), ? (zero or one), and character classes (the union of the strings each consisting of one of the listed characters).
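A regular definition and the extended shorthands can be sketched by building named Python strings, each referring to the previous one; the classic unsigned-number definition (digit, digits, number) is used here as an assumed example:

```python
import re

# A regular definition: each name stands for a regular expression and may
# use previously defined names.
#   digit  -> [0-9]
#   digits -> digit+
#   number -> digits (. digits)? (E (+|-)? digits)?
digit  = r"[0-9]"
digits = digit + r"+"                          # '+' shorthand: one or more
number = digits + r"(\." + digits + r")?" + r"([Ee][+-]?" + digits + r")?"
                                               # '?' shorthand: zero or one

number_re = re.compile(number)

def is_number(s: str) -> bool:
    return number_re.fullmatch(s) is not None

print(is_number("6.336E4"))  # True
print(is_number("42"))       # True
print(is_number("3."))       # False: digits are required after the point
```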


Page 7:

Transition Diagrams

The behavior of a lexical analyzer can often be described by a transition diagram.

These diagrams have states, each of which represents something about the history of the characters seen during the current search for a lexeme that matches one of the possible patterns.

[Figure: transition graph of an NFA recognizing the language of the regular expression (a | b)* abb]

Page 8:

Transition Diagrams..

Transition diagrams have a collection of nodes or circles, called states.

Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another.

Each edge is labeled by a symbol or set of symbols.
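A transition diagram translates almost directly into code: each state inspects the next character and follows a labeled edge. Below is a hypothetical sketch for the usual identifier diagram, letter (letter | digit)*, with a retraction on the accepting exit:

```python
def scan_identifier(text: str, pos: int = 0):
    """Return (lexeme, next_pos) if an identifier starts at pos, else None."""
    state = 0
    start = pos
    while True:
        ch = text[pos] if pos < len(text) else ""   # "" acts as end-of-input
        if state == 0:                  # start state: need a letter
            if ch.isalpha():
                state, pos = 1, pos + 1
            else:
                return None
        elif state == 1:                # loop state: letters or digits
            if ch.isalnum():
                pos += 1                # stay in state 1
            else:
                return text[start:pos], pos   # accepting exit: retract

print(scan_identifier("count1 = 0"))   # ('count1', 6)
print(scan_identifier("123abc"))       # None: must begin with a letter
```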

Page 9:

FA

Finite automata are a formalization of transition diagrams that includes a designation of a start state and one or more accepting states, as well as the set of states, input symbols, and transitions among states.

Accepting states indicate that the lexeme for some token has been found.

Unlike transition diagrams, finite automata can make transitions on empty input as well as on input characters.

Page 10:

DFA & NFA

A deterministic finite automaton (DFA) is a special case of an NFA in which:

there are no moves on input ε, and

for each state s and input symbol a, there is exactly one edge out of s labeled a.

The DFA is easily simulated and makes a good implementation of a lexical analyzer, similar to a transition diagram.
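The ease of simulation can be sketched with a transition table; the table below is the standard four-state DFA for (a | b)* abb (an assumed worked example, not reproduced from the slides):

```python
# DFA for (a|b)*abb: state 3 is the only accepting state.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START, ACCEPT = 0, {3}

def dfa_accepts(s: str) -> bool:
    state = START
    for ch in s:
        if ch not in DFA[state]:
            return False              # no edge for this symbol: reject
        state = DFA[state][ch]
    return state in ACCEPT

print(dfa_accepts("abb"))     # True
print(dfa_accepts("ababb"))   # True
print(dfa_accepts("abba"))    # False
```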

Page 11:

DFA & NFA..

Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and ɛ, the empty string, is a possible label.

Both deterministic and nondeterministic finite automata are capable of recognizing the same languages.

NFA's are often easier to design than DFA's.
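An NFA is simulated by tracking the set of states reachable so far. A minimal sketch for the NFA of (a | b)* abb (states 0 to 3, no ε-moves; the table is an assumed example):

```python
# On 'a', state 0 may loop or guess that the final "abb" has begun:
# that choice is what makes the automaton nondeterministic.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, {3}

def nfa_accepts(s: str) -> bool:
    current = {START}
    for ch in s:
        # Follow every labeled edge out of every current state.
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
    return bool(current & ACCEPT)

print(nfa_accepts("aababb"))  # True
print(nfa_accepts("abab"))    # False
```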

Page 12:

DFA & NFA…

It is possible to convert any regular expression into an NFA of about the same size that recognizes the same language as the regular expression defines.

Any NFA can be converted to a DFA for the same pattern, although in the worst case the size of the automaton can grow exponentially.

It is also possible to convert any nondeterministic or deterministic finite automaton into a regular expression that defines the same language recognized by the finite automaton.
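The NFA-to-DFA conversion (the subset construction) can be sketched compactly: each DFA state is a set of NFA states. This hypothetical helper reuses the (a | b)* abb NFA table from above and assumes no ε-moves:

```python
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}

def subset_construction(start, alphabet):
    """Return the DFA states (frozensets of NFA states) and transitions."""
    start_set = frozenset({start})
    states, worklist, delta = {start_set}, [start_set], {}
    while worklist:
        S = worklist.pop()
        for ch in alphabet:
            # T = set of NFA states reachable from S on ch.
            T = frozenset().union(*(NFA.get((q, ch), set()) for q in S))
            delta[(S, ch)] = T
            if T not in states:
                states.add(T)
                worklist.append(T)
    return states, delta

states, delta = subset_construction(0, "ab")
print(len(states))   # 4: this NFA yields a DFA no bigger than itself
```

The exponential worst case does not show up here; for this NFA the construction visits only four subsets.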

Page 13:

DFA & NFA…
Minimization of Finite Automata

For every DFA there is a minimum-state DFA accepting the same language.

The minimum-state DFA for a given language is unique except for the names given to the various states.

There is also a family of software systems, including Lex and Flex, that are lexical-analyzer generators. The user specifies the patterns for tokens using an extended regular-expression notation. Lex converts these expressions into a lexical analyzer that is essentially a deterministic finite automaton recognizing any of the patterns.

Page 14:

Syntax Analysis

Page 15:

Parsers

A parser takes as input tokens from the lexical analyzer and treats the token names as terminal symbols of a context-free grammar.

The parser then constructs a parse tree for its input sequence of tokens; the parse tree may be constructed figuratively (by going through the corresponding derivation steps) or literally.

Page 16:

Context-Free Grammars

A grammar specifies a set of terminal symbols (inputs), another set of non-terminals (symbols representing syntactic constructs), and a set of productions each of which gives a way in which strings represented by one non-terminal can be constructed from terminal symbols and strings represented by certain other non-terminals.

A production consists of a head (the nonterminal to be replaced) and a body (the replacing string of grammar symbols).

Page 17:

Derivations & Parse Trees

The process of starting with the start non-terminal of a grammar and successively replacing it by the body of one of its productions is called a derivation.

If the leftmost (or rightmost) non-terminal is always replaced, the derivation is called leftmost (respectively, rightmost).

A parse tree is a picture of a derivation, in which there is a node for each non-terminal that appears in the derivation.

The children of a node are the symbols by which that non-terminal is replaced in the derivation.

Page 18:

Ambiguity

A grammar for which some terminal string has two or more different parse trees (equivalently, two or more leftmost derivations, or two or more rightmost derivations) is said to be ambiguous.

In most cases it is possible to redesign an ambiguous grammar so it becomes an unambiguous grammar for the same language.

However, ambiguous grammars with certain tricks applied sometimes lead to more efficient parsers.

Page 19:

Ambiguity..

Example: id + id * id has two leftmost derivations, and hence two parse trees.
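The slide's figure is not reproduced here; as a sketch, the two parse trees for id + id * id under the ambiguous grammar E -> E + E | E * E | id can be written as nested tuples (node label first, then children), and both have the same yield:

```python
# tree1 groups the multiplication first (+ at the root);
# tree2 groups the addition first (* at the root).
tree1 = ("E", ("E", "id"), "+", ("E", ("E", "id"), "*", ("E", "id")))
tree2 = ("E", ("E", ("E", "id"), "+", ("E", "id")), "*", ("E", "id"))

def yield_of(tree):
    """Read the terminals off a parse tree's leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    out = []
    for c in children:
        out.extend(yield_of(c))
    return out

# Both trees derive the same terminal string: that is the ambiguity.
print(yield_of(tree1))  # ['id', '+', 'id', '*', 'id']
print(yield_of(tree2))  # ['id', '+', 'id', '*', 'id']
```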

Page 20:

Parsing: Top-Down and Bottom-Up

Parsers are generally distinguished by whether they work top-down (start with the grammar's start symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that form the leaves and build the tree from the bottom).

Top-down parsers include recursive-descent and LL parsers, while the most common forms of bottom-up parsers are LR parsers.

Grammars suitable for top-down parsing are often harder to design than those used by bottom-up parsers.

For top-down parsing it is necessary to eliminate left recursion and to left-factor the grammar.

Page 21:

Recursive Descent Parsers

These parsers use a procedure for each non-terminal.

The procedure looks at its input and decides which production to apply for its non-terminal.

Terminals in the body of the production are matched to the input at the appropriate time, while non-terminals in the body result in calls to their procedures.

Backtracking is a possibility, in case the wrong production was chosen.
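A one-procedure-per-non-terminal parser can be sketched for the standard non-left-recursive expression grammar (the class and helper names below are ours); with one-symbol look-ahead this grammar needs no backtracking:

```python
#   E  -> T E'      E' -> + T E' | epsilon
#   T  -> F T'      T' -> * F T' | epsilon
#   F  -> ( E ) | id
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # end marker
        self.pos = 0

    def look(self):
        return self.tokens[self.pos]

    def match(self, t):                # consume one expected terminal
        if self.look() != t:
            raise SyntaxError(f"expected {t}, saw {self.look()}")
        self.pos += 1

    def E(self):                       # E -> T E', with E' unrolled as a loop
        self.T()
        while self.look() == "+":
            self.match("+"); self.T()

    def T(self):
        self.F()
        while self.look() == "*":
            self.match("*"); self.F()

    def F(self):
        if self.look() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

def parses(tokens):
    p = Parser(tokens)
    try:
        p.E()
        return p.look() == "$"
    except SyntaxError:
        return False

print(parses(["id", "+", "id", "*", "id"]))   # True
print(parses(["id", "+", "*"]))               # False
```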

Page 22:

LL(1) Parsers

A grammar for which it is possible to choose the correct production with which to expand a given non-terminal by looking only at the next input symbol is called LL(1).

These grammars allow us to construct a predictive parsing table that gives, for each non-terminal and each look-ahead symbol, the correct choice of production.

Error correction can be facilitated by placing error routines in some or all of the table entries that have no legitimate production.
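A predictive parsing table drives a stack-based parser directly. The sketch below uses the textbook LL(1) table for the same expression grammar as above (the table entries are assumed, not taken from the slides); a missing entry is exactly the "no legitimate production" case where an error routine would go:

```python
# TABLE[(nonterminal, lookahead)] -> production body to push.
TABLE = {
    ("E",  "id"): ["T", "E'"],      ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"],      ("T",  "("): ["F", "T'"],
    ("T'", "+"):  [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"):  [], ("T'", "$"): [],
    ("F",  "id"): ["id"],           ("F",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack, i = ["$", "E"], 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            body = TABLE.get((top, tokens[i]))
            if body is None:
                return False          # empty table entry: syntax error
            stack.extend(reversed(body))
        elif top == tokens[i]:
            i += 1                    # terminal matched
        else:
            return False
    return i == len(tokens)

print(ll1_parse(["id", "+", "id"]))   # True
print(ll1_parse(["+", "id"]))         # False
```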

Page 23:

Shift-Reduce Parsing

Bottom-up parsers generally operate by choosing, on the basis of the next input symbol (the look-ahead) and the contents of the stack, whether to shift the next input onto the stack or to reduce some symbols at the top of the stack.

A reduce step takes a production body at the top of the stack and replaces it by the head of the production.

In shift-reduce parsing, the stack contents are always a viable prefix, that is, a prefix of some right-sentential form that ends no further right than the end of the handle of that right-sentential form.

The handle is the substring that was introduced in the last step of the rightmost derivation of that sentential form.
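The shift and reduce moves can be sketched for the tiny assumed grammar E -> E + id | id, where reducing the body on top of the stack to the head E is always safe:

```python
def shift_reduce(tokens):
    """Return (accepted, trace) for the grammar E -> E + id | id."""
    stack, i, trace = [], 0, []
    tokens = tokens + ["$"]
    while True:
        if stack[-3:] == ["E", "+", "id"]:            # handle on top: reduce
            stack[-3:] = ["E"]; trace.append("reduce E -> E + id")
        elif stack[-1:] == ["id"]:
            stack[-1:] = ["E"]; trace.append("reduce E -> id")
        elif tokens[i] == "$":
            return stack == ["E"], trace
        else:
            stack.append(tokens[i]); i += 1
            trace.append(f"shift {stack[-1]}")

ok, trace = shift_reduce(["id", "+", "id"])
print(ok)      # True
print(trace)   # shift/reduce moves, ending with reduce E -> E + id
```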

Page 24:

LR Parsers

Each of the several kinds of LR parsers operates by first constructing the sets of valid items (called LR states) for all possible viable prefixes, and keeping track of the state for each prefix on the stack.

The set of valid items guides the shift-reduce parsing decisions.

We prefer to reduce if there is a valid item with the dot at the right end of the body, and we prefer to shift the look-ahead symbol onto the stack if that symbol appears immediately to the right of the dot in some valid item.

Page 25:

LR Parsers..

Simple LR Parsers

In an SLR parser, we perform a reduction implied by a valid item with a dot at the right end, provided the look-ahead symbol can follow the head of that production in some sentential form.

The grammar is SLR, and this method can be applied, if there are no parsing-action conflicts; that is, for no set of items and no look-ahead symbol are there two productions to reduce by, nor both the option to reduce and the option to shift.

Page 26:

LR Parsers...
Canonical LR Parsers

This more complex form of LR parser uses items that are augmented by the set of look-ahead symbols that can follow the use of the underlying production.

Reductions are only chosen when there is a valid item with the dot at the right end, and the current look-ahead symbol is one of those allowed for this item.

A canonical-LR parser can avoid some of the parsing-action conflicts that are present in SLR parsers, but often has many more states than the SLR parser for the same grammar.

Page 27:

LR Parsers...
Look-ahead LR Parsers

LALR parsers offer many of the advantages of SLR and Canonical-LR parsers, by combining the states that have the same kernels (sets of items, ignoring the associated look-ahead sets).

Thus, the number of states is the same as that of the SLR parser, but some parsing-action conflicts present in the SLR parser may be removed in the LALR parser.

LALR parsers have become the method of choice in practice.

Page 28:

Syntax Directed Translation

Page 29:

Inherited & Synthesized Attributes

In syntax-directed translation we construct a parse tree or a syntax tree, and then compute the values of attributes at the nodes of the tree by visiting them.

Syntax-directed definitions may use two kinds of attributes.

A synthesized attribute at a parse-tree node is computed from attributes at its children.

An inherited attribute at a node is computed from attributes at its parent and/or siblings.
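Bottom-up evaluation of a synthesized attribute can be sketched over a parse tree written as nested tuples: a node's val attribute is computed purely from its children's val attributes (the representation is an assumed one, not from the slides):

```python
def val(node):
    """Synthesized 'val' attribute: computed from the children's values."""
    if isinstance(node, int):          # a digit leaf: val is the lexeme value
        return node
    op, left, right = node
    if op == "+":
        return val(left) + val(right)
    if op == "*":
        return val(left) * val(right)
    raise ValueError(f"unknown operator {op}")

# Annotated parse tree for 3 * 5 + 4, with the multiplication grouped first:
tree = ("+", ("*", 3, 5), 4)
print(val(tree))   # 19
```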

Page 30:

Dependency Graphs

Given a parse tree and an SDD, we draw edges among the attribute instances associated with each parse-tree node to denote that the value of the attribute at the head of the edge is computed in terms of the value of the attribute at the tail of the edge.

[Figure: dependency graph over the parse tree for the multiplication grammar, e.g. 3*5; the black dotted lines are the parse-tree edges.]

Page 31:

Dependency Graphs..
Cyclic Definitions

In problematic SDD's, we find that there are some parse trees for which it is impossible to find an order in which we can compute all the attributes at all nodes.

These parse trees have cycles in their associated dependency graphs.

It is intractable to decide whether an SDD has such circular dependency graphs.

Page 32:

S-Attributed & L-Attributed Definitions

In an S-attributed SDD, all attributes are synthesized.

In an L-attributed SDD, attributes may be inherited or synthesized.

However, inherited attributes at a parse-tree node may depend only on inherited attributes of its parent and on (any) attributes of siblings to its left.

Each node in a syntax tree represents a construct; the children of the node represent the meaningful components of the construct.

Page 33:

Implementing S & L-Attributed SDD's
Implementing S-Attributed SDD's

An S-attributed definition can be implemented by an SDT in which all actions are at the end of the production (a "postfix" SDT).

The actions compute the synthesized attributes of the production head in terms of synthesized attributes of the symbols in the body. If the underlying grammar is LR, then this SDT can be implemented on the LR parser stack.

Eliminating Left Recursion From SDT's

If an SDT has only side effects (no attributes are computed), then the standard left-recursion-elimination algorithm for grammars allows us to carry the actions along as if they were terminals.

When attributes are computed, we can still eliminate left recursion if the SDT is a postfix SDT.

Page 34:

Implementing S & L-Attributed SDD's..

Implementing L-attributed SDD's by Recursive-Descent Parsing

If we have an L-attributed definition on a top-down parsable grammar, we can build a recursive-descent parser with no backtracking to implement the translation.

Inherited attributes become arguments of the functions for their non-terminals, and synthesized attributes are returned by those functions.

Page 35:

Implementing S & L-Attributed SDD's...

Implementing L-attributed SDD's on an LL Grammar

Every L-attributed definition with an underlying LL grammar can be implemented along with the parse.

Records to hold the synthesized attributes for a non-terminal are placed below that non-terminal on the stack, while inherited attributes for a non-terminal are stored with that non-terminal on the stack.

Action records are also placed on the stack to compute attributes at the appropriate time.

Page 36:

Implementing S & L-Attributed SDD's...

Implementing L-attributed SDD's on an LL Grammar, Bottom-Up

An L-attributed definition with an underlying LL grammar can be converted to a translation on an LR grammar and the translation performed in connection with a bottom-up parse.

The grammar transformation introduces "marker" non-terminals that appear on the bottom-up parser's stack and hold inherited attributes of the non-terminal above it on the stack.

Synthesized attributes are kept with their non-terminal on the stack.

Page 37:

Intermediate Code Generation

Page 38:

Intermediate Representation

We can combine the pieces learned in the intermediate code generation section to build a simple compiler front end.

The front end can be built incrementally:

Pick an intermediate representation. An intermediate representation is typically a combination of a graphical notation and three-address code. As in syntax trees, a node in a graphical notation represents a construct, and the children of a node represent its sub-constructs. Three-address code takes its name from instructions of the form x = y op z, with at most one operator per instruction. There are additional instructions for control flow.

Page 39:

Translate Expressions

Expressions with built-up operations can be unwound into a sequence of individual operations by attaching actions to each production of the form E → E1 op E2.

The action either creates a node for E with the nodes for E1 and E2 as children, or it generates a three-address instruction that applies op to the addresses for E1 and E2 and puts the result into a new temporary name, which becomes the address for E.
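The three-address form of this action can be sketched as follows: each operation gets a fresh temporary, and the function returns the address holding a sub-expression's value (the helper names new_temp and gen are ours):

```python
code = []
temp_count = 0

def new_temp():
    """Create a fresh temporary name t1, t2, ..."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(node):
    """Return the address of node's value, emitting code as a side effect."""
    if isinstance(node, str):        # an id: its address is its own name
        return node
    op, e1, e2 = node
    a1, a2 = gen(e1), gen(e2)
    t = new_temp()
    code.append(f"{t} = {a1} {op} {a2}")   # one operator per instruction
    return t

addr = gen(("+", "a", ("*", "b", "c")))
print(code)   # ['t1 = b * c', 't2 = a + t1']
print(addr)   # 't2'
```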

Page 40:

Check Types

The type of an expression E1 op E2 is determined by the operator op and the types of E1 and E2.

A coercion is an implicit type conversion, such as from integer to float.

Intermediate code contains explicit type conversions to ensure an exact match between operand types and the types expected by an operator.
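A minimal sketch of this rule, assuming only the types int and float and the single int-to-float coercion (the helper name is ours):

```python
def result_type(t1, t2):
    """Type of E1 op E2 for an arithmetic op, with int -> float coercion."""
    if t1 == t2:
        return t1
    if {t1, t2} == {"int", "float"}:
        # The int operand is implicitly widened; real intermediate code
        # would also emit an explicit conversion instruction here.
        return "float"
    raise TypeError(f"incompatible operand types: {t1}, {t2}")

print(result_type("int", "int"))     # int
print(result_type("int", "float"))   # float
```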

Page 41:

Declarations
Use a symbol table to implement declarations

A declaration specifies the type of a name.

The width of a type is the amount of storage needed for a name with that type.

Using widths, the relative address of a name at run time can be computed as an offset from the start of a data area.

The type and relative address of a name are put into the symbol table when its declaration is processed, so the translator can subsequently retrieve them when the name appears in an expression.
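Computing relative addresses from widths can be sketched with a running offset (the widths below are assumed, as is the helper name layout):

```python
WIDTH = {"int": 4, "float": 8}       # hypothetical target widths in bytes

def layout(declarations):
    """declarations: list of (name, type). Returns {name: (type, offset)}."""
    table, offset = {}, 0
    for name, ty in declarations:
        table[name] = (ty, offset)   # relative address = current offset
        offset += WIDTH[ty]          # next name starts after this one
    return table

print(layout([("i", "int"), ("x", "float"), ("j", "int")]))
# {'i': ('int', 0), 'x': ('float', 4), 'j': ('int', 12)}
```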

Page 42:

Flatten Arrays

For quick access, array elements are stored in consecutive locations.

Arrays of arrays are flattened so they can be treated as a one-dimensional array of individual elements.

The type of an array is used to calculate the address of an array element relative to the base of the array.
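For a two-dimensional array stored row-major, the flattening formula can be sketched directly (the function name and parameters are ours, assuming an array with cols columns and elements of the given width):

```python
def element_offset(i, j, cols, width):
    """Offset of a[i][j] from the array base, row-major layout."""
    return (i * cols + j) * width

# a[2][3] in an int a[4][5] with 4-byte ints:
print(element_offset(2, 3, cols=5, width=4))   # 52
```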

Generate jumping code for boolean expressions

In short-circuit or jumping code, the value of a boolean expression is implicit in the position reached in the code.

Page 43:

Boolean Values & Jumping Code

Jumping code is useful because a boolean expression B is typically used for control flow, as in if (B) S.

Boolean values can be computed by jumping to t = true or t = false, as appropriate, where t is a temporary name.

Using labels for jumps, a boolean expression can be translated by inheriting labels corresponding to its true and false exits.

The constants true and false translate into a jump to the true and false exits, respectively.
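These rules can be sketched as a generator that inherits the two exit labels; constants become unconditional jumps and a comparison becomes a conditional jump followed by a jump to the false exit (the representation of B is an assumed one):

```python
code = []

def gen_bool(b, true_label, false_label):
    """Emit jumping code for b with inherited true/false exit labels."""
    if b is True:
        code.append(f"goto {true_label}")
    elif b is False:
        code.append(f"goto {false_label}")
    else:                               # a comparison like ('<', 'x', 'y')
        op, e1, e2 = b
        code.append(f"if {e1} {op} {e2} goto {true_label}")
        code.append(f"goto {false_label}")

gen_bool(("<", "x", "y"), "L1", "L2")
print(code)   # ['if x < y goto L1', 'goto L2']
```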

Page 44:

Control Flow
Implement statements using control flow

Statements can be translated by inheriting a label next, where next marks the first instruction after the code for this statement.

The conditional S → if (B) S1 can be translated by attaching a new label marking the beginning of the code for S1 and passing the new label and S.next for the true and false exits, respectively, of B.
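The translation of S -> if (B) S1 can be sketched by wiring the labels exactly as described: a fresh label marks the start of S1's code, and B gets that label and S.next as its true and false exits (the generator callbacks below are hypothetical stand-ins for B's and S1's translations):

```python
def gen_if(gen_B, gen_S1, next_label):
    """Translate if (B) S1, given generators for B's and S1's code."""
    out = []
    L1 = "L1"                          # freshly generated label (sketch)
    out += gen_B(true=L1, false=next_label)   # B's true exit is S1's start
    out.append(f"{L1}:")
    out += gen_S1()
    return out                         # falls through to next_label after S1

out = gen_if(
    gen_B=lambda true, false: [f"if x < y goto {true}", f"goto {false}"],
    gen_S1=lambda: ["x = y"],
    next_label="L2",
)
print(out)   # ['if x < y goto L1', 'goto L2', 'L1:', 'x = y']
```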

Page 45:

Implement Records

Field names in a record or class can be treated as a sequence of declarations.

A record type encodes the types and relative addresses of the fields.

A symbol table object can be used for this purpose.
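Treating the fields as a sequence of declarations, a record type can be sketched as a small symbol table mapping each field to its type and relative address, plus the record's total width (widths and helper name are assumed):

```python
WIDTH = {"int": 4, "float": 8}       # hypothetical target widths in bytes

def record_type(fields):
    """fields: list of (name, type).
    Returns ({field: (type, offset)}, total record width)."""
    table, offset = {}, 0
    for name, ty in fields:
        table[name] = (ty, offset)   # field's address relative to the record
        offset += WIDTH[ty]
    return table, offset

fields, width = record_type([("tag", "int"), ("value", "float")])
print(fields)   # {'tag': ('int', 0), 'value': ('float', 4)}
print(width)    # 12
```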

Page 46:

Thank You

Page 47: