introduction to programmingsharif.edu/~sani/courses/compiler/05-introduction-to...alexaiken...

74
Compilers Introduction to Parsing Alex Aiken

Upload: others

Post on 31-Mar-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Compilers

Introduction to Parsing

Alex Aiken

Alex Aiken

Intro to Parsing

• Regular languages

– The weakest formal languages widely used

– Many applications

Alex Aiken

Intro to Parsing

Consider the language:

( i ) i | i 0

Alex Aiken

Intro to Parsing

A 1

1

00

B

What can regular languages express?

Alex Aiken

Intro to Parsing

What can regular languages express?

• Languages requiring counting modulo a fixed

integer

• Intuition: A finite automaton that runs long enough

must repeat states

• Finite automaton can’t remember # of times it has

visited a particular state

Alex Aiken

Intro to Parsing

• Input: sequence of tokens from lexer

• Output: parse tree of the program

(But some parsers never produce a parse tree …)

Intro to Parsing

• Cool

if x = y then 1 else 2fi

• Parser input

IF ID = ID THEN INT ELSE INT FI

• Parser output

=

ID ID

INT

IF-THEN-ELSE

INT

Alex Aiken

Alex Aiken

Intro to Parsing

Phase Input Output

Lexer String of

characters

String of tokens

Parser String of tokens Parse tree(may be implicit)

• Some compilers combine Lexer and Parser

Compilers

Context-Free Grammars

Alex Aiken

Alex Aiken

CFGs

• Not all strings of tokens are programs . ..

• . . . parser must distinguish between valid andinvalid strings of tokens

• We need

– Alanguage for describing valid strings of tokens

– A method for distinguishing valid frominvalid strings of tokens

Alex Aiken

CFGs

• Programming language constructs have recursive

structure

• An EXPRis

if EXPR then EXPR else EXPRfi

while EXPR loop EXPR pool

• Context-free grammars are a natural notation for this recursive structure

Alex Aiken

CFGs

• A CFG consists of

– A set of terminals T

– A set of non-terminals N

– A start symbol S (a non-terminal)

– A set of productions

Alex Aiken

• In these lecture notes

– Non-terminals are written upper-case

– Terminals are written lower-case

– The start symbol is the left-hand side of the first

production

CFGs

Alex AikenAlex Aiken

Consider the language:

S(S)

S N={S}

T={(,)}

CFGs

( i ) i | i 0

Alex Aiken

Read productions as rules:

X Y1…Yn

means Xcan be replaced by Y1…Yn

CFGs

Alex Aiken

1. Begin with a string with only the start symbol S

2. Replace any non-terminal X in the string by the

right-hand side of some production

X Y1…Yn

3. Repeat (2) until there are nonon-terminals in the

string

CFGs

Alex Aiken

More formally, write

if there is a production

CFGs

Alex Aiken

Write

If

In 0 or more steps

CFGs

Alex Aiken

Let G be a context-free grammar with start symbolS.

Then the language of G is:

CFGs

L(G) =

Let G be a context-free grammar with start symbolS.

Then Sentential Forms of Gare:

CFGs

SF(G) = S and (T N)*

Therefore:

L(G) SF(G)

Alex Aiken

• Terminals are so-called because there are no rules

for replacing them

• Once generated, terminals are permanent

• Terminals ought to be tokens of thelanguage

CFGs

Alex Aiken

A fragment of COOL

CFGs

Alex Aiken

CFGs

Some elements of the language:

id

if id then id else id fi

while id loop idpool

if while id loop id pool then id else id

if if id then id else id fi then id else id fi

Alex Aiken

CFGs

Simple arithmetic expressions

Some elements of the language:

CFGs

S aXa

X

| bY

Y

| cXc

abcba

acca

aba

abcbcba

Which of the strings are in

the language of the given

CFG?

CFGs

S aXa

X

| bY

Y

| cXc

abcba

acca

aba

abcbcba

Which of the strings are in

the language of the given

CFG?

Alex Aiken

CFGs

The idea of a CFG is a big step. But:

• Membership in a language is “yes” or “no”; - We also need a parse tree of the input

• Must handle errors gracefully

• Need an implementation of CFG’s (e.g.,Bison)

Alex Aiken

CFGs

• Form of the grammar isimportant

– Many grammars generate the same language

– Tools are sensitive to the grammar

Compilers

Derivations

AlexAiken

AlexAiken

Derivations

A derivation is a sequence ofproductions

S … … … … …

A derivation can be drawn as a tree

– Start symbol is the tree’sroot

– For a production X Y1…Yn add childrenY1…Yn

to node X

AlexAiken

Derivations

• Grammar

E E + E | E E | (E) | id

• String

id id + id

Derivations

E

E + E

E E + E

i d E + E

i d i d + E

i d i d + i d

E

E E+

idE * E

id id

AlexAiken

AlexAiken

Derivations

E

E

Derivations

E

E + E

E

E E

AlexAiken

+

Derivations

E

E + E

E E + E

E

E E+

E * E

AlexAiken

Derivations

E

E + E

E E + E

i d E + E

E

E E+

E * E

id

AlexAiken

Derivations

E

E + E

E E + E

i d E + E

i d i d + E

E

E E+

E * E

id id

AlexAiken

Derivations

E

E + E

E E + E

i d E + E

i d i d + E

i d i d + i d

E

E E+

idE * E

id id

AlexAiken

AlexAiken

Derivations

• A parse treehas

– Terminals at the leaves

– Non-terminals at the interiornodes

• An in-order traversal of the leaves is the original input

• The parse tree shows the association of operations, the

input string doesnot

AlexAiken

• The example is a left-most

derivation

– At each step, replace the left-

most non-terminal

• There is an equivalent

notion of a right-most

derivation

Derivations

AlexAiken

Derivations

E

E

Derivations

E

E + E

E

E E

AlexAiken

+

Derivations

E

E + E

E + id

E

E E+

id

AlexAiken

Derivations

E

E + E

E + i d

E E + id

E

E E+

idE * E

AlexAiken

Derivations

E

E + E

E + i d

E E + id

E i d + i d

E

E E+

idE * E

id

AlexAiken

Derivations

E

E + E

E + i d

E E + id

E i d + i d

i d i d + i d

E

E E+

idE * E

id id

AlexAiken

AlexAiken

Derivations

• Note that right-most and left-most derivations

havethe same parse tree

• The difference is the order in which branches

are added

Derivations

aXa

abYa

acXca

acca

Which of the following is a valid

derivation of the givengrammar?

S

S aXa

X | bY

Y | cXc |d

S

aXa

abYa

abcXca

abcbYca

abcbdca

S

aXa

abYa

abcXcda

abccda

S

aa

Derivations

aXa

abYa

acXca

acca

Which of the following is a valid

derivation of the givengrammar?

S

S aXa

X | bY

Y | cXc |d

S

aXa

abYa

abcXca

abcbYca

abcbdca

S

aXa

abYa

abcXcda

abccda

S

aa

Derivations

a

b

a

S

X

Y

c X c

Which of the following is a valid

parse tree for the givengrammar?

S

a X a

b Y

c X c d

a

Y

a

S

X

b

c X c d

a

b Y

a

S

X

S aXa

X | bY

Y | cXc |d

Derivations

a

b

a

S

X

Y

c X c

Which of the following is a valid

parse tree for the givengrammar?

S

a X a

b Y

c X c d

a

Y

a

S

X

b

c X c d

a

b Y

a

S

X

S aXa

X | bY

Y | cXc |d

AlexAiken

Derivations

• We are not just interested in whether s L(G)

– We need a parse tree for s

• A derivation defines a parsetree

– But one parse tree may have many derivations

• Left-most and right-most derivations areimportant in parser implementation

Compilers

Ambiguity

Alex Aiken

Alex Aiken

Ambiguity

• Grammar

E E+E | E E | (E) | id

• String

id id +id

Ambiguity

This string has two parse trees

E

E * E

id E + E

id id

E

E + E

E * E id

id id

Alex Aiken

Alex Aiken

Ambiguity

• A grammar is ambiguous if it has more than

one parse tree for somestring

– Equivalently, there is more than one right-most

or left-most derivation for somestring

• Ambiguity is BAD

– Leaves meaning of some programs ill-defined

Ambiguity

Which of the following grammars areambiguous?

S SS| a | b

E E+ E| id

S Sa| Sb

E E’ | E’+E

E’ -E’ | id | (E)

Ambiguity

Which of the following grammars areambiguous?

S SS| a | b

E E+ E| id

S Sa| Sb

E E’ | E’+E

E’ -E’ | id | (E)

Alex Aiken

Ambiguity

• There are several ways to handleambiguity

• Most direct method is to rewrite grammar unambiguously

E E’ + E | E’

E’ id E’ | id | (E) E’| (E)

• Enforces precedence of * over+

Ambiguity

E

E * E

id E + E

id id

E

E + E

E * E id

id id

Alex Aiken

Ambiguity

E

E * E

id E + E

id id

E

E + E

E * E id

id id

Alex Aiken

Alex Aiken

Ambiguity

• Consider the grammar

E if E then E

| if E then E else E

| OTHER

Ambiguity

• The expression

if E1 then if E2 then E3 else E4

has two parse trees

Alex Aiken

if

E1 if

E2 E3 E4

if

E2 E3

E1 E4

if

Alex Aiken

Ambiguity

• else matches the closest unmatched then

Ambiguity

• The expression if E1 then if E2 then E3 elseE4

if

E1 if

E2 E3 E4

if

E1 if

E2 E3

E4

Alex Aiken

Ambiguity

• The expression if E1 then if E2 then E3 elseE4

if

E1 if

E2 E3 E4

if

E1 if

E2 E3

E4

Alex Aiken

Ambiguity

S Sa | Sb | S SS’

S’a | b

S Sa | SbS S | S’

S’a | b

Choose the unambiguous version

of the givenambiguous grammar: S SS| a | b |

Ambiguity

S Sa | Sb | S SS’

S’a | b

S Sa | SbS S | S’

S’a | b

Choose the unambiguous version

of the givenambiguous grammar: S SS| a | b |

Alex Aiken

Ambiguity

• Impossible to convert automatically anambiguous

grammar to an unambiguous one

• Used with care, ambiguity can simplify the grammar

– Sometimes allows more natural definitions

– We need disambiguation mechanisms

Alex Aiken

Ambiguity

• Instead of rewriting thegrammar

– Use the more natural (ambiguous) grammar

– Along with disambiguating declarations

• Most tools allow precedence andassociativity

declarations to disambiguate grammars

Ambiguity

E

+ E

intint

• Consider the grammar E E + E | int

• Ambiguous: two parse trees of int + int+ int

E E

E

E + E E

+

+

E int int E

intint

• Left associativity declaration:%left +

Alex Aiken

Ambiguity

E

+ E

intint

• Consider the grammar E E + E| int

• Ambiguous: two parse trees of int + int+ int

E E

E

E + E E

+

+

E int int E

intint

• Left associativity declaration:%left +

Alex Aiken

Ambiguity

E

* E

intint

• Consider the grammar E E + E | E* E | int

– And the string int + int * intE E

E

E * E E +

E int int E+

int int

• Precedence declarations: %left +

%left *

Alex Aiken

Ambiguity

E

* E

intint

• Consider the grammar E E + E | E* E | int

– And the string int + int * intE E

E

E * E E +

E int int E+

int int

• Precedence declarations: %left +

%left *

Alex Aiken