cs-338 compiler design

CS-338 Compiler Design Dr. Syed Noman Hasany Assistant Professor College of Computer, Qassim University

Upload: duyen

Post on 13-Jan-2016



Page 1: CS-338 Compiler Design

CS-338 Compiler Design

Dr. Syed Noman Hasany, Assistant Professor

College of Computer, Qassim University

Page 2: CS-338 Compiler Design

Chapter 3: Lexical Analyzer

• The Role of the Lexical Analyzer:

– It is the first phase of the compiler.

– It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.

– It strips comments and white space (blank, tab, and newline characters) out of the source program.

– It also correlates error messages from the compiler with the source program (because it keeps track of line numbers).

Page 3: CS-338 Compiler Design

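Interaction of the Lexical Analyzer with the Parser

[Figure: the source program feeds the lexical analyzer. The parser sends "get next token" requests to the lexical analyzer, which answers with a (token, tokenval) pair. Both components consult the symbol table, and both can report errors.]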

Page 4: CS-338 Compiler Design

The Reason Why Lexical Analysis is a Separate Phase

• Simplifies the design of the compiler

– LL(1) or LR(1) parsing with 1 token lookahead would not be possible (multiple characters/tokens to match)

• Provides efficient implementation

– Systematic techniques to implement lexical analyzers by hand or automatically from specifications

– Stream buffering methods to scan input

• Improves portability

– Non-standard symbols and alternate character encodings can be normalized (e.g. trigraphs)

Page 5: CS-338 Compiler Design

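Attributes of Tokens

[Figure: the lexical analyzer reads the input below and passes a stream of <token, tokenval> pairs to the parser, where tokenval is the token attribute.]

Input: y := 31 + 28*x

Output: <id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">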
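The token/attribute pairs above can be sketched in C. This is a minimal illustration, not the course's implementation: the names Token, TokenKind, and next_token are hypothetical, and the scanner only knows the five token kinds that occur in the example input.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token kinds for the example input "y := 31 + 28*x". */
typedef enum { TOK_ID, TOK_ASSIGN, TOK_NUM, TOK_PLUS, TOK_STAR, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;   /* the token */
    char text[32];    /* tokenval for ids: the lexeme (truncated to 31 chars) */
    long value;       /* tokenval for nums: the numeric value */
} Token;

/* Scan one token starting at *src and advance *src past it. */
Token next_token(const char **src) {
    Token t = { TOK_EOF, "", 0 };
    const char *p = *src;

    while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* strip white space */

    if (*p == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {             /* letter (letter | digit)* */
        size_t n = 0;
        while (isalnum((unsigned char)*p)) {
            if (n < sizeof t.text - 1) t.text[n++] = *p;
            p++;
        }
        t.text[n] = '\0';
        t.kind = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {             /* digit+ */
        char *end;
        t.value = strtol(p, &end, 10);
        p = end;
        t.kind = TOK_NUM;
    } else if (p[0] == ':' && p[1] == '=') {
        t.kind = TOK_ASSIGN; p += 2;
    } else if (*p == '+') {
        t.kind = TOK_PLUS; p++;
    } else if (*p == '*') {
        t.kind = TOK_STAR; p++;
    } else {
        t.kind = TOK_ERROR; p++;                         /* unknown character */
    }
    *src = p;
    return t;
}
```

Calling next_token repeatedly on a pointer to "y := 31 + 28*x" yields id, assign, num, +, num, *, id, and then TOK_EOF. Only id and num carry an attribute, which matches the empty second component of the other pairs.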

Page 6: CS-338 Compiler Design

Tokens, Patterns, and Lexemes

• A token is a classification of lexical units

– For example: id and num

• Lexemes are the specific character strings that make up a token

– For example: abc and 123

• Patterns are rules describing the set of lexemes belonging to a token

– For example: "letter followed by letters and digits" and "non-empty sequence of digits"

Page 7: CS-338 Compiler Design

Tokens, Patterns, and Lexemes

• A lexeme is a sequence of characters from the source program that is matched by a pattern for a token.

[Figure labels: lexeme, pattern, token]

Page 8: CS-338 Compiler Design

Tokens, Patterns, and Lexemes

Token      Sample Lexemes         Informal Description of Pattern
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "

The token classifies the pattern. The actual values (the lexemes) are critical. This information is:

1. Stored in the symbol table

2. Returned to the parser

Page 9: CS-338 Compiler Design

3.2 Input Buffering

• Examining ways of speeding up reading the source program

– In the one-buffer technique, the last lexeme under process will be overwritten when we reload the buffer.

– The two-buffer scheme handles large lookahead safely.

Page 10: CS-338 Compiler Design

3.2.1 Buffer Pairs

• Two buffers of the same size, say 4096, are alternately reloaded.

• Two pointers to the input are maintained:

– Pointer lexeme_Begin marks the beginning of the current lexeme.

– Pointer forward scans ahead until a pattern match is found.

Page 11: CS-338 Compiler Design

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

Page 12: CS-338 Compiler Design

3.2.2 Sentinels

[Figure: buffer pair with sentinels; an eof marks the end of each half and the end of the input]

E = M * eof C * * 2 eof eof

Page 13: CS-338 Compiler Design

forward := forward + 1;
if forward^ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer half signifies end of input */
        terminate lexical analysis
end
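The sentinel scheme can be sketched in C as follows. This is a simplified illustration under stated assumptions: the names Scanner, scanner_init, scanner_next, and reload are hypothetical, the input comes from an in-memory string instead of a file, the NUL character plays the role of eof (so the input must not contain NUL), and the half size is 4 instead of 4096 so that reloads are easy to observe. Retracting to lexeme_Begin is not handled.

```c
#include <stddef.h>
#include <string.h>

#define N 4              /* half size; the slides suggest 4096 */
#define SENTINEL '\0'    /* stands in for eof; assumes no NUL in the input */

typedef struct {
    char buf[2 * N + 2]; /* buf[0..N-1] first half, buf[N] sentinel,
                            buf[N+1..2N] second half, buf[2N+1] sentinel */
    char *forward;       /* next character to read */
    const char *src;     /* unread input (in-memory stand-in for a file) */
} Scanner;

/* Fill one half with up to N characters and put a sentinel right after them. */
static void reload(Scanner *s, char *half) {
    size_t n = strlen(s->src);
    if (n > N) n = N;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;  /* lands on the end-of-half slot when the half is full */
    s->src += n;
}

void scanner_init(Scanner *s, const char *input) {
    s->src = input;
    s->buf[N] = SENTINEL;
    s->buf[2 * N + 1] = SENTINEL;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next character, or -1 at the real end of input. */
int scanner_next(Scanner *s) {
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {                 /* common case: one test per character */
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == s->buf + N) {      /* end of first half */
            reload(s, s->buf + N + 1);
            s->forward = s->buf + N + 1;
        } else if (s->forward == s->buf + 2 * N + 1) {   /* end of second half */
            reload(s, s->buf);
            s->forward = s->buf;
        } else {
            return -1;                       /* sentinel inside a half: end of input */
        }
    }
}
```

The point of the sentinel is visible in scanner_next: for ordinary characters only one comparison is made, and the two end-of-half checks run only when a sentinel is actually seen.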

Page 14: CS-338 Compiler Design

Specification of Patterns for Tokens: Definitions

• An alphabet Σ is a finite set of symbols (characters)

• A string s is a finite sequence of symbols from Σ

– |s| denotes the length of string s

– ε denotes the empty string, thus |ε| = 0

• A language is a specific set of strings over some fixed alphabet Σ

Page 15: CS-338 Compiler Design

Specification of Patterns for Tokens: String Operations

• The concatenation of two strings x and y is denoted by xy

• The exponentiation of a string s is defined by

s^0 = ε (empty string: a string of length zero)

s^i = s^(i-1) s for i > 0

note that sε = εs = s
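As a small illustration of string exponentiation, the following C sketch builds s^i by repeated concatenation. The function name str_power is hypothetical; the empty C string "" plays the role of ε.

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of s^i (s concatenated i times).
   The caller frees the result; NULL is returned if allocation fails. */
char *str_power(const char *s, unsigned i) {
    size_t len = strlen(s);
    char *out = malloc(len * i + 1);
    if (out == NULL) return NULL;
    for (unsigned k = 0; k < i; k++)   /* s^k = s^(k-1) s */
        memcpy(out + k * len, s, len);
    out[len * i] = '\0';               /* for i = 0 this leaves the empty string */
    return out;
}
```

With this sketch, str_power("ab", 3) gives "ababab" and str_power("ab", 0) gives the empty string.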

Page 16: CS-338 Compiler Design

Specification of Patterns for Tokens: Language Operations

• Union: L ∪ M = {s | s ∈ L or s ∈ M}

• Concatenation: LM = {xy | x ∈ L and y ∈ M}

• Exponentiation: L^0 = {ε}; L^i = L^(i-1) L

• Kleene closure: L* = ∪ (i = 0, …, ∞) L^i

• Positive closure: L+ = ∪ (i = 1, …, ∞) L^i

Page 17: CS-338 Compiler Design

Language Operations Examples

L = {A, B, C, D}    D = {1, 2, 3}

L ∪ D = {A, B, C, D, 1, 2, 3}

LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}

L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}

L^4 = L^2 L^2 = ??

L* = {all possible strings over L, plus ε}

L+ = L* − {ε}

L (L ∪ D) = ??

L (L ∪ D)* = ??

Page 18: CS-338 Compiler Design

Specification of Patterns for Tokens: Regular Expressions

• Basis symbols:

– ε is a regular expression denoting language {ε}

– a ∈ Σ is a regular expression denoting {a}

• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then

– r|s is a regular expression denoting L(r) ∪ M(s)

– rs is a regular expression denoting L(r)M(s)

– r* is a regular expression denoting L(r)*

– (r) is a regular expression denoting L(r)

• A language defined by a regular expression is called a regular set

Page 19: CS-338 Compiler Design

• Examples:

– Let Σ = {a, b}

a | b

(a | b)(a | b)

a*

(a | b)*

a | a*b

– We assume that '*' has the highest precedence and is left associative. Concatenation has the second highest precedence and is left associative, and '|' has the lowest precedence and is left associative.

• (a) | ((b)*(c)) = a | b*c

Page 20: CS-338 Compiler Design

Algebraic Properties of Regular Expressions

AXIOM                          DESCRIPTION
r | s = s | r                  | is commutative
r | (s | t) = (r | s) | t      | is associative
(r s) t = r (s t)              concatenation is associative
r (s | t) = r s | r t          concatenation distributes over |
(s | t) r = s r | t r
εr = rε = r                    ε is the identity element for concatenation
r* = (r | ε)*                  relation between * and ε
r** = r*                       * is idempotent

Page 21: CS-338 Compiler Design

Finite Automaton

• Given an input string, we need a "machine" that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by the regular expression or not.

• A machine that determines whether a given string belongs to a language is called a finite automaton.

Page 22: CS-338 Compiler Design

Deterministic Finite Automaton

• Definition: Deterministic Finite Automaton

– a five-tuple (Σ, S, δ, s0, F) where

• Σ is the alphabet

• S is the set of states

• δ is the transition function (S × Σ → S)

• s0 is the starting state

• F is the set of final states (F ⊆ S)

• Notation:

– Use a transition diagram to describe a DFA

• states are nodes, transitions are directed, labeled edges, some states are marked as final, one state is marked as starting

• If the automaton stops at a final state on end of input, then the input string belongs to the language.
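The five-tuple maps naturally onto a table-driven simulator. The C sketch below is an illustration, not part of the original slides: the names Dfa, A_THEN_ANY, and dfa_accepts are hypothetical, the alphabet is fixed to {a, b}, and state 0 is an added dead state that absorbs transitions the diagram leaves undefined. The example automaton recognizes a(a|b)* with S = {1,2}, s0 = 1, F = {2}.

```c
#define NUM_STATES 3    /* state 0 is a hypothetical dead state */
#define NUM_SYMBOLS 2   /* alphabet {a, b} */

typedef struct {
    int delta[NUM_STATES][NUM_SYMBOLS];  /* transition function */
    int start;                           /* s0 */
    int is_final[NUM_STATES];            /* membership in F */
} Dfa;

/* DFA for a(a|b)*: delta(1,a)=2, delta(2,a)=2, delta(2,b)=2. */
static const Dfa A_THEN_ANY = {
    { {0, 0},     /* dead state: stays dead on a and b */
      {2, 0},     /* state 1: a -> 2, b -> dead */
      {2, 2} },   /* state 2: a -> 2, b -> 2 */
    1,
    {0, 0, 1}
};

/* Return 1 if the automaton is in a final state at end of input, else 0. */
int dfa_accepts(const Dfa *m, const char *input) {
    int state = m->start;
    for (const char *p = input; *p != '\0'; p++) {
        if (*p != 'a' && *p != 'b') return 0;   /* symbol outside the alphabet */
        state = m->delta[state][*p - 'a'];
    }
    return m->is_final[state];
}
```

For example, dfa_accepts(&A_THEN_ANY, "abba") returns 1, while the empty string and "ba" are rejected. Swapping in a different table and final-state set is enough to simulate any other DFA over {a, b} with at most two live states.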

Page 23: CS-338 Compiler Design

① a

Σ = {a}

L = {a}

S = {1,2}

δ(1,a)=2

s0 = 1

F = {2}

Page 24: CS-338 Compiler Design

② a|b

Σ = {a,b}

L = {a,b}

S = {1,2}

δ(1,a)=2, δ(1,b)=2

s0 = 1

F = {2}

Page 25: CS-338 Compiler Design

③ a(a|b)

Σ = {a,b}

L = {aa,ab}

S = {1,2,3}

δ(1,a)=2, δ(2,a)=3, δ(2,b)=3

s0 = 1

F = {3}

Page 26: CS-338 Compiler Design

④ a*

Σ = {a}

L = {ε,a,aa,aaa,aaaa,…}

S = {1}

δ(1,ε)=1, δ(1,a)=1

s0 = 1

F = {1}

Page 27: CS-338 Compiler Design

⑤ a⁺

Σ = {a}

L = {a,aa,aaa,aaaa,…}

S = {1,2}

δ(1,a)=2, δ(2,a)=2

s0 = 1

F = {2}

Note: a⁺ = aa*

Page 28: CS-338 Compiler Design

⑥ (a|b)(a|b)b

Σ = {a,b}

L = {aab,abb,bab,bbb}

S = {1,2,3,4}

δ(1,a)=2, δ(1,b)=2, δ(2,a)=3, δ(2,b)=3, δ(3,b)=4

s0 = 1

F = {4}

Page 29: CS-338 Compiler Design

⑦ (a|b)*

Σ = {a,b}

L = {ε,a,b,aa,bb,ba,ab,aaa,…,bbb,…,abab,…,baba,bbba,…}

S = {1}

δ(1,a)=1, δ(1,b)=1

s0 = 1

F = {1}

Page 30: CS-338 Compiler Design

⑧ (a|b)⁺

Σ = {a,b}

L = {a,b,aa,ab,ba,bb,aaa,…,bbb,…}

S = {1,2}

δ(1,a)=2, δ(1,b)=2, δ(2,a)=2, δ(2,b)=2

s0 = 1

F = {2}

Note: (a|b)⁺ = (a|b)(a|b)*

Page 31: CS-338 Compiler Design

⑨ a⁺|b⁺

Σ = {a,b}

L = {a,aa,aaa,…,b,bb,bbb,…}

S = {1,2,3}

δ(1,a)=2, δ(2,a)=2, δ(1,b)=3, δ(3,b)=3

s0 = 1

F = {2,3}

Page 32: CS-338 Compiler Design

⑩ a(a|b)*

Σ = {a,b}

L = {a,aa,ab,…,aba,…,abb,…,abbb,…,ababab,…}

S = {1,2}

δ(1,a)=2, δ(2,a)=2, δ(2,b)=2

s0 = 1

F = {2}

Page 33: CS-338 Compiler Design

⑪ a(b|a)b⁺

Σ = {a,b}

L = {aab,abb,aabb,…,abbb,abbbb,…}

S = {1,2,3,4}

δ(1,a)=2, δ(2,a)=3, δ(2,b)=3, δ(3,b)=4, δ(4,b)=4

s0 = 1

F = {4}

Page 34: CS-338 Compiler Design

⑫ ab*a(a⁺|b⁺)

Σ = {a,b}

L = {aaa,aab,abaa,abbaa,…,abbab,abbabbb,…}

S = {1,2,3,4,5}

δ(1,a)=2, δ(2,b)=2, δ(2,a)=3, δ(3,a)=4, δ(4,a)=4, δ(3,b)=5, δ(5,b)=5

s0 = 1

F = {4,5}

Page 35: CS-338 Compiler Design

Specification of Patterns for Tokens: Regular Definitions

• Regular definitions introduce a naming convention:

d1 → r1

d2 → r2

…

dn → rn

where each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}

• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions

Page 36: CS-338 Compiler Design

Specification of Patterns for Tokens: Regular Definitions

• Example:

letter → A | B | … | Z | a | b | … | z

digit → 0 | 1 | … | 9

id → letter ( letter | digit )*

• Regular definitions are not recursive:

digits → digit digits | digit    wrong!

Page 37: CS-338 Compiler Design

Specification of Patterns for Tokens: Notational Shorthand

• The following shorthands are often used:

r+ = rr*

r? = r | ε

[a-z] = a | b | c | … | z

• Examples:

digit → [0-9]

num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
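The num definition can be checked by a small hand-written recognizer. The following C sketch is one possible reading of digit+ (. digit+)? ( E (+ | -)? digit+ )?; the function names digits and is_num are hypothetical, and the function tests whether a whole string is a num rather than scanning a prefix.

```c
#include <ctype.h>

/* Consume digit+ at *p. Return 1 and advance *p on success, else 0. */
static int digits(const char **p) {
    const char *q = *p;
    while (isdigit((unsigned char)*q)) q++;
    if (q == *p) return 0;      /* digit+ needs at least one digit */
    *p = q;
    return 1;
}

/* Return 1 if the whole string s matches num, else 0. */
int is_num(const char *s) {
    const char *p = s;
    if (!digits(&p)) return 0;              /* digit+ */
    if (*p == '.') {                        /* (. digit+)? */
        p++;
        if (!digits(&p)) return 0;
    }
    if (*p == 'E') {                        /* ( E (+ | -)? digit+ )? */
        p++;
        if (*p == '+' || *p == '-') p++;
        if (!digits(&p)) return 0;
    }
    return *p == '\0';
}
```

Under this reading, "3.1416", "0", and "6.02E23" are accepted, while "1.", ".5", and "1E" are rejected because each optional part, once started, requires at least one digit.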

Page 38: CS-338 Compiler Design

Regular Definitions and Grammars

Grammar:

stmt → if expr then stmt
     | if expr then stmt else stmt

expr → term relop term
     | term

term → id
     | num

Regular definitions:

if → if

then → then

else → else

relop → < | <= | <> | > | >= | =

id → letter ( letter | digit )*

num → digit+ (. digit+)? ( E (+ | -)? digit+ )?

Page 39: CS-338 Compiler Design

Constructing Transition Diagrams for Tokens

• Transition Diagrams (TD) are used to represent the tokens – these are automatons!

• As characters are read, the relevant TDs are used to attempt to match lexeme to a pattern

• Each TD has:

• States : Represented by Circles

• Actions : Represented by Arrows between states

• Start State : Beginning of a pattern (Arrowhead)

• Final State(s) : End of pattern (Concentric Circles)

• Each TD is Deterministic - No need to choose between 2 different actions !

Page 40: CS-338 Compiler Design

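Example: All RELOPs

[Transition diagram, listed as a table. The start state is 0. States marked * retract the input by one character, as retract(1) does in the code for states 11 and 27.]

from   input   to    action
0      <       1
1      =       2     return(relop, LE)
1      >       3     return(relop, NE)
1      other   4*    return(relop, LT)
0      =       5     return(relop, EQ)
0      >       6
6      =       7     return(relop, GE)
6      other   8*    return(relop, GT)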
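The relop diagram translates almost directly into code. The C sketch below is an illustration only: the names Relop and match_relop are hypothetical, the state numbers in the comments follow the usual numbering of this diagram (0 to 8), and instead of an explicit retract the function reports how many characters belong to the lexeme.

```c
typedef enum { NOT_RELOP, LT, LE, EQ, NE, GT, GE } Relop;

/* Match a relational operator at the start of s.
   *len receives the lexeme length. The starred states of the diagram,
   which retract one character, show up here as a length of 1. */
Relop match_relop(const char *s, int *len) {
    switch (s[0]) {                                   /* state 0 */
    case '<':                                         /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }     /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }     /* state 3 */
        *len = 1; return LT;                          /* state 4, retract */
    case '=':
        *len = 1; return EQ;                          /* state 5 */
    case '>':                                         /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }     /* state 7 */
        *len = 1; return GT;                          /* state 8, retract */
    default:
        *len = 0; return NOT_RELOP;                   /* no relop here */
    }
}
```

For instance, match_relop("<=x", &n) returns LE with n set to 2, and match_relop("<x", &n) returns LT with n set to 1, leaving the x for the next token.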

Page 41: CS-338 Compiler Design

Example TDs: id and delim

Keyword or id (start state 9):

from   input             to     action
9      letter            10
10     letter or digit   10
10     other             11*    return(install_id(), gettoken())

delim (start state 28):

from   input   to
28     delim   29
29     delim   29
29     other   30*

Page 42: CS-338 Compiler Design

Combine TD for KW and IDs

• install_id(): decides the attribute

– It checks the accepted lexeme against the list of keywords; if it matches, zero is returned.

– Otherwise it looks the lexeme up in the symbol table; if it is found, the address is returned.

– If the lexeme is not found in the symbol table, install_id() first installs the ID in the symbol table and returns the address of the newly created entry.

• gettoken(): decides the token

– If zero was returned by install_id(), the same word (or its numeric form) is returned as the token.

– Otherwise token "ID" is returned.
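The behavior described for install_id() and gettoken() can be sketched in C. Several details here are assumptions made for illustration: the functions take the lexeme as a parameter (the slides call them without arguments), the token codes ID, IF, THEN, and ELSE are hypothetical, the symbol table is a small fixed array whose index serves as the "address", and index 0 is left unused so that 0 can mean "keyword".

```c
#include <string.h>

enum { ID = 256, IF, THEN, ELSE };            /* hypothetical token codes */

#define NUM_KEYWORDS 3
static const char *const keywords[NUM_KEYWORDS] = { "if", "then", "else" };
static const int keyword_tokens[NUM_KEYWORDS] = { IF, THEN, ELSE };

#define MAX_SYMBOLS 64
static char symtab[MAX_SYMBOLS][32];          /* toy symbol table; slot 0 unused */
static int symcount = 1;

/* Return 0 for a keyword, otherwise the symbol-table index of the lexeme,
   installing it first if needed. Returns -1 when the table is full.
   Lexemes longer than 31 characters are truncated in this sketch. */
int install_id(const char *lexeme) {
    for (int i = 0; i < NUM_KEYWORDS; i++)
        if (strcmp(lexeme, keywords[i]) == 0) return 0;
    for (int i = 1; i < symcount; i++)
        if (strcmp(lexeme, symtab[i]) == 0) return i;
    if (symcount == MAX_SYMBOLS) return -1;
    strncpy(symtab[symcount], lexeme, sizeof symtab[0] - 1);
    symtab[symcount][sizeof symtab[0] - 1] = '\0';
    return symcount++;
}

/* Return the keyword's own token if install_id() reported 0, else ID. */
int gettoken(const char *lexeme, int installed) {
    if (installed == 0)
        for (int i = 0; i < NUM_KEYWORDS; i++)
            if (strcmp(lexeme, keywords[i]) == 0) return keyword_tokens[i];
    return ID;
}
```

With this sketch, the lexeme "if" gives attribute 0 and token IF, while "count" is installed on first sight and returns the same index on every later lookup, with token ID.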


Page 43: CS-338 Compiler Design

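Example TDs: Unsigned #s

[Three transition diagrams, listed as tables. States marked * retract the input by one character.]

Number with optional fraction and exponent (start state 12):

from   input    to
12     digit    13
13     digit    13
13     .        14
13     E        16
14     digit    15
15     digit    15
15     E        16
16     + | -    17
16     digit    18
17     digit    18
18     digit    18
18     other    19*

Number with fraction (start state 20):

from   input    to
20     digit    21
21     digit    21
21     .        22
22     digit    23
23     digit    23
23     other    24*

Integer (start state 25):

from   input    to
25     digit    26
26     digit    26
26     other    27*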

Questions: Is ordering important for unsigned #s ?

Why are there no TDs for then, else, if ?

Page 44: CS-338 Compiler Design

Keywords Recognition

All keywords / reserved words are matched as ids

• After the match, the symbol table or a special keyword table is consulted

• Keyword table contains string versions of all keywords and associated token values

if       0
begin    0
then     0
...      ...

• If a match is not found, then it is assumed that an id has been discovered

Page 45: CS-338 Compiler Design

Transition Diagrams & Lexical Analyzers

state = 0;

token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();   /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;   /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;

        … /* cases 1-8 here */

Page 46: CS-338 Compiler Design

        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1);   /* give back the lookahead character */
            install_id();
            return ( gettoken() );

        … /* cases 12-24 here */

        case 25:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = fail();
            break;
        case 26:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = 27;
            break;
        case 27:
            retract(1);   /* give back the lookahead character */
            install_num();
            return ( NUM );
        }
    }
}

Case numbers correspond to transition diagram states !

Page 47: CS-338 Compiler Design

When Failures Occur:

int state = 0, start = 0;
int lexical_value;   /* to "return" second component of token */

int fail()
{
    forward = token_beginning;   /* rescan the lexeme with the next diagram */
    switch (start) {
    case 0: start = 9; break;
    case 9: start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover(); break;
    default: /* compiler error */ break;
    }
    return start;
}

Page 48: CS-338 Compiler Design

Using a Lex Generator

[Figure: three processing steps]

Lex source program (lex.l) → Lex compiler → lex.yy.c

lex.yy.c → C compiler → a.out

Input stream (input.c) → a.out → sequence of tokens