cs-338 compiler design

CS-338Compiler Design

Dr. Syed Noman HasanyAssistant Professor

College of Computer, Qassim University

Chapter 3: Lexical Analyzer

• THE ROLE OF LEXICAL ANALYSER :

It is the first phase of the compiler. It reads the input characters and produces as output a

sequence of tokens that the parser uses for syntax analysis.

It strips out from the source program comments and white spaces in the form of blank , tab and newline characters .

It also correlates error messages from the compiler with the source program (because it keeps track of line numbers).

2

3

Interaction of the Lexical Analyzer with the Parser

LexicalAnalyzer

ParserSource

Program

Token,tokenval

Symbol Table

Get nexttoken

error error

4

The Reason Why Lexical Analysis is a Separate Phase

• Simplifies the design of the compiler– LL(1) or LR(1) parsing with 1 token lookahead would

not be possible (multiple characters/tokens to match)

• Provides efficient implementation– Systematic techniques to implement lexical analyzers

by hand or automatically from specifications– Stream buffering methods to scan input

• Improves portability– Non-standard symbols and alternate character

encodings can be normalized (e.g. trigraphs)

5

Attributes of Tokens

Lexical analyzer

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

y := 31 + 28*x

Parser

token

tokenval(token attribute)

6

Tokens, Patterns, and Lexemes

• A token is a classification of lexical units– For example: id and num

• Lexemes are the specific character strings that make up a token– For example: abc and 123

• Patterns are rules describing the set of lexemes belonging to a token– For example: “letter followed by letters and digits” and

“non-empty sequence of digits”


• A lexeme is a sequence of characters from the source program that is matched by a pattern for a token.

7

lexeme Pattern Token


Token Sample Lexemes Informal Description of Pattern

const

if

relation

id

num

literal

const

if

<, <=, =, < >, >, >=

pi, count, D2

3.1416, 0, 6.02E23

“core dumped”

const

if

< or <= or = or < > or >= or >

letter followed by letters and digits

any numeric constant

any characters between “ and “ except “

Classifies Pattern

Actual values are critical. Info is :

1. Stored in symbol table2. Returned to parser

• Examining ways of speeding reading the source program– In one buffer technique, the last lexeme under process will be over-written when we

reload the buffer.

– Two-buffer scheme handling large look ahead safely

3.2 Input Buffering

• Two buffers of the same size, say 4096, are alternately reloaded.• Two pointers to the input are maintained:

– Pointer lexeme_Begin marks the beginning of the current lexeme.

– Pointer forward scans ahead until a pattern match is found.

3.2.1 Buffer Pairs

If forward at end of first half then begin

reload second half;

forward:=forward + 1;

End

Else

if forward at end of second half then begin

reload first half;

move forward to beginning of first half

End

Else forward:=forward + 1;

3.2.2 Sentinels

E = M * eof C * * 2 eof eof

forward:=forward+1;

If forward ^ = EOF then begin

If forward at end of first half then begin

reload second half;

forward:=forward + 1;

End

Else if forward at end of second half then begin

reload first half;

move forward to beginning of first half

End

Else terminate lexical analysis;

14

Specification of Patterns for Tokens: Definitions

• An alphabet is a finite set of symbols (characters)

• A string s is a finite sequence of symbols from s denotes the length of string s denotes the empty string, thus = 0

• A language is a specific set of strings over some fixed alphabet

15

Specification of Patterns for Tokens: String Operations

• The concatenation of two strings x and y is denoted by xy

• The exponentation of a string s is defined by

s0 = (Empty string: a string of length zero)

si = si-1s for i > 0

note that s = s = s

16

Specification of Patterns for Tokens: Language Operations

• UnionL M = {s s L or s M}

• ConcatenationLM = {xy x L and y M}

• ExponentiationL0 = {}; Li = Li-1L

• Kleene closureL* = i=0,…, Li

• Positive closureL+ = i=1,…, Li

Language Operations ExamplesL = {A, B, C, D } D = {1, 2, 3}

L D = {A, B, C, D, 1, 2, 3 }

LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }

L2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}

L4 = L2 L2 = ??

L* = { All possible strings of L plus }

L+ = L* -

L (L D ) = ??

L (L D )* = ??

18

Specification of Patterns for Tokens: Regular Expressions

• Basis symbols: is a regular expression denoting language {}– a is a regular expression denoting {a}

• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then– rs is a regular expression denoting L(r) M(s)– rs is a regular expression denoting L(r)M(s)– r* is a regular expression denoting L(r)*

– (r) is a regular expression denoting L(r)

• A language defined by a regular expression is called a regular set

• Examples:– let

a | b

(a | b) (a | b)

a *

(a | b)*

a | a*b

– We assume that ‘*’ has the highest precedence and is left associative. Concatenation has second highest precedence and is left associative and ‘|’ has the lowest precedence and is left associative

• (a) | ((b)*(c ) ) = a | b*c

},{ ba

Algebraic Properties of Regular Expressions

AXIOM DESCRIPTION

r | s = s | r

r | (s | t) = (r | s) | t

(r s) t = r (s t)

r = rr = r

r* = ( r | )*

r ( s | t ) = r s | r t( s | t ) r = s r | t r

r** = r*

| is commutative

| is associative

concatenation is associative

concatenation distributes over |

relation between * and

Is the identity element for concatenation

* is idempotent

Finite Automaton• Given an input string, we need a “machine”

that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by the regular expression or not.

• A machine that determines whether a given string belongs to a language is called a finite automaton.

Deterministic Finite Automaton• Definition: Deterministic Finite Automaton

– a five-tuple (, S, , s0, F) where is the alphabet• S is the set of states is the transition function (SS)• s0 is the starting state• F is the set of final states (F S)

• Notation: – Use a transition diagram to describe a DFA

• states are nodes, transitions are directed, labeled edges, some states are marked as final, one state is marked as starting

• If the automaton stops at a final state on end of input, then the input string belongs to the language.

① a

={a}

L= {a}

S = {1,2}

(1,a)=2

S0 = 1

F = {2}

② a|b

={a,b}

L = {a,b}

S = {1,2}

(1,a)=2, (1,b)=2

S0 = 1

F = {2}

③ a(a|b)

={a,b}L = {aa,ab}S = {1,2,3} (1,a)=2, (2,a)=3, (2,b)=3

S0 = 1F = {3}

④ a*

= {a}

L = {,a,aa,aaa,aaaa,…}

S = {1}

(1, )=1, (1,a)=1

S0 = 1

F = {1}

⑤a⁺

={a}L = {a,aa,aaa,aaaa,…}S = {1,2} (1,a)=2, (2,a)=2

S0 = 1F = {2}Note: a =aa*⁺

⑥ (a|b)(a|b)b

= {a,b}L = {aab,abb,bab,bbb}S = {1,2,3,4}(1,a)=2, (1,b)=2, (2,a)=3, (2,b)=3, (3,b)=4

S0 = 1F = {4}

⑦ (a|b)*

={a,b}L={,a,b,aa,bb,ba,ab,aaa,…,bbb,…,abab,

…,baba,bbba,…,…}S = {1} (1,a)=1, (1,b)=1

S0 = 1F = {1}

⑧ (a|b)⁺

={a,b}

L = {a,aa,aaa,…,b,bb,bbb,…}

S = {1,2}

(1,a)=2, (1,b)=2, (2,a)=2, (2,b)=2

S0 = 1

F = {2}

Note: (a|b) =(a|b)(a|b)*⁺

⑨a |b⁺ ⁺

={a,b}

L = {a,aa,aaa,…,b,bb,bbb,…}

S = {1,2,3}

(1,a)=2, (2,a)=2, (1,b)=3, (3,b)=3

S0 = 1

F = {2,3}

⑩a(a|b)*

={a,b}

L={a,aa,ab,…,aba,…,abb,…,baa,abbb,…,bababa,…}

S = {1,2}

(1,a)=2, (2,a)=2, (2,b)=2

S0 = 1

F = {2}

⑪a(b|a)b⁺

={a,b}L = {aab,abb,aabb,…,abbb,abbbb,…}S ={1,2,3,4} (1,a)=2, (2,a)=3, (2,b)=3, (3,b)=4,(4,b)=4

S0 = 1F = {4}

⑫ ab*a(a |b )⁺ ⁺

={a,b}L = {aaa,aab,abaa,abbaa,…,abbab,abbabbb,…}S = {1,2,3,4,5} (1,a)=2, (2,b)=2, (2,a)=3, (3,a)=4, (4,a)=4, (3,b)=5, (5,b)=5

S0 = 1F = {4,5}

35

Specification of Patterns for Tokens: Regular Definitions

• Regular definitions introduce a naming convention: d1 r1

d2 r2

…dn rn

where each ri is a regular expression over {d1, d2, …, di-1 }

• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions

36

Specification of Patterns for Tokens: Regular Definitions

• Example:

letter AB…Zab…z digit 01…9 id letter ( letterdigit )*

• Regular definitions are not recursive:

digits digit digitsdigit wrong!

37

Specification of Patterns for Tokens: Notational Shorthand

• The following shorthands are often used:

r+ = rr*

r? = r[a-z] = abc…z

• Examples:digit [0-9]num digit+ (. digit+)? ( E (+-)? digit+ )?

38

Regular Definitions and Grammars

stmt if expr then stmt if expr then stmt else stmt

expr term relop term termterm id num

if if then then else elserelop < <= <> > >= = id letter ( letter | digit )*

num digit+ (. digit+)? ( E (+-)? digit+ )?

Grammar

Regular definitions

Constructing Transition Diagrams for Tokens

• Transition Diagrams (TD) are used to represent the tokens – these are automatons!

• As characters are read, the relevant TDs are used to attempt to match lexeme to a pattern

• Each TD has:

• States : Represented by Circles

• Actions : Represented by Arrows between states

• Start State : Beginning of a pattern (Arrowhead)

• Final State(s) : End of pattern (Concentric Circles)

• Each TD is Deterministic - No need to choose between 2 different actions !

Example : All RELOPsstart <

0

other

=6 7

8

return(relop, LE)

5

4

>

=1 2

3

other

>

=

*

*

return(relop, NE)

return(relop, LT)

return(relop, EQ)

return(relop, GE)

return(relop, GT)

Example TDs : id and delim

Keyword or id :

delim :

start delim28

other3029

delim

*

return(install_id(), gettoken())

start letter9

other1110

letter or digit

*

Combine TD for KW and IDs

• Install_id(): decides for the attribute– It will check the accepted lexeme in the list of keywords; if it is

matched, zero is returned.

– Otherwise checks the lexeme in symbol table, if it is found, the address is returned.

– If the lexeme not found in symbol table, install_id() first installs the ID in the symbol table and return the address of the newly created entry.

• Gettoken(): decides for the token– If zero returned by install_id(), the same word(or its numeric

form) is returned as token

– Otherwise token “ID” is returned.

42

Example TDs : Unsigned #s

1912 1413 1615 1817start otherdigit . digit E + | - digit

digit

digit

digit

E

digit

*

start digit25

other2726

digit

*

start digit20

* .21

digit

24other

23

digit

digit22

*

Questions: Is ordering important for unsigned #s ?

Why are there no TDs for then, else, if ?

Keywords RecognitionAll Keywords / Reserved words are matched as ids

• After the match, the symbol table or a special keyword table is consulted

• Keyword table contains string versions of all keywords and associated token values

if

begin

then

0

0

0

... ...

• If a match is not found, then it is assumed that an id has been discovered

Transition Diagrams & Lexical Analyzers

state = 0;

token nexttoken()

{ while(1) {

switch (state) {

case 0: c = nextchar();

/* c is lookahead character */

if (c== blank || c==tab || c== newline) {

state = 0;

lexeme_beginning++;

/* advance beginning of lexeme */

}

else if (c == ‘<‘) state = 1;

else if (c == ‘=‘) state = 5;

else if (c == ‘>’) state = 6;

else state = fail();

break;

… /* cases 1-8 here */

case 9: c = nextchar();

if (isletter(c)) state = 10;


break;

case 10; c = nextchar();

if (isletter(c)) state = 10;

else if (isdigit(c)) state = 10;

else state = 11;

break;

case 11; retract(1); install_id();

return ( gettoken() );

… /* cases 12-24 here */


if (isdigit(c)) state = 26;


break;


if (isdigit(c)) state = 26;

else state = 27;

break;

case 27; retract(1); install_num();

return ( NUM ); } } }

Case numbers correspond to transition diagram states !

When Failures Occur:int state = 0, start = 0;

Int lexical_value;

/* to “return” second component of token */

Init fail()

{

forward = token_beginning;

switch (start) {

case 0: start = 9; break;




case 25: recover(); break;

default: /* compiler error */

}

return start;

}

Using a Lex Generator

Lex source prog lex.yy.c lex.l

lex.yy.c a.out

Input stream sequence of input.c tokens

Lex

Compiler

C

compiler

a.out

cs-338 compiler design

Documents