Lexical Analysis

Upload: kailash-borse

Post on 05-Apr-2018


  • 8/2/2019 Lexical Analysis1


    Lexical Analysis

    Recognize tokens and ignore white spaces, comments

    Error reporting

    Model using regular expressions

    Recognize using Finite State Automata

    Generate a token stream


    Lexical Analysis

    Sentences consist of a string of tokens (a syntactic category),
    for example: number, identifier, keyword, string

    A sequence of characters in a token is a lexeme,
    for example: 100.01, counter, const, How are you?

    The rule of description is a pattern,
    for example: letter(letter|digit)*

    Discard whatever does not contribute to parsing, like white
    space (blanks, tabs, newlines) and comments

    Construct constants: convert numbers to token num and pass the
    number as its attribute, for example integer 31 becomes <num, 31>

    Recognize keywords and identifiers, for example
    counter = counter + increment becomes id = id + id
    /*check if id is a keyword*/


    Interface to other phases

    Push back is required due to lookahead, for example >= and >

    It is implemented through a buffer: keep the input in a buffer
    and move pointers over the input

    [Diagram: the Syntax Analyzer asks the Lexical Analyzer for a
    token; the Lexical Analyzer reads characters from the input,
    pushes back extra characters, and returns the token]
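As a minimal sketch of the pushback idea, the following C fragment distinguishes ">=" from ">" using a one-character lookahead; the helper names (nextc, pushback, scan_gt) are illustrative, and a string stands in for the input buffer:

```c
#include <stdio.h>

/* Sketch: one-character lookahead with pushback over a
 * string acting as the input buffer. */
static const char *inp;   /* input buffer    */
static int pos;           /* forward pointer */

static int nextc(void) {
    return inp[pos] ? inp[pos++] : EOF;
}

static void pushback(void) {   /* move the pointer back one char */
    pos--;
}

/* Distinguishes ">=" from ">": returns 1 for ">=", 2 for ">"
 * (after pushing back the extra character), 0 otherwise. */
int scan_gt(const char *s) {
    int c;
    inp = s; pos = 0;
    if (nextc() != '>') return 0;
    c = nextc();
    if (c == '=') return 1;        /* lexeme ">=" */
    if (c != EOF) pushback();      /* extra character goes back */
    return 2;                      /* lexeme ">"  */
}
```

Real analyzers use a two-pointer buffer rather than a bare string, but the retract-one-character step is the same.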


    Approaches to implementation

    Use assembly language: most efficient but most difficult to
    implement

    Use high level languages like C: efficient but difficult to
    implement

    Use tools like lex, flex: easy to implement but not as efficient
    as the first two cases


    Construct a lexical analyzer

    Allow white spaces, numbers and arithmetic operators in an
    expression

    Return tokens and attributes to the syntax analyzer

    A global variable tokenval is set to the value of the number

    Design requires that a finite set of tokens be defined and that
    the strings belonging to each token be described


    #include <stdio.h>
    #include <ctype.h>

    int lineno = 1;
    int tokenval = NONE;

    int lex() {
        int t;
        while (1) {
            t = getchar();
            if (t == ' ' || t == '\t')
                ;                       /* skip blanks and tabs */
            else if (t == '\n')
                lineno = lineno + 1;
            else if (isdigit(t)) {
                tokenval = t - '0';
                t = getchar();
                while (isdigit(t)) {
                    tokenval = tokenval * 10 + t - '0';
                    t = getchar();
                }
                ungetc(t, stdin);
                return num;     /* num and NONE are token codes
                                   assumed to be defined elsewhere */
            }
            else { tokenval = NONE; return t; }
        }
    }


    Problems

    Scans text character by character

    The lookahead character determines what kind of token to read
    and when the current token ends

    The first character cannot determine what kind of token we are
    going to read


    Symbol Table

    Stores information for subsequent phases

    Interface to the symbol table
      Insert(s,t): save lexeme s and token t and return a pointer
      Lookup(s): return the index of the entry for lexeme s, or 0
      if s is not found

    Implementation of the symbol table
      Fixed amount of space to store lexemes: not advisable as it
      wastes space
      Store lexemes in a separate array, each lexeme separated by
      eos; the symbol table has pointers to the lexemes
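The second scheme (a separate lexeme array with eos separators) might look like this in C. The array sizes and names are assumptions, entry 0 is reserved so lookup can return 0 for "not found", and insert returns an index rather than a raw pointer:

```c
#include <string.h>

#define MAXSYMS 100
#define MAXLEX  1000

/* Sketch: lexemes packed into one char array, separated by
 * '\0' (the "eos"); each table entry points into that array. */
static char lexemes[MAXLEX];
static int  lexused = 0;

static struct { char *lexptr; int token; } symtable[MAXSYMS];
static int nsyms = 1;            /* slot 0 reserved: "not found" */

int lookup(const char *s) {
    for (int i = 1; i < nsyms; i++)
        if (strcmp(symtable[i].lexptr, s) == 0)
            return i;
    return 0;
}

int insert(const char *s, int tok) {
    symtable[nsyms].lexptr = &lexemes[lexused];
    symtable[nsyms].token  = tok;
    strcpy(&lexemes[lexused], s);
    lexused += strlen(s) + 1;    /* keep the eos separator */
    return nsyms++;
}
```

A production table would also check for overflow and use hashing instead of linear search.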


    [Diagram: in the fixed-space scheme each entry holds the lexeme
    (usually 32 bytes) plus other attributes; in the second scheme
    lexeme1, lexeme2, lexeme3 sit in a separate array separated by
    eos, and each entry holds a pointer (usually 4 bytes) plus
    other attributes]


    How to handle keywords?

    Consider tokens DIV and MOD with lexemes div and mod

    Initialize the symbol table with insert("div", DIV) and
    insert("mod", MOD)

    Any subsequent lookup returns a nonzero value; therefore, div
    and mod cannot be used as identifiers
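A minimal sketch of this keyword check in C; the token codes DIV, MOD, ID and the tiny fixed table are assumptions standing in for the pre-loaded symbol table:

```c
#include <string.h>

#define DIV 257
#define MOD 258
#define ID  259

/* Keywords pre-loaded, as the slide's insert() calls would do. */
static struct { const char *lexeme; int token; } words[] = {
    { "div", DIV }, { "mod", MOD }, { 0, 0 }
};

/* Returns the keyword's token if the lexeme is a keyword,
 * otherwise the generic identifier token. */
int classify(const char *s) {
    for (int i = 0; words[i].lexeme; i++)
        if (strcmp(words[i].lexeme, s) == 0)
            return words[i].token;
    return ID;
}
```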


    Difficulties in design of lexical analyzers

    Is it as simple as it sounds?

    Lexemes in a fixed position: fixed format vs. free format
    languages

    Handling of blanks
      In Pascal blanks separate identifiers
      In Fortran blanks are important only in literal strings, for
      example the variable counter is the same as count er

    Another example
      DO 10 I = 1.25    DO10I=1.25
      DO 10 I = 1,25    DO10I=1,25


    The first line is the variable assignment DO10I=1.25

    The second line is the beginning of a Do loop

    Reading from left to right, one cannot distinguish between the
    two until the , or . is reached

    Fortran's white space and fixed format rules came into force
    due to punch cards and errors in punching


    PL/1 Problems

    Keywords are not reserved in PL/1
      if then then then = else; else else = then
      if if then then = then + 1

    PL/1 declarations
      Declare(arg1, arg2, arg3, ..., argn)

    Cannot tell whether Declare is a keyword or an array reference
    until after the )

    Requires arbitrary lookahead and very large buffers. Worse, the
    buffers may have to be reloaded.


    Problem continues even today!!

    C++ template syntax: Foo<Bar>

    C++ stream syntax: cin >> var;

    Nested templates: Foo<Bar<Bazz>>

    Can these problems be resolved by lexical analyzers alone?


    How to specify tokens?

    How to describe tokens
      2.e0   20.e-01   2.000

    How to break text into tokens
      if (x==0) a = x << 1;
      if (x==0) a = x < 1;


    How to describe tokens?

    Programming language tokens can be described by regular
    languages

    Regular languages
      Are easy to understand
      Have a well understood and useful theory
      Have efficient implementations

    Regular languages have been discussed in great detail in the
    Theory of Computation course


    Operations on languages

    L U M = {s | s is in L or s is in M}

    LM = {st | s is in L and t is in M}

    L* = union of Li such that 0 ≤ i < ∞,
    where L0 = {ε} and Li = Li-1 L


    Example

    Let L = {a, b, ..., z} and D = {0, 1, 2, ..., 9} then

    L U D is the set of letters and digits

    LD is the set of strings consisting of a letter followed by a
    digit

    L* is the set of all strings of letters, including ε

    L(L U D)* is the set of all strings of letters and digits
    beginning with a letter

    D+ is the set of strings of one or more digits
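As a sketch, membership in L(L U D)* — a letter followed by any string of letters and digits, i.e. the identifier language — can be tested directly in C (the function name is illustrative):

```c
#include <ctype.h>

/* Sketch: is s in L(L U D)*, where L = letters and D = digits?
 * The string must start with a letter; the rest may be letters
 * or digits. The empty string is rejected because L(L U D)*
 * always begins with a letter. */
int in_identifier_lang(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;
    return 1;
}
```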


    Notation

    Let Σ be a set of characters. A language over Σ is a set of
    strings of characters belonging to Σ

    A regular expression r denotes a language L(r)

    Rules that define the regular expressions over Σ
      ε is a regular expression that denotes {ε}, the set containing
      the empty string
      If a is a symbol in Σ then a is a regular expression that
      denotes {a}


    If r and s are regular expressions denoting the languages L(r)
    and L(s) then

      (r)|(s) is a regular expression denoting L(r) U L(s)

      (r)(s) is a regular expression denoting L(r)L(s)

      (r)* is a regular expression denoting (L(r))*

      (r) is a regular expression denoting L(r)


    Let Σ = {a, b}

    The regular expression a|b denotes the set {a, b}

    The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}

    The regular expression a* denotes the set of all strings
    {ε, a, aa, aaa, ...}

    The regular expression (a|b)* denotes the set of all strings
    containing ε and all strings of a's and b's

    The regular expression a|a*b denotes the set containing the
    string a and all strings consisting of zero or more a's
    followed by a b


    Precedence and associativity

    *, concatenation, and | are left associative

    * has the highest precedence

    Concatenation has the second highest precedence

    | has the lowest precedence


    How to specify tokens: regular definitions

    Let ri be a regular expression and di be a distinct name

    A regular definition is a sequence of definitions of the form
      d1 → r1
      d2 → r2
      ...
      dn → rn

    where each ri is a regular expression over Σ U {d1, d2, ..., di-1}


    Examples

    My fax number: 91-(512)-259-7586

    Σ = digits U {-, (, )}

    country  → digit+      (here digit2: 91)
    area     → ( digit+ )  (digit3: 512)
    exchange → digit+      (digit3: 259)
    phone    → digit+      (digit4: 7586)

    number → country - area - exchange - phone
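The regular definition above can be turned into a hand-written checker. This C sketch hard-codes the digit counts from the example (2, 3, 3 and 4); the helper names are illustrative:

```c
#include <ctype.h>

/* Consume exactly n digits, advancing the cursor. */
static int ndigits(const char **p, int n) {
    while (n-- > 0)
        if (!isdigit((unsigned char)*(*p)++)) return 0;
    return 1;
}

/* Consume one literal character. */
static int lit(const char **p, char c) { return *(*p)++ == c; }

/* number -> country '-' '(' area ')' '-' exchange '-' phone */
int is_fax_number(const char *s) {
    return ndigits(&s, 2) && lit(&s, '-') && lit(&s, '(')
        && ndigits(&s, 3) && lit(&s, ')') && lit(&s, '-')
        && ndigits(&s, 3) && lit(&s, '-')
        && ndigits(&s, 4) && *s == '\0';
}
```

Each piece of the definition becomes one call, in the same order as the regular definition; the short-circuit && stops at the first mismatch.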


    Examples

    My email: [email protected]

    Σ = letter U {@, .}

    letter  → a | b | ... | z | A | B | ... | Z
    name    → letter+
    address → name @ name . name . name


    Examples

    Identifier
      letter     → a | b | ... | z | A | B | ... | Z
      digit      → 0 | 1 | ... | 9
      identifier → letter(letter|digit)*

    Unsigned number in Pascal
      digit    → 0 | 1 | ... | 9
      digits   → digit+
      fraction → . digits | ε
      exponent → (E (+ | - | ε) digits) | ε
      number   → digits fraction exponent
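The Pascal unsigned-number definition translates directly into a small hand-written recognizer, one clause per definition; a C sketch with illustrative function names:

```c
#include <ctype.h>

/* digits -> digit+ : consume a nonempty run of digits. */
static int span_digits(const char **p) {
    if (!isdigit((unsigned char)**p)) return 0;
    while (isdigit((unsigned char)**p)) (*p)++;
    return 1;
}

/* number -> digits fraction exponent, where
 *   fraction -> '.' digits | eps
 *   exponent -> E (+|-|eps) digits | eps  */
int is_unsigned_number(const char *s) {
    if (!span_digits(&s)) return 0;
    if (*s == '.') {                       /* fraction */
        s++;
        if (!span_digits(&s)) return 0;
    }
    if (*s == 'E') {                       /* exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!span_digits(&s)) return 0;
    }
    return *s == '\0';
}
```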


    Regular expressions in specifications

    Regular expressions describe many useful languages

    Regular expressions are only specifications; an implementation
    is still required

    Given a string s and a regular expression R, does s ∈ L(R)?

    The solution to this problem is the basis of lexical analyzers

    However, just the yes/no answer is not enough

    Goal: partition the input into tokens


    1. Write a regular expression for the lexemes of each token
         number     → digit+
         identifier → letter(letter|digit)*

    2. Construct R matching all lexemes of all tokens
         R = R1 + R2 + R3 + ...

    3. Let the input be x1...xn; for 1 ≤ i ≤ n check x1...xi ∈ L(R)

    4. x1...xi ∈ L(R) implies x1...xi ∈ L(Rj) for some j;
       the smallest such j is the token class of x1...xi

    5. Remove x1...xi from the input; go to step 3


    The algorithm gives priority to tokens listed earlier
      Treats if as a keyword and not an identifier

    How much input is used? What if both
      x1...xi ∈ L(R) and x1...xj ∈ L(R)?
      Pick the longest possible string in L(R):
      the principle of maximal munch

    Regular expressions provide a concise and useful notation for
    string patterns

    Good algorithms require a single pass over the input
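A sketch in C of longest match with priority over a toy token set (the token codes and the two patterns are assumptions): >= beats > because it is longer, and if is a keyword only when the whole maximal alphanumeric run is exactly "if", so "ifx" stays an identifier:

```c
#include <ctype.h>
#include <string.h>

enum { TK_IF = 1, TK_ID, TK_GE, TK_GT };

/* Returns the token at the start of s and stores the length of
 * the longest matching lexeme in *len; 0 if nothing matches. */
int next_token(const char *s, int *len) {
    if (s[0] == '>') {                       /* >= beats > */
        if (s[1] == '=') { *len = 2; return TK_GE; }
        *len = 1; return TK_GT;
    }
    if (isalpha((unsigned char)s[0])) {
        int n = 0;
        while (isalnum((unsigned char)s[n])) n++;   /* munch */
        *len = n;
        if (n == 2 && strncmp(s, "if", 2) == 0)
            return TK_IF;            /* keyword listed first */
        return TK_ID;                /* e.g. "ifx" is an id  */
    }
    return 0;
}
```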


    How to break up text

    elsex=0
      else x = 0    or    elsex = 0 ?

    Regular expressions alone are not enough

    Normally the longest match wins

    Ties are resolved by prioritizing tokens

    Lexical definitions consist of regular definitions, priority
    rules and maximal munch


    Finite Automata

    Regular expressions are declarative specifications; a finite
    automaton is the implementation

    A finite automaton consists of
      An input alphabet Σ
      A set of states S
      A set of transitions state_i --a--> state_j
      A set of final states F
      A start state n

    A transition s1 --a--> s2 is read:
      in state s1 on input a go to state s2

    If the end of input is reached in a final state then accept;
    otherwise, reject
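As an implementation sketch, here is a table-driven automaton in C for the single token num → digit+ (two live states plus a dead state; the table layout is an assumption, not from the slides):

```c
/* States: 0 = start, 1 = seen one or more digits (final).
 * Input classes: 0 = digit, 1 = anything else.
 * -1 is the dead state: once entered, the string is rejected. */
static const int delta[2][2] = {
    /* digit  other */
    {  1,    -1 },   /* state 0 (start) */
    {  1,    -1 },   /* state 1 (final) */
};

/* Run the automaton over the whole string; accept iff we end
 * in the final state. */
int accepts_digits(const char *s) {
    int state = 0;
    for (; *s; s++) {
        int cls = (*s >= '0' && *s <= '9') ? 0 : 1;
        state = delta[state][cls];
        if (state < 0) return 0;     /* dead state: reject */
    }
    return state == 1;
}
```

A real lexer's table has one row per state and one column per character class, but the run loop is exactly this.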


    Pictorial notation

    [Diagram: a circle denotes a state, a double circle denotes a
    final state, and an arrow labelled a from state i to state j
    denotes a transition from i to j on input a]


    How to recognize tokens

    Consider
      relop → < | <= | = | <> | > | >=
      id    → letter(letter|digit)*
      num   → digit+ (. digit+)? (E(+|-)?digit+)?
      delim → blank | tab | newline
      ws    → delim+

    Construct an analyzer that will return <token, attribute> pairs


    Transition diagram for relops

    [Diagram: from the start state, on > move to a second state;
    there, on = accept with token relop, lexeme >=; on any other
    character accept with token relop, lexeme >, where the * marks
    a state that retracts the extra character. A separate path on
    = accepts with token relop, lexeme =]
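The > branch of the diagram can be sketched in C; the retract edge is modelled by reporting how many characters were actually consumed, and the token/attribute codes are assumptions:

```c
#define RELOP 1

enum { GT, GE };   /* attribute: which relop was seen */

/* The '>' part of the relop diagram: on '>' look at the next
 * character; '=' gives lexeme ">=", anything else is the starred
 * "other" edge, so only one character is consumed (the retract).
 * Returns RELOP on a match, 0 otherwise. */
int relop_gt(const char *s, int *attr, int *consumed) {
    if (s[0] != '>') return 0;
    if (s[1] == '=') { *attr = GE; *consumed = 2; }
    else             { *attr = GT; *consumed = 1; }  /* retract */
    return RELOP;
}
```

The <, <=, <> and = paths of the diagram follow the same pattern with their own states and attributes.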