
Lecture 2
Lexical Analysis

Joey Paquet, 2000, 2002

Part I

Building a Lexical Analyzer

Roles of the Scanner

• Removal of comments
  – Comments are not part of the program's meaning
  – Multiple-line comments?
  – Nested comments?
• Case conversion
  – Is the lexical definition case sensitive?
    • For identifiers
    • For keywords
• Removal of white space
  – Blanks, tabs, carriage returns and line feeds
  – Is it possible to identify tokens in a program without spaces?
• Interpretation of compiler directives
  – #include, #ifdef, #ifndef and #define are directives to "redirect the input" of the compiler
  – May be done by a precompiler
• Communication with the symbol table
  – A symbol table entry is created when an identifier is encountered
  – The lexical analyzer cannot create the whole entry by itself
• Preparation of the output listing
  – Output the analyzed code
  – Output error messages and warnings
  – Output a table of summary data

Tokens and Lexemes

• Token: an element of the lexical definition of the language.
• Lexeme: a sequence of characters identified as an instance of a token.

  Token       Lexemes
  id          distance, rate, time, a, x
  relop       >=, <, ==
  openpar     (
  if          if
  then        then
  assignop    =
  semi        ;

Design of a Lexical Analyzer

• Steps:
  1. Construct a set of regular expressions (REs) that define the form of all valid tokens
  2. Derive an NDFA from the REs
  3. Derive a DFA from the NDFA
  4. Translate the DFA into a state transition table
  5. Implement the table
  6. Implement the algorithm to interpret the table

Regular Expressions

id ::= letter(letter|digit)*

∅      : { }
s      : {s | s ∈ L(s)}
a      : {a}
r | s  : {r | r ∈ L(r)} ∪ {s | s ∈ L(s)}
s*     : {sⁿ | s ∈ L(s) and n ≥ 0}
s+     : {sⁿ | s ∈ L(s) and n ≥ 1}
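For illustration, the id definition can be written directly with Python's re module (a minimal sketch; taking letter as [a-z] and digit as [0-9], as in the example alphabet used later):

import re

# id ::= letter(letter|digit)*, with letter = [a-z] and digit = [0-9]
ID_RE = re.compile(r"[a-z][a-z0-9]*")

print(ID_RE.fullmatch("rate2") is not None)   # True: a letter, then letters/digits
print(ID_RE.fullmatch("2rate") is not None)   # False: must start with a letter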

Derive NDFA from REs

• Could derive the DFA directly from the REs, but:
  – It is much easier to build an NDFA first, then derive the DFA from it
  – There is no standard way of deriving DFAs directly from REs
  – Use Thompson's construction (Louden's)

[Figure: NFA for id ::= letter(letter|digit)* — a letter edge leading into a loop that repeats on letter and digit edges.]
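A minimal Python sketch of a Thompson-style construction, covering just the operators needed for id ::= letter(letter|digit)* (the fragment representation, state numbering and function names are this sketch's assumptions, not Louden's code):

# Each fragment is (start, accept, transitions); symbol None means an epsilon edge.
counter = 0
def new_state():
    global counter
    counter += 1
    return counter

def symbol(sym):                      # NFA for a single symbol class
    s, f = new_state(), new_state()
    return (s, f, [(s, sym, f)])

def concat(a, b):                     # a then b
    return (a[0], b[1], a[2] + b[2] + [(a[1], None, b[0])])

def alt(a, b):                        # a | b
    s, f = new_state(), new_state()
    return (s, f, a[2] + b[2] + [(s, None, a[0]), (s, None, b[0]),
                                 (a[1], None, f), (b[1], None, f)])

def star(a):                          # a*
    s, f = new_state(), new_state()
    return (s, f, a[2] + [(s, None, a[0]), (a[1], None, a[0]),
                          (s, None, f), (a[1], None, f)])

# id ::= letter(letter|digit)*
nfa = concat(symbol("letter"), star(alt(symbol("letter"), symbol("digit"))))
print("start:", nfa[0], "accept:", nfa[1], "transitions:", len(nfa[2]))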

Derive DFA from NDFA

• Use the subset construction (Louden's)
• The result may be optimized
• Easier to implement:
  – No ε-edges
  – Deterministic (no backtracking)

[Figure: the NFA for id and the DFA obtained from it by subset construction; the DFA loops on letter and digit edges and exits on any other character.]
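A sketch of the subset construction in Python, run on a hand-coded Thompson-style NFA for letter(letter|digit)* (the state numbering and helper names are illustrative assumptions):

# state -> list of (symbol, target); "" marks an epsilon edge.
NFA = {
    0: [("letter", 1)],
    1: [("", 2), ("", 6)],
    2: [("letter", 3), ("digit", 4)],
    3: [("", 5)],
    4: [("", 5)],
    5: [("", 2), ("", 6)],
    6: [],                            # accepting state
}
ACCEPT = {6}

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        for sym, t in NFA[stack.pop()]:
            if sym == "" and t not in seen:
                seen.add(t); stack.append(t)
    return frozenset(seen)

def move(states, sym):
    return frozenset(t for s in states for sy, t in NFA[s] if sy == sym)

start = eps_closure({0})
dfa, todo = {}, [start]
while todo:                           # each DFA state is a set of NFA states
    S = todo.pop()
    dfa[S] = {}
    for sym in ("letter", "digit"):
        T = eps_closure(move(S, sym))
        if T:
            dfa[S][sym] = T
            if T not in dfa and T not in todo:
                todo.append(T)

for S, edges in dfa.items():
    final = "final" if S & ACCEPT else ""
    print(sorted(S), {k: sorted(v) for k, v in edges.items()}, final)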

Generate State Transition Table

[Figure: DFA with states 0, 1 and 2 — a letter edge from 0 to 1, letter and digit loops on 1, and an [other] edge from 1 to the final state 2.]

  state   letter   digit   other   final
    0       1        -       -       N
    1       1        1       2       N
    2       -        -       -       Y
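The same table can be written down as data and traced; a minimal Python sketch (the classify helper and the input classes are assumptions, and error entries are simply omitted):

TABLE = {0: {"letter": 1},
         1: {"letter": 1, "digit": 1, "other": 2}}
FINAL = {2}   # reached on [other]; one character of backup is then needed

def classify(c):
    return "letter" if c.isalpha() else "digit" if c.isdigit() else "other"

state = 0
for c in "ab9 ":
    state = TABLE[state][classify(c)]
    print(repr(c), "->", state, "(final)" if state in FINAL else "")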

Implementation Concerns

• Backtracking
  – Principle: a token is normally recognized only when the next character is read.
  – Problem: maybe this character is part of the next token.
  – Example: x<1. "<" is recognized only when "1" is read. In this case, we have to back up one character to resume token recognition.
  – The occurrence of these cases can be encoded in the state transition table.

• Ambiguity
  – Problem: some tokens' lexemes are prefixes of other tokens' lexemes.
  – Example: n-1. Is it <n><-><1> or <n><-1>?
  – Solutions:
    • Postpone the decision to the syntactic analyzer
    • Do not allow a sign prefix on numbers in the lexical specification
    • Interact with the syntactic analyzer to find a solution (induces coupling)

Example

• Alphabet:
  – {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
• Simple tokens:
  – {(, ), {, }, :, <, >}
• Composite tokens:
  – {:=, >=, <=, <>, (*, *)}
• Words:
  – id ::= letter(letter | digit)*
  – num ::= digit*

• Ambiguity problems (see the sketch after this list):

    Character   Possible tokens
    :           :   :=
    >           >   >=
    <           <   <=   <>
    (           (   (*
    *           *   *)

• Backtracking:
  – Must back up one character when we read a character that is part of the next token.
  – Occurrences are coded in the table.
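One possible way to record these ambiguous cases as data (an illustrative sketch, not the lecture's encoding): map each ambiguous first character to its possible continuations, keep the two-character token when it applies, and back up over the lookahead otherwise.

CONTINUATIONS = {
    ":": (":=",),
    ">": (">=",),
    "<": ("<=", "<>"),
    "(": ("(*",),
    "*": ("*)",),
}

def longest_token(text, i):
    pair = text[i:i+2]
    if pair in CONTINUATIONS.get(text[i], ()):
        return pair        # composite token: the lookahead character is consumed
    return text[i]         # simple token: the scanner backs up over the lookahead

print(longest_token("x := 1", 2))   # ":="
print(longest_token("x : y", 2))    # ":"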

[Figure: state transition diagram for the example language — states 1 to 20, with edges labeled l (letter), d (digit), {, }, (, *, ), :, =, <, > and sp (space); final states are marked, some of them requiring backtracking.]

Table-driven Scanner (Table)

  state    l    d    {    }    (    *    )    :    =    <    >   sp   backup
    1      2    4    6   19    8   19   19   12   19   14   17    1
    2      2    2    3    3    3    3    3    3    3    3    3    3
    3      1    1    1    1    1    1    1    1    1    1    1    1   yes [id]
    4      5    4    5    5    5    5    5    5    5    5    5    5
    5      1    1    1    1    1    1    1    1    1    1    1    1   yes [num]
    6      6    6    6    7    6    6    6    6    6    6    6    6
    7      1    1    1    1    1    1    1    1    1    1    1    1   no  [{…}]
    8     20   20   20   20   20    9   20   20   20   20   20   20
    9      9    9    9    9    9   10    9    9    9    9    9    9
   10      9    9    9    9    9    9   11    9    9    9    9    9
   11      1    1    1    1    1    1    1    1    1    1    1    1   no  [(*…*)]
   12     20   20   20   20   20   20   20   20   13   20   20   20
   13      1    1    1    1    1    1    1    1    1    1    1    1   no  [:=]
   14     20   20   20   20   20   20   20   20   15   20   16   20
   15      1    1    1    1    1    1    1    1    1    1    1    1   no  [<=]
   16      1    1    1    1    1    1    1    1    1    1    1    1   no  [<>]
   17     20   20   20   20   20   20   20   20   18   20   20   20
   18      1    1    1    1    1    1    1    1    1    1    1    1   no  [>=]
   19      1    1    1    1    1    1    1    1    1    1    1    1   no
   20      1    1    1    1    1    1    1    1    1    1    1    1   yes [various]

Table-driven Scanner (Algorithm)

nextToken()
    state = 0
    token = null
    do
        lookup = nextChar()
        state  = Table(state, lookup)
        if ( isFinalState(state) )
            token = createToken()
            if ( Table(state, "backup") == yes )
                backupChar()
    until ( token != null )
    return (token)
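A direct Python rendering of this algorithm, driven by the small id table from earlier (a sketch: the "token" returned is just the lexeme, error entries are omitted, and a sentinel space keeps nextChar from running off the end of the input):

TABLE  = {0: {"letter": 1}, 1: {"letter": 1, "digit": 1, "other": 2}}
FINAL  = {2}
BACKUP = {2: True}                    # the "backup" column of the table

def classify(c):
    return "letter" if c.isalpha() else "digit" if c.isdigit() else "other"

class Scanner:
    def __init__(self, text):
        self.text, self.pos = text + " ", 0
    def next_char(self):
        c = self.text[self.pos]; self.pos += 1
        return c
    def backup_char(self):
        self.pos -= 1
    def next_token(self):
        state, token, start = 0, None, self.pos
        while token is None:                      # do ... until (token != null)
            state = TABLE[state][classify(self.next_char())]
            if state in FINAL:
                if BACKUP.get(state):             # Table(state, "backup") == yes
                    self.backup_char()
                token = self.text[start:self.pos] # createToken()
        return token

print(Scanner("rate2").next_token())  # -> "rate2"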

Table-driven Scanner

• nextToken()
  – Extracts the next token in the program (called by the syntactic analyzer)
• nextChar()
  – Reads the next character (skipping spaces) in the input program
• backupChar()
  – Backs up one character in the input file
• isFinalState(state)
  – Returns TRUE if state is a final state
• table(state, column)
  – Returns the value corresponding to [state, column] in the state transition table
• createToken()
  – Creates and returns a structure that contains the token type, its location in the source code, and its value (for literals)

Hand-written Scanner

nextToken()
    c = nextChar()
    case (c) of
        "[a..z],[A..Z]":
            c = nextChar()
            while ( c in {[a..z],[A..Z],[0..9]} ) do
                s = makeUpString()
                c = nextChar()
            if ( isReservedWord(s) ) then
                token = createToken(RESWORD,null)
            else
                token = createToken(ID,s)
            backupChar()
        "[0..9]":
            c = nextChar()
            while ( c in [0..9] ) do
                v = makeUpValue()
                c = nextChar()
            token = createToken(NUM,v)
            backupChar()

        "{":
            c = nextChar()
            while ( c != "}" ) do
                c = nextChar()
            token = createToken(LBRACK,null)
        "(":
            c = nextChar()
            if ( c == "*" ) then
                c = nextChar()
                repeat
                    while ( c != "*" ) do
                        c = nextChar()
                    c = nextChar()
                until ( c == ")" )
                return
            else
                token = createToken(LPAR,null)
        ":":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(ASSIGNOP,null)
            else
                token = createToken(COLON,null)
                backupChar()

        "<":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(LEQ,null)
            else if ( c == ">" ) then
                token = createToken(NEQ,null)
            else
                token = createToken(LT,null)
                backupChar()
        ">":
            c = nextChar()
            if ( c == "=" ) then
                token = createToken(GEQ,null)
            else
                token = createToken(GT,null)
                backupChar()
        ")": token = createToken(RPAR,null)
        "}": token = createToken(RBRACK,null)
        "*": token = createToken(STAR,null)
        "=": token = createToken(EQ,null)
    end case
    return token

Part II

Error Recovery in Lexical Analysis

Possible Lexical Errors

• Depends on the accepted conventions:
  – Invalid character
  – Letter not allowed to terminate a number
  – Numerical overflow
  – Identifier too long
  – End of line reached before the end of a string
• Are these lexical errors?

Accepted or Not?

• 123a
  – <Error> or <num><id>?
• 123456789012345678901234567
  – <Error> related to the machine's limitations
• "Hello world <CR>
  – Either the <CR> is skipped or <Error>
• ThisIsAVeryLongVariableName = 1
  – Limit identifier length?

Error Recovery Techniques

• Finding only the first error is not acceptable
• Panic mode:
  – Skip characters until a valid character is read (see the sketch after this list)
• Guess mode:
  – Do pattern matching between erroneous strings and valid strings
  – Example: beggin vs. begin
  – Rarely implemented
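A minimal sketch of panic mode in Python (the valid-character set below is an assumption based on the example alphabet): report the offending character once, then skip ahead until a character of the alphabet is found.

VALID = set("abcdefghijklmnopqrstuvwxyz0123456789:*=()<>{} \n\t")

def skip_invalid(text, i, errors):
    # On an invalid character, record one error and skip to the next valid one.
    if text[i] not in VALID:
        errors.append("invalid character %r at %d" % (text[i], i))
        while i < len(text) and text[i] not in VALID:
            i += 1
    return i

errs = []
print(skip_invalid("x := @@@ 1", 5, errs), errs)   # 8, one error reported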

Conclusions

Possible Implementations

• Lexical analyzer generator (e.g. Lex)
  + Safe, quick
  – Must learn the software; unable to handle unusual situations
• Table-driven lexical analyzer
  + General and adaptable method; the same driver function can be used for all table-driven lexical analyzers
  – Building the transition table can be tedious and error-prone
• Hand-written
  + Can be optimized, can handle any unusual situation, easy to build for most languages
  – Error-prone, not adaptable or maintainable

Lexical Analyzer's Modularity

• Why should the lexical analyzer and the syntactic analyzer be separated?
  – Modularity/maintainability: the system is more modular, thus more maintainable
  – Efficiency: modularity means task specialization, which enables easier optimization
  – Reusability: the whole lexical analyzer can be changed without affecting other parts
