joey paquet, 2000, 20021 lecture 2 lexical analysis
Post on 13-Dec-2015
226 Views
Preview:
TRANSCRIPT
Joey Paquet, 2000, 2002 1
Lecture 2
Lexical Analysis
Joey Paquet, 2000, 2002 2
Part I
Building a Lexical Analyzer
Joey Paquet, 2000, 2002 3
Roles of the Scanner
• Removal of comments– Comments are not part of the program’s
meaning• Multiple-line comments?• Nested comments?
• Case conversion– Is the lexical definition case sensitive?
• For identifiers• For keywords
Joey Paquet, 2000, 2002 4
Roles of the Scanner
• Removal of white spaces– Blanks, tabulars, carriage returns and line
feeds– Is it possible to identify tokens in a program
without spaces?
• Interpretation of compiler directives– #include, #ifdef, #ifndef and #define
are directives to “redirect the input” of the compiler
– May be done by a precompiler
Joey Paquet, 2000, 2002 5
Roles of the Scanner
• Communication with the symbol table– A symbol table entry is created when an
identifier is encountered– The lexical analyzer cannot create the
whole entries
• Preparation of the output listing– Output the analyzed code– Output error messages and warnings– Output a table of summary data
Joey Paquet, 2000, 2002 6
Tokens and Lexemes
• Token: An element of the lexical definition of the language.
• Lexeme: A sequence of characters identified as a token.
idrelopopenparifthenassignopsemi
distance,rate,time,a,x>=,<,==)ifthen=;
Joey Paquet, 2000, 2002 7
Design of a Lexical Analyzer
• Steps
1- Construct a set of regular expressions (REs) that define the form of all valid token2- Derive an NDFA from the REs3- Derive a DFA from the NDFA4- Translate to a state transition table5- Implement the table6- Implement the algorithm to interpret the table
Joey Paquet, 2000, 2002 8
Regular Expressions
id ::= letter(letter|digit)*
: { }s : {s | s in s^}a : {a}r | s : {r | r in r^} or {s | s in s^}s* : {sn | s in s^ and n>=0}s+ : {sn | s in s^ and n>=1}
Joey Paquet, 2000, 2002 9
Derive NDFA from REs
• Could derive DFA from REs but: – Much easier to do NDFA, then derive DFA– No standard way of deriving DFAs from Res– Use Thompson’s construction (Louden’s)
letter
digit
letter
Joey Paquet, 2000, 2002 10
Derive DFA from NDFA
• Use subset construction (Louden’s)• May be optimized• Easier to implement:
– No edges– Determinist (no backtracking)
letter
letter
digit
[other]
digit
letter
letter
letter
letter
digit
digit
Joey Paquet, 2000, 2002 11
Generate State Transition Table
letter
letter
digit
[other]0 1 2
letter digit other final0 1 N1 1 1 2 N2 Y
Joey Paquet, 2000, 2002 12
Implementation Concerns
• Backtracking– Principle : A token is normally recognized
only when the next character is read.– Problem : Maybe this character is part of the
next token. – Example : x<1. “<“ is recognized only
when “1” is read. In this case, we have to backtrack on character to continue token recognition.
– Can include the occurrence of these cases in the state transition table.
Joey Paquet, 2000, 2002 13
Implementation Concerns
• Ambiguity– Problem : Some tokens’ lexemes are
subsets of other tokens. – Example :
•n-1. Is it <n><-><1> or <n><-1>?– Solutions :
• Postpone the decision to the syntactic analyzer• Do not allow sign prefix to numbers in the lexical
specification• Interact with the syntactic analyzer to find a
solution. (Induces coupling)
Joey Paquet, 2000, 2002 14
Example
• Alphabet : – {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
• Simple tokens :– {(, ), {, }, :, <, >}
• Composite tokens :– {:=, >=, <=, <>, (*, *)}
• Words : – id ::= letter(letter | digit)*– num ::= digit*
Joey Paquet, 2000, 2002 15
Example
• Ambiguity problems:
• Backtracking: – Must back up a character when we read a
character that is part of the next token.– Occurences are coded in the table
Character Possible tokens : :, := > >, >= < <, <=, <> ( (, (* * *, *)
Joey Paquet, 2000, 2002 16
l d
2 3
4 5
6 7
8 10 11
12 13
14 15
16
17 18
19
20
9
l
d
{
(
}
d
* * )
:
<
>
1sp
=
>
=
=
Final state with backtracking
Final state
Joey Paquet, 2000, 2002 17
l d { } ( * ) : = < > sp backup
1 2 4 6 19 8 19 19 12 19 14 17 1
2 2 2 3 3 3 3 3 3 3 3 3 3
3 1 1 1 1 1 1 1 1 1 1 1 1 yes [id]
4 5 4 5 5 5 5 5 5 5 5 5 5
5 1 1 1 1 1 1 1 1 1 1 1 1 yes [num]
6 6 6 6 7 6 6 6 6 6 6 6 6
7 1 1 1 1 1 1 1 1 1 1 1 1 no [{…}]
8 20 20 20 20 20 9 20 20 20 20 20 20
9 9 9 9 9 9 10 9 9 9 9 9 9
10 9 9 9 9 9 9 11 9 9 9 9 9
11 1 1 1 1 1 1 1 1 1 1 1 1 no [(*…*)]
12 20 20 20 20 20 20 20 20 13 20 20 20
13 1 1 1 1 1 1 1 1 1 1 1 1 no [:=]
14 20 20 20 20 20 20 20 20 15 20 16 20
15 1 1 1 1 1 1 1 1 1 1 1 1 no [<=]
16 1 1 1 1 1 1 1 1 1 1 1 1 no [<>]
17 20 20 20 20 20 20 20 20 18 20 20 20
18 1 1 1 1 1 1 1 1 1 1 1 1 no [>=]
19 1 1 1 1 1 1 1 1 1 1 1 1 no
20 1 1 1 1 1 1 1 1 1 1 1 1 yes [various]
Table
-dri
ven S
canner
(Table
)
Joey Paquet, 2000, 2002 18
nextToken() state = 0 token = null do lookup = nextChar() state = Table(state, lookup) if (isFinalState(state)) token = createToken() if (Table(state, “backup”) == yes) backupChar() until (token != null) return (token)
Table-driven Scanner (Algorithm)
Joey Paquet, 2000, 2002 19
Table-driven Scanner
• nextToken() – Extract the next token in the program
(called by syntactic analyzer)
• nextChar()– Read the next character (except spaces) in
the input program
• backupChar()– Backs up one character in the input file
Joey Paquet, 2000, 2002 20
Table-driven Scanner
• isFinalState(state)– Returns TRUE if state is a final state
• table(state, column)– Returns the value corresponding to
[state, column] in the state transition table.
• createToken()– Creates and returns a structure that
contains the token type, its location in the source code, and its value (for literals).
Joey Paquet, 2000, 2002 21
nextToken() c = nextChar() case (c) of "[a..z],[A..Z]": c = nextChar() while (c in {[a..z],[A..Z],[0..9]}) do s = makeUpString() c = nextChar() if ( isReservedWord(s) )then token = createToken(RESWORD,null) else token = createToken(ID,s) backupChar() "[0..9]": c = nextChar() while (c in [0..9]) do v = makeUpValue() c = nextChar() token = createToken(NUM,v) backupChar()
Hand-w
ritt
en
Sca
nner
Joey Paquet, 2000, 2002 22
"{": c = nextChar() while ( c != "}" ) do c = nextChar() token = createToken(LBRACK,null) "(": c = nextChar() if ( c == "*" ) then c = nextChar() repeat while ( c != "*" ) do c = nextChar() c = nextChar() until ( c != ")" ) return else token = createToken(LPAR,null) ":": c = nextChar() if ( c == "=" ) then token = createToken(ASSIGNOP,null) else token = createToken(COLON,null) backupChar()
Hand-w
ritt
en
Sca
nner
Joey Paquet, 2000, 2002 23
"<": c = nextChar() if ( c == "=" ) then token = createToken(LEQ,null) else if ( c == ">" ) then token = createToken(NEQ,null) else token = createToken(LT,null) backupChar() ">": c = nextChar() if ( c == "=" ) then token = createToken(GEQ,null) else token = createToken(GT,null) backupChar() ")": token = createToken(RPAR,null) "}": token = createToken(RBRACK,null) "*": token = createToken(STAR,null) "=": token = createToken(EQ,null) end case return token
Hand-w
ritt
en
Sca
nner
Joey Paquet, 2000, 2002 24
Part II
Error recovery in Lexical Analysis
Joey Paquet, 2000, 2002 25
Possible Lexical Errors
• Depends on the accepted conventions:– Invalid character– letter not allowed to terminate a number– numerical overflow– identifier too long– end of line before end of string– Are these lexical errors?
Joey Paquet, 2000, 2002 26
Accepted or Not?
• 123a– <Error< or <num><id>
• 123456789012345678901234567– <Error> related to machine’s limitations
• “Hello world <CR>– Either <CR> is skipped or <Error>
• ThisIsAVeryLongVariableName = 1 – Limit identifier length?
Joey Paquet, 2000, 2002 27
Error Recovery Techniques
• Finding only the first error is not acceptable
• Panic Mode: – Skip characters until a valid character is
read
• Guess Mode: – do pattern matching between erroneous
strings and valid strings– Example: (beggin vs. begin)– Rarely implemented
Joey Paquet, 2000, 2002 28
Conclusions
Joey Paquet, 2000, 2002 29
Possible Implementations
• Lexical Analyzer Generator (e.g. Lex)+safe, quick– Must learn software, unable to handle
unusual situations
• Table-Driven Lexical Analyzer+general and adaptable method, same
function can be used for all table-driven lexical analyzers
– Building transition table can be tedious and error-prone
Joey Paquet, 2000, 2002 30
Possible Implementations
• Hand-written+Can be optimized, can handle any unusual
situation, easy to build for most languages– Error-prone, not adaptable or maintainable
Joey Paquet, 2000, 2002 31
Lexical Analyzer’s Modularity
• Why should the Lexical Analyzer and the Syntactic Analyzer be separated? – Modularity/Maintainability : system is
more modular, thus more maintainable – Efficiency : modularity = task
specialization = easier optimization– Reusability : can change to whole lexical
analyzer without changing other parts
top related