nlexical
TRANSCRIPT
-
8/3/2019 nlexical
1/46
Lexical Analysis
Textbook:Modern Compiler DesignChapter 2.1
-
8/3/2019 nlexical
2/46
A motivating example
Create a program that counts the number of lines ina given input text file
-
8/3/2019 nlexical
3/46
Solution
int num_lines = 0;
%%
\n ++num_lines;
. ;
%%
main()
{
yylex();printf( "# of lines = %d\n", num_lines);
}
-
8/3/2019 nlexical
4/46
Solution
int num_lines = 0;
%%
\n ++num_lines;
. ;
%%
main()
{
yylex();printf( "# of lines = %d\n", num_lines);
}
initial
;
newline
-
8/3/2019 nlexical
5/46
-
8/3/2019 nlexical
6/46
Basic Compiler Phases
Source program (string)
Fin. Assembly
lexical analysis
syntax analysis
semantic analysis
Tokens
Abstract syntax tree
Fr
ont-End
Back-End
Annotated Abstract syntax tree
-
8/3/2019 nlexical
7/46
Example Tokens
Type Examples
ID foo n_14 last
NUM 73 00 517 082
REAL 66.1 .5 10. 1e67 5.5e-10
IF if
COMMA ,
NOTEQ !=
LPAREN (
RPAREN )
-
8/3/2019 nlexical
8/46
Example NonTokens
Type Examples
comment /* ignored */
preprocessor directive #include
#define NUMS 5, 6
macro NUMS
whitespace \t \n \b
-
8/3/2019 nlexical
9/46
Example
void match0(char *s) /* find a zero */
{
if (!strncmp(s, 0.0, 3))
return 0. ;
}
VOID ID(match0) LPAREN CHAR DEREF ID(s)
RPAREN LBRACE IF LPAREN NOT ID(strncmp)
LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3)
RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE
EOF
-
8/3/2019 nlexical
10/46
input
program text (file)
output
sequence of tokens Read input file
Identify language keywords and standard identifiers
Handle include files and macros
Count line numbers Remove whitespaces
Report illegal symbols
Produce symbol table
Lexical Analysis (Scanning)
-
8/3/2019 nlexical
11/46
Simplifies the syntax analysis
And language definition
Modularity
Reusability Efficiency
Why Lexical Analysis
-
8/3/2019 nlexical
12/46
What is a token? Defined by the programming language
Can be separated by spaces
Smallest units
Defined by regular expressions
-
8/3/2019 nlexical
13/46
A simplified scanner for CToken nextToken()
{char c ;
loop: c = getchar();
switch (c){
case ` `:goto loop ;
case `;`: return SemiColumn;case `+`: c = ungetc() ;
switch (c) {
case `+': return PlusPlus ;
case '= return PlusEqual;default: ungetc(c);
return Plus; }
case `
-
8/3/2019 nlexical
14/46
Regular Expressions
-
8/3/2019 nlexical
15/46
Escape characters in regular expressions
\ converts a single operator into text
a\+
(a\+\*)+
Double quotes surround text
a+*+
Esthetically ugly
But standard
-
8/3/2019 nlexical
16/46
Regular Descriptions EBNF where non-terminals are fully defined before first
use
letterp[a-zA-Z]digit p[0-9]underscore p_letter_or_digit p letter|digitunderscored_tail p underscore letter_or_digit+identifierp letter letter_or_digit* underscored_tail
token description
A token name
A regular expression
-
8/3/2019 nlexical
17/46
The Lexical Analysis Problem Given
A set of token descriptions
An input string
Partition the strings into tokens(class, value)
Ambiguity resolution The longest matching token
Between two equal length tokens select the first
-
8/3/2019 nlexical
18/46
A Flex specification of C ScannerLetter [a-zA-Z_]Digit [0-9]
%%
[ \t] {;}
[\n] {line_count++;}
; { return SemiColumn;}
++ { return PlusPlus ;}
+= { return PlusEqual ;}
+ { return Plus}
while { return While ; }{Letter}({Letter}|{Digit})* { return Id ;}
-
8/3/2019 nlexical
19/46
Flex Input
regular expressions and actions (C code)
Output
A scanner program that reads the input and
applies actions when input regular expression ismatched
flex
regular expressions
input program tokensscanner
-
8/3/2019 nlexical
20/46
-
8/3/2019 nlexical
21/46
Automatic Creation of Efficient Scanners
Nave approach on regular expressions
(dotted items)
Construct non deterministic finite
automaton over items
Convert to a deterministic
Minimize the resultant automaton
Optimize (compress) representation
-
8/3/2019 nlexical
22/46
Dotted Items
-
8/3/2019 nlexical
23/46
Example T p a+ b+
Input aab
After parsing aa
T p a+ b+
-
8/3/2019 nlexical
24/46
Item Types Shift item
In front of a basic pattern
A p (ab)+ c (de|fe)*
Reduce item
At the end of rhs
A p (ab)+ c (de|fe)* Basic item
Shift or reduce items
-
8/3/2019 nlexical
25/46
Character Moves For shift items character moves are simple
Tp E
cF
c
Tp E
c F
c
Digit p [0-9] 7 T p [0-9] 7
-
8/3/2019 nlexical
26/46
IMoves
For non-shift items the situation is more
complicated
What character do we need to see?
Where are we in the matching?
T p a*
T p (a*)
-
8/3/2019 nlexical
27/46
IMoves for Repetitions
Where can we get from T p E (R)* F
If R occurs zero times
T p E (R)* F
If R occurs one or more times
T p E ( R)* F
When R ends E ( R )* F
E (R)* F
E ( R)* F
I1
-
8/3/2019 nlexical
28/46
Slide 27
I1 IBM, 11/11/2003
-
8/3/2019 nlexical
29/46
I Moves
I2
-
8/3/2019 nlexical
30/46
Slide 28
I2 IBM, 11/11/2003
-
8/3/2019 nlexical
31/46
-
8/3/2019 nlexical
32/46
ConcurrentS
earch How to scan multiple token classes in a
single run?
-
8/3/2019 nlexical
33/46
I p [0-9]+
F p[0-9]*.[0-9]+
Input 3.1;
I p ([0-9])+ F p ([0-9])*.([0-9])+
I p ([0-9])+ F p ([0-9])*.([0-9])+ F p ([0-9])* .([0-9])+
I p ( [0-9] )+ F p ( [0-9] )*.([0-9])+
I p ([0-9])+ I p ( [0-9])+ F p ([0-9])*.([0-9])+ F p ( [0-9])* .([0-9])+
F p ( [0-9])*. ([0-9])+
-
8/3/2019 nlexical
34/46
The Need forB
acktracking A simple minded solution may require
unbounded backtracking
T1 p a+;T2 p a
Quadratic behavior
Does not occur in practice
A linear solution exists
-
8/3/2019 nlexical
35/46
A Non-Deterministic Finite State Machine
Add a production S p T1 | T2 | | Tn
Construct NDFA over the items
Initial state S p (T1 | T2 | | Tn)
For every character move, construct a character
transition
T p E cF
For every I move construct an I transition The accepting states are the reduce items
Accept the language defined by Ti
-
8/3/2019 nlexical
36/46
I Moves
I3
-
8/3/2019 nlexical
37/46
Slide 34
I3 IBM, 11/11/2003
-
8/3/2019 nlexical
38/46
I p [0-9]+
F p[0-9]*.[0-9]+ Sp(I|F)
Ip ([0-9]+)
Fp ([0-9]*).[0-9]+
Ip ([0-9])+
Ip ( [0-9])+
[0-9]
Ip ( [0-9])+
Fp ([0-9]*).[0-9]+ Fp ( [0-9]*) .[0-9]+
Fp ( [0-9] *).[0-9]+
[0-9]
Fp [0-9]*. ([0-9]+)
Fp [0-9]*. ([0-9]+)Fp [0-9]*. ( [0-9] +)
Fp [0-9]*. ( [0-9] +)
.
[0-9]
-
8/3/2019 nlexical
39/46
EfficientS
canners Construct Deterministic Finite Automaton
Every state is a set of items
Every transition is followed by an I-closure
When a set contains two reduce items select the onedeclared first
Minimize the resultant automaton Rejecting states are initially indistinguishable
Accepting states of the same token are indistinguishable Exponential worst case complexity
Does not occur in practice
Compress representation
-
8/3/2019 nlexical
40/46
I p [0-9]+
F p[0-9]*.[0-9]+
Sp(I|F)
Ip ([0-9]+)
Ip ([0-9])+
Fp ([0-9]*).[0-9]+
Fp ([0-9]*). [0-9]+
Fp ([0-9]*) . [0-9]+
Ip ( [0-9])+
Fp ( [0-9] *).[0-9]+
Ip ( [0-9])+
Ip ([0-9])+Fp ([0-9]*).[0-9]+
Fp ( [0-9]*) .[0-9]+
[0-9]
Sink
[^0-9.].|\n
[0-9]
Fp [0-9] *. ([0-9]+)Fp [0-9]*.([0-9]+)
.
.
Fp [0-9] *. ([0-9] +)
Fp [0-9]*.([0-9]+)
Fp [0-9]*.( [0-9]+)
[0-9]
[0-9]
[^0-9]
[^0-9]
[^0-9.]
-
8/3/2019 nlexical
41/46
A Linear-Time Lexical Analyzer
Procedure Get_Next_Token;
set Start of token to Read Index;
set End of last token to uninitialized
set Class of last token to uninitialized
set State to Initial
while state /=S
ink:Set ch to Input Char[Read Index];
Set state = H[state, ch];
if accepting(state):
set Class of last token to Class(state);
set End of last token to Read Index
set Read Index to Read Index + 1;
set token .class to Class of last token;
set token .repr to char[Start of token .. End last token];
set Read index to End last token + 1;
IMPORT Input Char [1..];
Set Read Index To 1;
-
8/3/2019 nlexical
42/46
S
canning 3.1;input state next
state
last
token
q3.1; 1 2 I
3 q.1; 2 3 I
3. q1; 3 4 F
3.1 q; 4 Sink F
1 Sink
[^0-9.]
2
[0-9]
[0-9]
3
..
[^0-9.]
4
[0-9]
[0-9]
[^0-9]I
F
-
8/3/2019 nlexical
43/46
S
canning aaa
input state next
state
last
token
qaaa$ 1 2 T1
a q aa$ 2 4 T1
a a q a$ 4 4 T1
a a a q $ 4 Sink T1
1T1 p a+;
T2 pa
Sink[^a]
[.\n]
2
[a]
[a]
3
;
[^a;]
T1
T1
[.\n]
4
;
[a]
[^a;]
-
8/3/2019 nlexical
44/46
Error Handling Illegal symbols
Common errors
-
8/3/2019 nlexical
45/46
-
8/3/2019 nlexical
46/46
Summary
For most programming languages lexical
analyzers can be easily constructed
automatically
Exceptions:
Fortran
PL/1
Lex/Flex/Jlex are useful beyond compilers