some slides borrowed from m scherger - texas a&m ...sking/courses/compilers/slides/lex.pdflex...
Lex/Flex: A Scanner Generator in C
Fall 2012 Introduction to lex (or flex) 2
Regular Expression
Nondeterministic Finite Automaton
Deterministic Finite Automaton
Table-driven Scanner
So why not do this with a tool?
Thompson’s Construction
“Subset” Construction
Lex
Lex is one such tool for creating lexical analyzers
M. E. Lesk and E. Schmidt 1975
Lexical analyzers tokenize input streams
Regular expressions define tokens
Tokens are the terminals of a language
Converts regular expressions into DFAs
DFAs are implemented as table driven state machines
Some versions of Lex are proprietary and so not all versions of *nix come with an open source version
flex – Fast Lexical Analyzer is an open source version
Vern Paxson
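To make "DFAs are implemented as table driven state machines" concrete, here is a minimal hand-written sketch for the single pattern [0-9]+. The state names, character classes, and the function dfa_accepts are assumptions made for this illustration; the tables Lex actually generates are larger and laid out differently.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical DFA for the regular expression [0-9]+ . */
enum { S_START = 0, S_DIGITS = 1, S_ERROR = 2, NUM_STATES = 3 };
enum { C_DIGIT = 0, C_OTHER = 1, NUM_CLASSES = 2 };

/* transition[state][character class] gives the next state. */
static const int transition[NUM_STATES][NUM_CLASSES] = {
    /* S_START  */ { S_DIGITS, S_ERROR },
    /* S_DIGITS */ { S_DIGITS, S_ERROR },
    /* S_ERROR  */ { S_ERROR,  S_ERROR },
};

/* Only S_DIGITS is an accepting state. */
static const int accepting[NUM_STATES] = { 0, 1, 0 };

/* Returns 1 if the whole string s is in the language of [0-9]+, else 0. */
int dfa_accepts(const char *s)
{
    int state = S_START;
    for (size_t i = 0; s[i] != '\0'; i++) {
        int cls = isdigit((unsigned char)s[i]) ? C_DIGIT : C_OTHER;
        state = transition[state][cls];
    }
    return accepting[state];
}
```

The loop never inspects the pattern itself; all of the matching logic lives in the two tables, which is what makes the approach easy to generate mechanically.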
The Basic Process
Lex source program (any.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> Sequence of tokens
Format of a lex File
Definitions
%%
Rules
%%
User code
1st section holds declarations of simple name definitions and start conditions
2nd section holds pattern-action pairs
3rd section (C code and comments) is copied directly to lex.yy.c
Typical file extensions: .l .lex .flex
Compiling and Running
> flex linenos.flex
> gcc lex.yy.c -lfl
> ./a.out < infile > outfile
The yywrap() issue: the generated scanner calls yywrap() at end of input, so link with -lfl or define yywrap() yourself
Regular Expressions and Lex
A regular expression is an expression that matches sets of strings (the “language” of the regular expression).
In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations choice (|), concatenation (no operator), and repetition (*).
A regular expression may also contain certain other metasymbols:
parentheses for grouping (to change precedence, just as in arithmetic)
others as needed to extend the operator set in useful ways
Regular Expressions in Lex
c - c is a single character
Matches the character c
\c - c is a single character
Use this to escape special characters
"str" - str is a string
Matches the entire string str
[str] - str is a string
Matches any single character from str
RE         Matches
A          A
x          x
d          d
\.         .
\n         newline
\t         tab
"Abc"      Abc
"The"      The
[aeiou]    Lowercase vowels
[abcde]    The letters a to e
Regular Expressions – Character Classes
[x-y] - x and y are characters
Matches any single character in the range x-y
Ranges can be combined
[^str] - str is a string
Matches any single character not in str
RE             Matches
[a-z]          All lowercase letters
[0-9]          All digits
[a-df-z]       Lowercase letters except e
[a-z0-9A-Z]    Alphanumeric characters
[A-Zaeiou]     Uppercase letters and lowercase vowels
[^ \n\t]       Any non-whitespace character
[^aeiou]       Any character except a lowercase vowel
Regular Expressions
p* - p is a pattern
Zero or more occurrences of p
p+ - p is a pattern
One or more occurrences of p
RE       Matches
A*       (empty) A AA AAA ...
r*       (empty) r rr ...
ab*c*    a ab ac abb abc acc abbb abbc abcc accc ...
A+       A AA AAA AAAA ...
ab+      ab abb abbb ...
a*b+     b ab bb aab abb bbb ...
Regular Expressions
p? - p is a pattern
Zero or one occurrences of p
p{m,n} - p is a pattern, m and n are ints
Matches m through n occurrences of p
If ,n is omitted (p{m}), matches exactly m occurrences; if only n is omitted (p{m,}), matches m or more
RE        Matches
A?        (empty) A
ab?c?     a ab ac abc
a{1,3}    a aa aaa
a{1,1}    a
a{1}      a
a{3,}     aaa aaaa aaaaa ...
Regular Expressions
p1p2 - p1 and p2 are patterns
Matches p1 followed by p2
(p) - p is a pattern
Used to override precedence (group things)
p1|p2 - p1 and p2 are patterns
Matches either p1 or p2
Notice precedence
RE          Matches
ab          ab
a+b+        ab aab abb ...
(abc)+      abc abcabc abcabcabc ...
abc+        abc abcc abccc ...
a|an|the    a an the
ba|ed       ba ed
b(a|e)d     bed bad
Regular Expression - Extra Things
p1/p2 - p1 and p2 are patterns
Matches p1 only if it's followed by p2 (trailing context)
p2 is not part of yytext
RE: a+/bc
Input: aaabc bc aaaad
Matches only the first aaa
^p - p is a pattern
Matches p only if it is at the start of a line
p$ - p is a pattern
Matches p only if it is at the end of a line
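The trailing-context rule a+/bc can be sketched by hand as below. This is only an illustration of the idea (check the lookahead, but leave it out of the match); the function name match_a_plus_before_bc is made up for this sketch and is not part of Lex.

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the rule a+/bc: returns the length of the run of 'a's at the
 * start of s if that run is immediately followed by "bc", otherwise 0.
 * The lookahead "bc" is checked but not counted in the returned length,
 * mirroring how p2 is left out of yytext.
 */
size_t match_a_plus_before_bc(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a')
        n++;
    if (n == 0)
        return 0;               /* a+ needs at least one 'a' */
    return strncmp(s + n, "bc", 2) == 0 ? n : 0;
}
```

On the three inputs from the slide, only aaabc yields a match, and its length is 3 (the aaa), not 5.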
Two more complex examples
[-+]?[0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?
or:
nat = [0-9]+
signedNat = [-+]? nat
number = signedNat(\. nat)?([Ee] signedNat)?
C comments
/\*/*(\**[^/*]/*)*\**\*/
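Both patterns can be tried outside of Lex. The sketch below uses the POSIX <regex.h> interface (regcomp, regexec, regfree), assuming a POSIX system; these two patterns happen to mean the same thing in POSIX extended syntax, although Lex and POSIX syntax differ in general. The helper full_match and the names NUMBER_RE and COMMENT_RE are introduced here for illustration only.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole of text matches the POSIX extended regex pattern. */
int full_match(const char *pattern, const char *text)
{
    char anchored[512];
    regex_t re;
    int n, ok;

    /* Wrap in ^( ... )$ so that only whole-string matches count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || n >= (int)sizeof anchored)
        return 0;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

/* The two patterns from the slide, with backslashes doubled for C strings. */
const char *NUMBER_RE  = "[-+]?[0-9]+(\\.[0-9]+)?([Ee][-+]?[0-9]+)?";
const char *COMMENT_RE = "/\\*/*(\\**[^/*]/*)*\\**\\*/";
```

For example, full_match(NUMBER_RE, "-12.5e+3") succeeds while full_match(NUMBER_RE, "12.") does not, because the fractional part requires at least one digit after the dot.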
Definitions
Definitions are of the form:
name definition
A name begins with a letter or underscore followed by 0 or more letters, digits, '-', or '_'.
You access it with {name}
Example definitions:
Digit [0-9]
Char [A-Z]
AlphaNum [a-zA-Z0-9]
ws [ \n\t]
IntegerConst [0-9]+
Definitions Example
Digit [0-9]
Char [a-zA-Z]
AlphaNum [a-zA-Z0-9]
%%
{Digit}+"."{Digit}+
({Char}|_)({AlphaNum}|[_-])* {printf("A name '%s'\n", yytext);}
%%
Rules
Rules are of the form:
pattern action
pattern is the RE to match and action is what to do when it is matched
The default rule echoes any unmatched input
Lex matches the longest string possible
If there is a tie, it uses the 1st matching rule in the spec
Actions can be empty (do nothing)
Actions can be complex
Use {} if multi-lined
Don't forget the ';'s
yytext contains the string matched
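The longest-match and first-rule tie-breaking behaviour can be sketched by hand as follows. The two "rules" (the keyword if and the pattern [a-z]+) and the function pick_rule are hypothetical stand-ins chosen for this illustration; Lex does the equivalent work inside its generated DFA.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical "rule" reports how many leading characters of s it matches. */
static size_t match_keyword_if(const char *s)    /* the literal if */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_identifier(const char *s)    /* [a-z]+ */
{
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

typedef size_t (*matcher_fn)(const char *);

/* Order matters: earlier entries play the role of earlier rules in the spec. */
static const matcher_fn rules[] = { match_keyword_if, match_identifier };

/*
 * Returns the index of the winning rule, or -1 if no rule matches.
 * The matched length is stored in *len.
 */
int pick_rule(const char *s, size_t *len)
{
    int best = -1;
    *len = 0;
    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        size_t n = rules[i](s);
        if (n > *len) {     /* strictly longer only, so the earlier rule wins ties */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On the input if both rules match two characters and the keyword rule wins because it is listed first; on iffy the identifier rule wins because it matches four characters.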
Example Rules
\n linecount++;
[0-9]+ sum += atoi(yytext);
{ws}+
a|an|the printf("found an article\n");
[aeiou]+ { printf("A string of vowels\n"); vcnt++; }
Predefined Rules
ECHO
Copy yytext to output
[a-z]+ ECHO;
REJECT
Go to the next alternative: the second-choice rule is selected and its action is taken
she s++;
he h++;
These rules won't count the he embedded in she
she {s++; REJECT;}
he {h++; REJECT;}
\n
But these will
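The counts that the REJECT version is meant to produce can be reproduced with a plain C loop that tests every input position for both words. This is a sketch of the intended result only, not of how REJECT is implemented; the function name count_she_he is made up for this example.

```c
#include <string.h>

/*
 * Counts occurrences of "she" and "he" in text, including the "he"
 * embedded in every "she". Each position is tested against both words,
 * which is the effect the REJECT rules aim for.
 */
void count_she_he(const char *text, unsigned *s, unsigned *h)
{
    *s = 0;
    *h = 0;
    for (const char *p = text; *p != '\0'; p++) {
        if (strncmp(p, "she", 3) == 0)
            (*s)++;
        if (strncmp(p, "he", 2) == 0)
            (*h)++;
    }
}
```

For the input she he this gives s = 1 and h = 2, whereas the rules without REJECT would report h = 1.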
Rules Example
ex1.l
%%
a*b printf("Token 1 found\n");
c+ printf("Token 2 found\n");
%%
int main() {
yylex();
return 0;
}
The commands
lex ex1.l produces lex.yy.c
cc -o ex1 lex.yy.c -ll creates the executable
May need -lfl if using flex
./ex1 to execute
Default is stdin and stdout, so type
aaaaaaabbccd <return>
Token 1 found
Token 1 found
Token 2 found
d
An Example Count chars, words, lines
%{
unsigned ccnt = 0, wcnt = 0, lcnt = 0;
%}
word [^ \t\n]+
eol \n
%%
{word} { wcnt++; ccnt += yyleng; }
{eol} { ccnt++; lcnt++; }
. ccnt++;
%%
int main() { yylex(); return 0; }
The %{ %} pair allows you to make declarations for your lexer
About lex
Lex uses some predefined functions stored in the lex library (link with -ll or -lfl)
By default lex copies input to output
By default lex reads stdin, writes stdout
Lex reads its input (a lex script) and produces lex.yy.c
Use %{ and %} in the definitions section to declare globals and put #includes
You can use flex instead
Not all 'lex'es are equal!
Man page has more info!
Example 1: The Simplest Example
The simplest example of a lex program is a scanner that acts like the UNIX `cat` program
%%
.|\n ECHO;
%%
Or it could be written as...
%%
. ECHO;
\n ECHO;
%%
Flex Internal Names
Lex internal name      Meaning/Use
lex.yy.c or lexyy.c    Lex output file name
yylex                  Lex scanning routine
yytext                 string matched on current action
yyleng                 length of yytext
yyin                   Lex input file (default: stdin)
yyout                  Lex output file (default: stdout)
input                  Lex buffered input routine
ECHO                   Lex default action (print yytext to yyout)
See the Flex documentation for others
Flex Operational Conventions
yylex() runs until it is stopped by a return
ambiguity is resolved by longest match, then by rule order
any text not explicitly matched is echoed to stdout
EOF is automatically matched and returns 0 from yylex()
(unless yywrap() is suitably defined)
yylex() returns an int which can be a token
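The calling convention (call the scanner repeatedly, stop when it returns 0) can be sketched with a hand-written stand-in for yylex(). Everything named here (next_token, set_input, count_tokens, and the token codes) is hypothetical and exists only for this illustration; a real Lex scanner would be generated from rules and would read from yyin.

```c
#include <ctype.h>

enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_WORD = 2 };   /* hypothetical token codes */

static const char *cursor = "";   /* stands in for the input file yyin */

void set_input(const char *text) { cursor = text; }

/* Stand-in for yylex(): returns the next token code, or 0 at end of input. */
int next_token(void)
{
    for (;;) {
        unsigned char c = (unsigned char)*cursor;
        if (c == '\0')
            return TOK_EOF;
        if (isdigit(c)) {
            while (isdigit((unsigned char)*cursor))
                cursor++;
            return TOK_NUMBER;
        }
        if (isalpha(c)) {
            while (isalpha((unsigned char)*cursor))
                cursor++;
            return TOK_WORD;
        }
        cursor++;   /* no return here, so scanning simply continues */
    }
}

/* Typical driver loop: keep calling until the scanner reports end of input. */
int count_tokens(const char *text)
{
    int n = 0;
    set_input(text);
    while (next_token() != TOK_EOF)
        n++;
    return n;
}
```

Characters that match neither branch are skipped silently in this sketch; a Lex scanner would echo them by default instead.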
Example 2: wc
Here is a scanner that is similar to the UNIX `wc` command
%{
#include <stdio.h>
unsigned charCount = 0, wordCount = 0, lineCount = 0;
%}
%%
[^ \t\n]+ { wordCount++; charCount += yyleng; }
\n { charCount++; lineCount++; }
. charCount++;
%%
int main()
{
yylex();
printf("%u %u %u\n", charCount, wordCount, lineCount);
return 0;
}
Example 3: Line Numbers (p. 84)
%{
/* a Lex program that adds line numbers
to lines of stdin, printing to stdout */
#include <stdio.h>
int lineno = 1;
%}
line .*\n
%%
{line} { printf("%5d %s",lineno++,yytext); }
%%
int main()
{ yylex(); return 0; }
Example 4: (pp. 86-87)
%{
/* Selects only lines that end or begin with the letter 'a'. */
#include <stdio.h>
%}
ends_with_a .*a\n
begins_with_a a.*\n
%%
{ends_with_a} ECHO;
{begins_with_a} ECHO;
.*\n ;
%%
int main()
{ yylex(); return 0; }
Example 5: wc again!
%{
#include <stdio.h>
#include <stdlib.h>
unsigned charCount = 0, wordCount = 0, lineCount = 0;
%}
word [^ \t\n]+
eol \n
%%
{word} { wordCount++; charCount += yyleng; }
{eol} { charCount++; lineCount++; }
. charCount++;
Example 5: wc again! (cont.)
%%
int main(int argc,char *argv[])
{
if (argc > 1) {
FILE *file;
file = fopen(argv[1], "r");
if (!file) {
fprintf(stderr,"could not open %s\n",argv[1]);
exit(1);
}
yyin = file;
}
yylex();
printf("%u %u %u\n", charCount, wordCount, lineCount);
return 0;
}
Example 6: html (not in book)
%{
/* a Lex program that produces html, making
all C comments italic */
#include <stdio.h>
%}
%%
"/*" { printf("<i><font color=\"blue\">/*"); }
"*/" { printf("*/</font></i>"); }
\n { printf("<br>\n"); }
%%
int main()
{ printf("<html><tt><b>\n"); yylex();
printf("</b></tt></html>"); return 0;
}
Example 7: A Scanner to Recognize Specific Tokens
%{
/*
* We expand upon the first example by adding
* recognition of some other parts of speech.
*/
%}
Example 7: A Scanner to Recognize Specific Tokens (cont.)
%%
[\t ]+ /* ignore white space */ ;
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go { printf("%s: is a verb\n", yytext); }
Example 7: A Scanner to Recognize Specific Tokens (cont.)
very |
simply |
gently |
quietly |
calmly |
angrily { printf("%s: is an adverb\n", yytext); }
to |
from |
behind |
above |
below |
between { printf("%s: is a preposition\n", yytext); }
Example 7: A Scanner to Recognize Specific Tokens (cont.)
if |
then |
and |
but |
or { printf("%s: is a conjunction\n", yytext); }
their |
my |
your |
his |
her |
its { printf("%s: is an adjective\n", yytext); }
Example 7: A Scanner to Recognize Specific Tokens (cont.)
I |
you |
he |
she |
we |
they { printf("%s: is a pronoun\n", yytext); }
[a-zA-Z]+ {
printf("%s: don't recognize, might be a noun\n", yytext);
}
.|\n { ECHO; /* normal default anyway */ }
%%
int main()
{
yylex();
return 0;
}
But What About Those Pesky C Comments?
Match with \/\*\/*(\**[^/*]\/*)*\**\*\/
Or with "/*""/"*("*"*[^/*]"/"*)*"*"*"*/"
But what if we want to process stuff inside a comment (like \n, for example)?
Do it by hand matching (Ex 2.23, pp. 87-88 and tiny.l)
Or use a flex feature, start conditions, that allows explicit state management
Final Example (flex documentation)
%x comment
%%
        int line_num = 1;
"/*"                    BEGIN(comment);
<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
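The same explicit state management can be written by hand in C, which may help in seeing what the start condition does. This is a simplified sketch rather than the code flex generates: the enum names and the function count_comment_newlines are assumptions made for this example, and it counts only the newlines that occur inside comments.

```c
#include <stddef.h>

enum scan_state { INITIAL_STATE, COMMENT_STATE };   /* names chosen for this sketch */

/*
 * Walks over src, entering COMMENT_STATE at each comment opener and
 * returning to INITIAL_STATE at each comment closer. Returns the number
 * of newlines seen while inside a comment.
 */
int count_comment_newlines(const char *src)
{
    enum scan_state state = INITIAL_STATE;
    int newlines = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        if (state == INITIAL_STATE) {
            if (src[i] == '/' && src[i + 1] == '*') {
                state = COMMENT_STATE;   /* like BEGIN(comment) */
                i++;                     /* skip the '*' of the opener */
            }
        } else {
            if (src[i] == '\n') {
                newlines++;              /* like ++line_num */
            } else if (src[i] == '*' && src[i + 1] == '/') {
                state = INITIAL_STATE;   /* like BEGIN(INITIAL) */
                i++;                     /* skip the '/' of the closer */
            }
        }
    }
    return newlines;
}
```

Because the state is an ordinary variable, text inside a comment can be handled character by character, which a single regular expression for the whole comment does not allow.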