some slides borrowed from m scherger - texas a&m ...sking/courses/compilers/slides/lex.pdflex...

Introduction to lex (or flex)

Some slides borrowed from M Scherger

Lex/Flex: A Scanner Generator in C

Fall 2012 Introduction to lex (or flex) 2

Regular Expression -> (Thompson’s Construction) -> Nondeterministic Finite Automaton -> (“Subset” Construction) -> Deterministic Finite Automaton -> Table-driven Scanner

So why not do this with a tool? Lex

Lex is one such tool for creating lexical analyzers

M. E. Lesk and E. Schmidt 1975

Lexical analyzers tokenize input streams

Regular expressions define tokens

Tokens are the terminals of a language

Converts regular expressions into DFAs

DFAs are implemented as table driven state machines

Some versions of Lex are proprietary, so not every *nix system ships with an open source version

flex (Fast Lexical Analyzer) is an open source version, written by Vern Paxson
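To make "DFAs are implemented as table driven state machines" concrete, here is a minimal hand-written sketch in C. It is not what lex actually emits; the pattern `[0-9]+(\.[0-9]+)?`, the state names, and the function `dfa_accepts` are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical DFA for the pattern [0-9]+(\.[0-9]+)? (illustration only). */
enum state  { S_START, S_INT, S_DOT, S_FRAC, S_DEAD, N_STATES };
enum cclass { C_DIGIT, C_DOT, C_OTHER, N_CLASSES };

/* Transition table: one row per state, one column per character class. */
static const enum state next_state[N_STATES][N_CLASSES] = {
    /*            digit    '.'     other  */
    /* START */ { S_INT,  S_DEAD, S_DEAD },
    /* INT   */ { S_INT,  S_DOT,  S_DEAD },
    /* DOT   */ { S_FRAC, S_DEAD, S_DEAD },
    /* FRAC  */ { S_FRAC, S_DEAD, S_DEAD },
    /* DEAD  */ { S_DEAD, S_DEAD, S_DEAD },
};

/* Accepting states: INT and FRAC. */
static const int accepting[N_STATES] = { 0, 1, 0, 1, 0 };

/* Map a character to its column in the table. */
static enum cclass classify(char c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c == '.')             return C_DOT;
    return C_OTHER;
}

/* Returns 1 if the whole string s is accepted by the DFA, 0 otherwise. */
int dfa_accepts(const char *s)
{
    assert(s != NULL);
    enum state st = S_START;
    for (; *s != '\0'; s++)
        st = next_state[st][classify(*s)];   /* one table lookup per character */
    return accepting[st];
}
```

The scanning loop never changes; only the table does, which is why a tool can generate the table from regular expressions.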

The Basic Process

Lex source program (any.l) -> Lex compiler -> lex.yy.c

lex.yy.c -> C compiler -> a.out

Input stream -> a.out -> Sequence of tokens

Format of a lex File

Definitions

%%

Rules

%%

User code

1st section holds declarations of simple name definitions and start conditions

2nd section holds pattern-action pairs

3rd section (C code and comments) is copied directly to lex.yy.c

Typical file extensions: .l .lex .flex

Compiling and Running

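> flex linenos.flex

> gcc lex.yy.c -lfl

> ./a.out < infile > outfile

yywrap() issue: linking with -lfl supplies a default yywrap(); without the library, define yywrap() yourself or add %option noyywrap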

Regular Expressions and Lex

A regular expression is an expression that matches sets of strings (the “language” of the regular expression).

In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations choice (|), concatenation (no operator), and repetition (*).

A regular expression may also contain certain other metasymbols:

parentheses for grouping (to change precedence, just as in arithmetic)

others as needed to extend the operator set in useful ways

Regular Expressions in Lex

c - c is a single character
    Matches the character c
\c – c is a single character
    Use this to escape special characters
"str" - str is a string
    Matches entire string str
[str] - str is a string
    Matches any single character from str

RE          Matches
A           A
x           x
d           d
\.          .
\n          Newline
\t          tab
"Abc"       Abc
"The"       The
[aeiou]     Lowercase vowels
[abcde]     The letters a to e

Regular Expressions – Character Classes

[x-y] – x and y are characters
    All characters in the range x-y
    These can be combined
[^str] – str is a string
    Matches any single character not in str

RE            Matches
[a-z]         All lowercase letters
[0-9]         All digits
[a-df-z]      Lowercase letters except e
[a-z0-9A-Z]   Alphanumeric characters
[A-Zaeiou]    Uppercase letters and lowercase vowels
[^ \n\t]      Any character that is not a space, newline, or tab
[^aeiou]      Any character but a lowercase vowel

Regular Expressions

p* – p is a pattern
    Zero or more occurrences of p
p+ – p is a pattern
    One or more occurrences of p

RE       Matches
A*       (empty string) A AA AAA ...
r*       (empty string) r rr ...
ab*c*    a ab ac abb abc acc abbb abbc abcc accc ...
A+       A AA AAA AAAA ...
ab+      ab abb abbb ...
a*b+     b ab bb aab abb bbb ...

Regular Expressions

p? - p is a pattern
    Zero or one occurrences of p
p{m,n} – p is a pattern, m and n are ints
    Matches m through n occurrences of p
    If ,n is omitted (p{m}), exactly m occurrences; if only n is omitted (p{m,}), m or more

RE       Matches
A?       (empty string) A
ab?c?    a ab ac abc
a{1,3}   a aa aaa
a{1,1}   a
a{1}     a
a{3,}    aaa aaaa aaaaa …
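The bounds in p{m,n} can be checked by simple counting. The sketch below handles only the case where p is a single character; the function name `matches_repeat` and the `UNBOUNDED` marker are assumptions made for this illustration, not part of lex.

```c
#include <assert.h>
#include <stddef.h>

#define UNBOUNDED (-1)   /* hypothetical marker for a missing upper bound, as in c{m,} */

/*
 * Returns 1 if s consists of between m and n copies of the character c
 * and nothing else, 0 otherwise. Pass UNBOUNDED for n to mean "m or more".
 */
int matches_repeat(const char *s, char c, int m, int n)
{
    int count = 0;
    assert(s != NULL && m >= 0);
    while (c != '\0' && s[count] == c)
        count++;
    if (s[count] != '\0')
        return 0;                            /* some other character is present */
    return count >= m && (n == UNBOUNDED || count <= n);
}
```

With this helper, a{1,3} corresponds to `matches_repeat(s, 'a', 1, 3)`, a{1} to `matches_repeat(s, 'a', 1, 1)`, and a{3,} to `matches_repeat(s, 'a', 3, UNBOUNDED)`.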

Regular Expressions

p1p2 – p1 and p2 are patterns
    Matches p1 followed by p2
(p) - p is a pattern
    Used to override precedence (group things)
p1|p2 – p1 and p2 are patterns
    Matches either p1 or p2

Notice precedence:

RE         Matches
ab         ab
a+b+       ab aab abb ...
(abc)+     abc abcabc abcabcabc …
abc+       abc abcc abccc …
a|an|the   a an the
ba|ed      ba ed
b(a|e)d    bed bad

Regular Expression - Extra Things

p1/p2 – p1 and p2 are patterns

Matches p1 only if it's followed by p2

p2 is not part of yytext

RE: a+/bc

Input: aaabc bc aaaad

Matches only the first aaa (yytext is aaa; the bc stays in the input)

^p – p is a pattern

matches p only if it is at the start of a line

p$ – p is a pattern

matches p only if it is at the end of a line
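The trailing-context operator can be pictured as "match, then peek ahead without consuming". Below is a rough hand-written sketch for the single pattern a+/bc; the function name `match_a_plus_before_bc` is hypothetical, and a real lex scanner does this inside its generated tables rather than with code like this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns the length of the a+ match at the start of s if it is
 * immediately followed by "bc", and 0 if the pattern does not match.
 * The returned length excludes "bc", just as yytext excludes p2.
 */
size_t match_a_plus_before_bc(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] == 'a')          /* consume the a+ part */
        n++;
    if (n == 0)
        return 0;                /* a+ needs at least one 'a' */
    /* Peek at the lookahead without adding it to the match. */
    return strncmp(s + n, "bc", 2) == 0 ? n : 0;
}
```

For the slide's input, only the first word yields a match, of length 3; the words bc and aaaad give 0.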

Two more complex examples

[-+]?[0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?

or:

nat = [0-9]+

signedNat = [-+]? nat

number = signedNat (\. nat)? ([Ee] signedNat)?

C comments:

/\*/*(\**[^/*]/*)*\**\*/
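As a cross-check of the number pattern, here is a small hand-coded recognizer that follows the nat / signedNat / number decomposition. The helper names are assumptions for this sketch; lex would generate a DFA from the pattern instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* nat = [0-9]+ : returns a pointer past the digits, or NULL if there are none. */
static const char *match_nat(const char *p)
{
    const char *start = p;
    while (isdigit((unsigned char)*p))
        p++;
    return p == start ? NULL : p;
}

/* signedNat = [-+]? nat */
static const char *match_signed_nat(const char *p)
{
    if (*p == '+' || *p == '-')
        p++;
    return match_nat(p);
}

/* number = signedNat (\. nat)? ([Ee] signedNat)? ; the whole string must match. */
int is_number(const char *s)
{
    const char *p;
    assert(s != NULL);
    p = match_signed_nat(s);
    if (p == NULL)
        return 0;
    if (*p == '.') {                      /* optional fraction */
        p = match_nat(p + 1);
        if (p == NULL)
            return 0;
    }
    if (*p == 'E' || *p == 'e') {         /* optional exponent */
        p = match_signed_nat(p + 1);
        if (p == NULL)
            return 0;
    }
    return *p == '\0';
}
```

Because the whole string must match, a dangling "." or "e" is rejected outright rather than treated as the start of an optional part.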

Pattern Matching Examples

Definitions

Definitions are of the form:

name definition

A name begins with a letter or underscore followed by 0 or more letters, digits, '-', or '_'.

You access it with {name}

Example definitions:

Digit [0-9]

Char [A-Z]

AlphaNum [a-zA-Z0-9]

ws [ \n\t]

IntegerConst [0-9]+
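The naming rule above is itself a small regular language, roughly [A-Za-z_][A-Za-z0-9_-]*. A sketch of a checker in C follows; `is_valid_name` is a hypothetical helper, and it assumes the default "C" locale so that `isalpha` and `isalnum` cover only ASCII letters and digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Returns 1 if name starts with a letter or underscore and continues
 * with zero or more letters, digits, '-' or '_'; returns 0 otherwise.
 */
int is_valid_name(const char *name)
{
    assert(name != NULL);
    if (!(isalpha((unsigned char)name[0]) || name[0] == '_'))
        return 0;                                  /* also rejects the empty string */
    for (size_t i = 1; name[i] != '\0'; i++) {
        unsigned char ch = (unsigned char)name[i];
        if (!(isalnum(ch) || ch == '-' || ch == '_'))
            return 0;
    }
    return 1;
}
```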

Definitions Example

Digit [0-9]
Char [a-zA-Z]
AlphaNum [a-zA-Z0-9]
%%
{Digit}+"."{Digit}+
({Char}|_)({AlphaNum}|[_-])* {printf("A name '%s'\n", yytext);}
%%

Rules

Rules are of the form:

pattern action

pattern is the RE to match and action is what to do when it is matched

Default rule is to echo the input

Lex matches the longest string possible

If there is a tie, the rule that appears first in the spec wins

Actions can be empty – do nothing

Actions can be complex

Use {} if multi-lined

don't forget ';'s

yytext contains the string matched
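The "longest match, then first rule" policy can be sketched by hand. In this illustration each rule is a C function that reports how many leading characters it matches; the two rules (the keyword if and the identifier pattern [a-z]+) and all names are assumptions, not lex internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of one scanning step: which rule won and how much it matched. */
struct match { int rule; size_t len; };

/* Each hypothetical "rule" returns the length of its match at the start of s. */
typedef size_t (*matcher)(const char *);

static size_t match_if(const char *s)        /* rule 0, pattern: if */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)     /* rule 1, pattern: [a-z]+ */
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

static const matcher rules[] = { match_if, match_ident };

/* Picks the rule with the longest match; rule is -1 if nothing matches. */
struct match pick_rule(const char *input)
{
    struct match best = { -1, 0 };
    assert(input != NULL);
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](input);
        if (n > best.len) {      /* strictly longer only, so ties keep the earlier rule */
            best.rule = (int)i;
            best.len = n;
        }
    }
    return best;
}
```

On the input if both rules match two characters, so rule 0 wins the tie; on iffy the identifier rule matches four characters and wins on length.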

Example Rules

\n linecount++;
[0-9]+ sum+=atoi(yytext);
{ws}+
a|an|the printf("found an article\n");
[aeiou]+ { printf("A string of vowels\n"); vcnt++; }

Predefined Rules

ECHO

Copy yytext to output

[a-z]+ ECHO;

REJECT

Go to the next alternative: the second-choice rule is selected and its action is taken

she s++;

he h++;

Won’t count the embedded he

she {s++; REJECT;}

he {h++; REJECT;} \n

But this will
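The effect of REJECT on the she/he counts can be imitated in plain C. This is only a simulation under the assumption that unmatched characters are skipped one at a time; `count_he` and its `use_reject` flag are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Counts occurrences of "he" the way the two rule sets would.
 * use_reject == 0: a matched "she" is consumed whole, hiding its embedded "he".
 * use_reject != 0: after counting, the scanner falls back to skipping a single
 *                  character, so the embedded "he" is seen on the next step.
 */
int count_he(const char *text, int use_reject)
{
    int he = 0;
    assert(text != NULL);
    while (*text != '\0') {
        size_t advance = 1;                      /* default: skip one character */
        if (strncmp(text, "she", 3) == 0) {      /* the longer match is tried first */
            if (!use_reject)
                advance = 3;
        } else if (strncmp(text, "he", 2) == 0) {
            he++;
            if (!use_reject)
                advance = 2;
        }
        text += advance;
    }
    return he;
}
```

For the input "she he" this gives 1 without REJECT and 2 with it.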

Rules Example

ex1.l The commands

%%
a*b printf("Token 1 found\n");
c+ printf("Token 2 found\n");
%%
int main(void) {
yylex();
return 0;
}

lex ex1.l produces lex.yy.c

cc -o ex1 lex.yy.c -ll creates the executable

May need -lfl if using flex

./ex1 to execute

aaaaaaabbccd

Token 1 found

Token 1 found

Token 2 found

d

Default is stdin and stdout, so type aaaaaaabbccd <return>

An Example Count chars, words, lines

%{
unsigned ccnt = 0, wcnt = 0, lcnt = 0;
%}
word [^ \t\n]+
eol \n
%%
{word} { wcnt++; ccnt += yyleng; }
{eol} { ccnt++; lcnt++; }
. ccnt++;
%%
int main(void) { yylex(); return 0; }

The %{ %} pair allows you to make declarations for your lexer

About lex

Lex uses some predefined functions stored in the lex library (link with -ll or -lfl)

By default lex copies input to output

By default lex reads stdin, writes stdout

Lex reads its input (a lex script) and produces lex.yy.c

Use %{ and %} in the definitions section to declare globals and put #includes

You can use flex instead

Not all 'lex'es are equal!

Man page has more info!

Example 1: The Simplest Example

The simplest example of a lex program is a scanner that acts like the UNIX `cat` program

%%

.|\n ECHO;

%%

Or it could be written as…

%%

. ECHO;

\n ECHO;

%%

Lex Predefined Variables

Flex Internal Names

Lex internal name      Meaning/Use
lex.yy.c or lexyy.c    Lex output file name
yylex                  Lex scanning routine
yytext                 string matched on current action
yyleng                 length of yytext
yyin                   Lex input file (default: stdin)
yyout                  Lex output file (default: stdout)
input                  Lex buffered input routine
ECHO                   Lex default action (print yytext to yyout)

See the Flex documentation for others

Flex Operational Conventions

yylex() runs until it is stopped by a return

ambiguity between matches of equal length is resolved by rule order

any text not explicitly matched is echoed to stdout

EOF is automatically matched and returns 0 from yylex() (unless yywrap() is suitably defined)

yylex() returns an int which can be a token

Example 2: wc

Here is a scanner that is similar to the UNIX `wc` command

%{

unsigned charCount = 0, wordCount = 0, lineCount = 0;

%}

%%

[^ \t\n]+ { wordCount++; charCount += yyleng; }

\n { charCount++; lineCount++; }

. charCount++;

%%

int main()

{

yylex();

printf("%u %u %u\n", charCount, wordCount, lineCount);

return 0;

}

Example 3: Line Numbers (p. 84)

%{

/* a Lex program that adds line numbers

to lines of stdin, printing to stdout */

#include <stdio.h>

int lineno = 1;

%}

line .*\n

%%

{line} { printf("%5d %s",lineno++,yytext); }

%%

int main(void)

{ yylex(); return 0; }

Example 4: (pp. 86-87)

%{/* Selects only lines that end or begin with the letter 'a'. */

#include <stdio.h>

%}

ends_with_a .*a\n

begins_with_a a.*\n

%%

{ends_with_a} ECHO;

{begins_with_a} ECHO;

.*\n ;

%%

int main(void)

{ yylex(); return 0; }

Example 5: wc again!

%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */
unsigned charCount = 0, wordCount = 0, lineCount = 0;
%}

word [^ \t\n]+

eol \n

%%

{word} { wordCount++; charCount += yyleng; }

{eol} { charCount++; lineCount++; }

. charCount++;

%%

int main(int argc,char *argv[])

{

if (argc > 1) {

FILE *file;

file = fopen(argv[1], "r");

if (!file) {

fprintf(stderr,"could not open %s\n",argv[1]);

exit(1);

}

yyin = file;

}

yylex();

printf("%u %u %u\n", charCount, wordCount, lineCount);

return 0;

}

Example 6: html (not in book)

%{/* a Lex program that produces html, making

all C comments italic */

#include <stdio.h>

%}

%%

"/*" { printf("<i><font color=\"blue\">/*"); }

"*/" { printf("*/</font></i>"); }

\n { printf("<br>\n"); }

%%

int main(void)

{ printf("<html><tt><b>\n"); yylex();

printf("</b></tt></html>"); return 0;

}

Example 7: A Scanner to Recognize Specific Tokens

%{

/*

* We expand upon the first example by adding

* recognition of some other parts of speech.

*/

%}

%%

[\t ]+ /* ignore white space */ ;

is |

am |

are |

were |

was |

be |

being |

been |

do |

does |

did |

will |

would |

should |

can |

could |

has |

have |

had |

go { printf("%s: is a verb\n", yytext); }

very |

simply |

gently |

quietly |

calmly |

angrily { printf("%s: is an adverb\n", yytext); }

to |

from |

behind |

above |

below |

between { printf("%s: is a preposition\n", yytext); }

if |

then |

and |

but |

or { printf("%s: is a conjunction\n", yytext); }

their |

my |

your |

his |

her |

its { printf("%s: is an adjective\n", yytext); }

I |

you |

he |

she |

we |

they { printf("%s: is a pronoun\n", yytext); }

[a-zA-Z]+ {

printf("%s: don't recognize, might be a noun\n", yytext);

}

.|\n { ECHO; /* normal default anyway */ }
%%
int main(void)
{
yylex();
return 0;
}

But What About Those Pesky C Comments?

Match with \/\*\/*(\**[^/*]\/*)*\**\*\/

Or with "/*""/"*("*"*[^/*]"/"*)*"*"*"*/"

But what if we want to process stuff inside a comment (like \n, for example)?

Do it by hand matching (Ex 2.23, pp. 87-88 and tiny.l)

Use a new feature of flex that allows explicit state management

Final Example (flex documentation)

%x comment

%%

        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
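For comparison, the same explicit state management can be written by hand. The sketch below mirrors the two start conditions with an enum and, like the flex rules, counts only newlines that occur inside comments. The function name `scan_comments` and the COMMENT enumerator are assumptions for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Two scanner states, standing in for flex's INITIAL and the exclusive "comment". */
enum start_condition { INITIAL, COMMENT };

/*
 * Walks src once and returns line_num: it starts at 1 and is incremented
 * for every newline seen while inside a C comment.
 */
int scan_comments(const char *src)
{
    enum start_condition state = INITIAL;
    int line_num = 1;
    assert(src != NULL);
    for (const char *p = src; *p != '\0'; p++) {
        if (state == INITIAL) {
            if (p[0] == '/' && p[1] == '*') {
                state = COMMENT;        /* like BEGIN(comment) */
                p++;                    /* consume the '*' too */
            }
        } else {
            if (*p == '\n') {
                ++line_num;
            } else if (p[0] == '*' && p[1] == '/') {
                state = INITIAL;        /* like BEGIN(INITIAL) */
                p++;                    /* consume the '/' too */
            }
        }
    }
    return line_num;
}
```

The hand-written loop needs an explicit state variable and lookahead; with start conditions, flex keeps that bookkeeping in the generated scanner.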

Beware

'\.' - matches '.' (tick period tick)

'.' - matches any single character between ticks, e.g. 'x' (tick anything tick)

"." - matches a period