
ICS312 LEX Set 25

Upload: clementine-patrick

Post on 17-Jan-2016



ICS312

LEX

Set 25


LEX

• Lex is a program that generates lexical analyzers

• Converting the source code into the symbols (tokens) is the work of the C program produced by Lex.

• This program serves as a subroutine of the C program produced by YACC for the parser


Lexical Analysis

• LEX takes as input a description of the tokens that can occur in the language.

• This description is made by means of regular expressions, as defined on the next slide. Regular expressions define patterns of characters.


Basics of Regular Expressions

1. Any character (or string of characters) matches itself, except for those characters (called metacharacters) which have a special interpretation, such as () [] {} + * ? | etc.

For instance the string “if” in a regular expression will match the identical string in the source code.


2. The period symbol “.” is used to match any single character in the source code except the new line indicator "\n".


3. Square brackets are used to define a character class. Either a sequence of symbols or a range denoted using the hyphen can be employed, e.g.:

[01a-z]

A character class matches a single symbol in the source code that is a member of the class.

For instance [01a-z] matches the character 0 or 1 or any lower case alphabetic character

 


4. The "+" symbol following a regular expression denotes 1 or more occurrences of that expression.

For instance [0-9]+ matches any non-empty sequence of digits in the source code.


Similarly:

5. A "*" following a regular expression denotes 0 or more occurrences of that expression.

6. A "?" following a regular expression denotes 0 or 1 occurrence of that expression.


7. The symbol “|”  is used as an OR operator to identify alternate choices.

For instance [a-z]+|9 matches either a string of one or more lower case letters or the digit “9”.


8. Parentheses can be freely used.

For example:

(a|b)+ matches, e.g., abba

while

a|b+ matches either a single a or a string of one or more b's.


9. Regular expressions can be concatenated

For instance: [a-zA-Z]*[0-9]+[a-zA-Z]

matches any sequence of 0 or more letters, followed by 1 or more digits, followed by 1 letter
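To make the concatenation example concrete, the pattern [a-zA-Z]*[0-9]+[a-zA-Z] can be checked by hand in C, one piece at a time. This is only a sketch for illustration, not what Lex generates; the function name matches_letters_digits_letter is made up here, and isalpha is assumed to behave as in the default "C" locale (letters a-z and A-Z only).

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the whole string s matches [a-zA-Z]*[0-9]+[a-zA-Z], else 0.
   (Hypothetical helper; assumes the default "C" locale.) */
int matches_letters_digits_letter(const char *s)
{
    size_t i = 0;

    /* [a-zA-Z]* : zero or more letters */
    while (isalpha((unsigned char)s[i]))
        i++;

    /* [0-9]+ : at least one digit */
    if (!isdigit((unsigned char)s[i]))
        return 0;
    while (isdigit((unsigned char)s[i]))
        i++;

    /* [a-zA-Z] : exactly one letter, and then the string must end */
    if (!isalpha((unsigned char)s[i]))
        return 0;
    i++;
    return s[i] == '\0';
}
```

For instance, "ab12c" and "7x" are accepted, while "abc" (no digits) and "12" (no final letter) are rejected.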


As has been shown, symbols such as +, *, ?, ., (, ), [, ] have special meanings in regular expressions.

10. If you want to include one of these symbols in a regular expression simply as a character, you can either use the C escape symbol “\” or double quotes.

For example: [0-9]"+"[0-9] or [0-9]\+[0-9] matches a digit followed by a plus sign, followed by a digit.


Examples

Given: R = ( abb | cd ) and S = abc

RS = ( abbabc | cdabc ) is a regular expression.

SR = ( abcabb | abccd ) is a regular expression.

The following strings are matched by R*:

abbcdcdcdcd
cdabbcdabbabbcd
abb
cd
cdcdcdcdcdcdcd

and so forth.
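Membership in R* for R = ( abb | cd ) can be tested with a short loop: keep stripping a leading abb or cd until nothing is left. The sketch below is hand-written for this one pattern (the name in_r_star is invented for the example). It works without backtracking only because abb and cd begin with different symbols.

```c
#include <string.h>

/* Returns 1 if s consists of zero or more copies of "abb" or "cd",
   i.e. if s is matched by (abb|cd)*. Hypothetical helper. */
int in_r_star(const char *s)
{
    while (*s != '\0') {
        if (strncmp(s, "abb", 3) == 0)
            s += 3;            /* consume one abb */
        else if (strncmp(s, "cd", 2) == 0)
            s += 2;            /* consume one cd */
        else
            return 0;          /* neither alternative fits here */
    }
    return 1;                  /* everything consumed (also true for "") */
}
```

Note that the empty string is accepted too, since * allows zero repetitions.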


What kinds of strings can be matched by the regular expression ( a | c )* b ( a | c )* ?

• ( a | c )* is a regular expression that can match the empty string, or any string containing only a's and c's.

• b is a regular expression that can match a single occurrence of the symbol "b".

• The second ( a | c )* is the same as the first regular expression.

• So, the entire expression ( a | c )* b ( a | c )* can match any string made up of a possibly empty string of a's and c's, followed by a single b, followed by a possibly empty string of a's and c's.

• In other words, the regular expression can match any string over the alphabet {a,b,c} that contains exactly one b.


What kinds of strings can be matched by the regular expression ( a | c )* ( b | ) ( a | c )* ?

• This is the same as the previous example, except that the regular expression in the center is now ( b | ).

• ( b | ) can match either an occurrence of a single b, or the empty string, which contains no characters.

• So the entire expression ( a | c )* ( b | ) ( a | c )* can match any string over the alphabet {a,b,c} that contains either 0 or 1 b's.
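Both of the expressions above boil down to counting b's in a string over {a,b,c}, which a few lines of C can mirror. This is an illustrative sketch only; the helper names count_b, matches_exactly_one_b and matches_at_most_one_b are invented for the example.

```c
/* Counts the b's in s. Returns -1 if s contains a symbol
   outside the alphabet {a,b,c}. Hypothetical helper. */
int count_b(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'b')
            n++;
        else if (*s != 'a' && *s != 'c')
            return -1;
    }
    return n;
}

/* ( a | c )* b ( a | c )* : exactly one b */
int matches_exactly_one_b(const char *s)
{
    return count_b(s) == 1;
}

/* ( a | c )* ( b | ) ( a | c )* : zero or one b */
int matches_at_most_one_b(const char *s)
{
    int n = count_b(s);
    return n == 0 || n == 1;
}
```

For example, "acbca" satisfies both checks, "ac" satisfies only the second, and "abb" satisfies neither.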


Precedence of Operations in Regular Expressions

From highest to lowest:

Closure (*)
Concatenation
Alternation ( OR )

Examples:

a | bcf means the symbol a OR the string bcf.

a( bcf* ) is the string abc followed by 0 or more repetitions of the symbol f. Note: this is the same as (abcf*).
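One way to see how the operators in these examples group is to try the patterns against sample strings. The sketch below uses the POSIX <regex.h> extended regular expression functions (regcomp, regexec, regfree), which are an assumption here: they are not part of Lex, but they treat *, concatenation and | the same way for these patterns. The helper name matches_whole is made up for the example.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if pattern matches the whole of text, 0 if it does not,
   and -1 if the pattern is too long or fails to compile.
   Hypothetical helper built on POSIX extended regular expressions. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Wrap the pattern in ^( )$ so the match must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || n >= (int)sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, a|bcf accepts "a" and "bcf" but not "acf", so | applies to the whole of bcf rather than to b alone. Likewise abcf* accepts "abc" and "abcfff" but not "abcfabcf", so * applies to f alone; (abcf)* is needed to repeat the whole string.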


GRAMMARS vs REGULAR EXPRESSIONS

Consider the set of strings (i.e., language) {a^n b a^n | n > 0}

A context-free grammar that generates this language is:

S -> a S a

S -> a b a

However, as we will show later, it is not possible to construct a regular expression that recognizes this language.

It’s not relevant to this course, but you may be interested to know that it is, in turn, not possible to construct a context-free grammar for a language whose definition is a simple extension of that given above: {a^n b a^n b^n a^n | n > 0}
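The difficulty with the first language, n a's followed by b followed by n a's, is that a recognizer has to remember how many a's came before the b. The hand-written C check below makes that counting explicit. It is a sketch for illustration (the name is_an_b_an is invented), not something Lex or YACC would generate.

```c
#include <stddef.h>

/* Returns 1 if s consists of n a's, one b, and n more a's, with n > 0.
   Hypothetical helper. */
int is_an_b_an(const char *s)
{
    size_t left = 0, right = 0;

    while (*s == 'a') {        /* count the a's before the b */
        left++;
        s++;
    }
    if (*s != 'b')
        return 0;
    s++;
    while (*s == 'a') {        /* count the a's after the b */
        right++;
        s++;
    }
    /* must be at the end, with matching non-zero counts */
    return *s == '\0' && left > 0 && left == right;
}
```

The two counters can grow without bound, which is exactly the kind of memory a regular expression does not have.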


In the Lex definition file one can assign macro names to regular expressions e.g.:

• digit 0|1|2|...|9 assigns the macro name digit to any one of the decimal digits

• integer {digit}+ assigns the macro name integer to 1 or more repetitions of digit

NOTE: When using a macro name as part of a regular expression, you need to enclose the name in curly braces {}.

• signed_int ("+"|"-")?{integer} assigns the macro name signed_int to an optional sign followed by an integer (the + is quoted because it is a metacharacter)

• number {signed_int}(\.{integer})?(E{signed_int})? assigns the macro name number to a signed_int followed by an optional fractional part followed by an optional exponent part
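The way these macros build on one another can be mirrored by small C functions that call each other, one per macro. This is a hand-written sketch, not Lex output; the function names match_integer, match_signed_int and match_number are invented here. Each returns the length of the longest prefix of s that fits the macro, or 0 if there is none.

```c
#include <ctype.h>
#include <stddef.h>

/* integer: one or more digits */
size_t match_integer(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* signed_int: an optional + or - followed by an integer */
size_t match_signed_int(const char *s)
{
    size_t sign = (s[0] == '+' || s[0] == '-') ? 1 : 0;
    size_t digits = match_integer(s + sign);
    return digits > 0 ? sign + digits : 0;
}

/* number: signed_int, optional ".integer", optional "E signed_int" */
size_t match_number(const char *s)
{
    size_t n = match_signed_int(s);
    size_t part;

    if (n == 0)
        return 0;
    /* optional fractional part: only taken if digits follow the dot */
    if (s[n] == '.' && (part = match_integer(s + n + 1)) > 0)
        n += 1 + part;
    /* optional exponent part: only taken if a signed_int follows the E */
    if (s[n] == 'E' && (part = match_signed_int(s + n + 1)) > 0)
        n += 1 + part;
    return n;
}
```

For example, match_number("-12.5E+3") covers all 8 characters, while match_number("3.") covers only the 3, because the fractional part requires at least one digit after the dot.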


• alpha [a-zA-Z]

assigns the macro name alpha to the character class given by a-z and A-Z

• identifier {alpha}({alpha}|{digit})*

assigns the macro name identifier to an alpha character followed by 0 or more repetitions of either an alpha character or a digit.


RULE

Using the regular expression for an identifier on the previous slide, what would be the first token of the following string?

MAX23= Z29 + 8

Lex picks as the "next" token the longest string that can be matched by one of its regular expressions. In this case, MAX23 would be matched as an identifier, not just M or MA or MAX.
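The longest-match rule for identifiers can be sketched in C as "keep consuming characters for as long as the pattern still fits". The function below is an illustration only (the name longest_identifier is invented, and the default "C" locale is assumed); a Lex-generated scanner reaches the same result with a table-driven automaton.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the longest identifier (a letter followed by
   letters or digits) at the start of s, or 0 if s does not start
   with a letter. Hypothetical helper. */
size_t longest_identifier(const char *s)
{
    size_t n;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    n = 1;
    while (isalnum((unsigned char)s[n]))   /* extend as far as possible */
        n++;
    return n;
}
```

Called on "MAX23= Z29 + 8" it returns 5, the length of MAX23; it stops at the = sign, which cannot be part of an identifier.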


An example of a Lex definition file

/* A standalone LEX program that counts identifiers and commas */
/* Definition Section */
%{
#include <stdio.h>   /* for printf */
int nident = 0;    /* # of identifiers in the file being scanned */
int ncomma = 0;    /* # of commas in the file */
%}

/* definitions of macro names */
digit   [0-9]
alph    [a-zA-Z]
%%
   /* Rules Section */
   /* patterns to recognize and the code to execute when they occur */
{alph}({alph}|{digit})*    {++nident;}
","                        {++ncomma;}
.                          ;
%%


An example of a scanner definition file (Cont.)

/* subroutine section */
/* the last part of the file contains user defined code, as shown here. */

int main(void)
{
    yylex();
    printf( "%s%d\n", "The no. of identifiers = ", nident);
    printf( "%s%d\n", "The no. of commas = ", ncomma);
    return 0;
}

/* LEX calls this function when the end of the input file is reached; */
/* returning 1 tells the scanner that there is no more input */
int yywrap(void) { return 1; }
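To see what the three rules of this scanner amount to, here is a hand-written C equivalent that works on a string instead of a file. It is a sketch under stated assumptions: the function name count_tokens is invented, the default "C" locale is assumed, and a real Lex-generated scanner is table-driven rather than written like this.

```c
#include <ctype.h>

/* Counts identifiers (a letter followed by letters or digits) and commas
   in s; every other character is skipped. Hypothetical helper. */
void count_tokens(const char *s, int *nident, int *ncomma)
{
    *nident = 0;
    *ncomma = 0;
    while (*s != '\0') {
        if (isalpha((unsigned char)*s)) {
            /* identifier rule: take the longest run of letters and digits */
            s++;
            while (isalnum((unsigned char)*s))
                s++;
            ++*nident;
        } else {
            if (*s == ',')          /* comma rule */
                ++*ncomma;
            s++;                    /* anything else is ignored */
        }
    }
}
```

On the input "x1, y2,z 42" this finds 3 identifiers and 2 commas; the digits of 42 are skipped one at a time because an identifier must begin with a letter.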


Generating the Parser Using YACC

•The structure of a grammar to be used with YACC for generating a parser is similar to that of LEX.  There is a definition section, a rules (productions) section, and a code section.


Example of an Input Grammar for YACC

%{
/* ARITH.Y Yacc input for an arithmetic expression evaluator */

#include <stdio.h> /* for printf */
#define YYSTYPE int
int yyparse(void);
int yylex(void);
void yyerror(char *mes);
%}

%token number

%%


Example of an Input Grammar for YACC (Cont.1)

program    : expression               {printf("answer = %d\n", $1);}
           ;

expression : expression '+' term      {$$ = $1 + $3;}
           | term
           ;

term       : term '*' number          {$$ = $1 * $3;}
           | number
           ;

%%


Example of an Input Grammar for YACC (Cont.2)

int main(void)
{
    printf("Enter an arithmetic expression\n");
    yyparse();
    return 0;
}

/* prints an error message */
void yyerror(char *mes) {printf("%s\n", mes);}
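To get a feel for what this grammar computes, the same rules can be evaluated by a small hand-written recursive-descent routine in C. Note that this is a different technique from the bottom-up parser YACC generates; it is swapped in here only because it fits in a few lines. The left-recursive rules become loops, all function names are invented for the sketch, and integer overflow is not checked.

```c
#include <ctype.h>

/* Skips blanks and tabs. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* number : one or more digits. Returns 1 on success and advances *p. */
static int parse_number(const char **p, int *value)
{
    const char *q = skip_blanks(*p);
    if (!isdigit((unsigned char)*q))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)*q))
        *value = *value * 10 + (*q++ - '0');
    *p = q;
    return 1;
}

/* term : term '*' number | number */
static int parse_term(const char **p, int *value)
{
    int rhs;
    if (!parse_number(p, value))
        return 0;
    for (;;) {
        const char *q = skip_blanks(*p);
        if (*q != '*')
            return 1;
        q++;
        if (!parse_number(&q, &rhs))
            return 0;
        *value *= rhs;              /* same as $$ = $1 * $3 */
        *p = q;
    }
}

/* expression : expression '+' term | term */
static int parse_expression(const char **p, int *value)
{
    int rhs;
    if (!parse_term(p, value))
        return 0;
    for (;;) {
        const char *q = skip_blanks(*p);
        if (*q != '+')
            return 1;
        q++;
        if (!parse_term(&q, &rhs))
            return 0;
        *value += rhs;              /* same as $$ = $1 + $3 */
        *p = q;
    }
}

/* program : expression, ended by a newline or the end of the string.
   Returns 1 and stores the result in *answer, or returns 0 on a syntax error. */
int evaluate(const char *text, int *answer)
{
    const char *p = text;
    if (!parse_expression(&p, answer))
        return 0;
    p = skip_blanks(p);
    return *p == '\0' || *p == '\n';
}
```

Because term sits below expression in the grammar, multiplication binds tighter than addition: evaluate("2+3*4", &v) succeeds with v equal to 14.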


The LEX scanner definition file for the arithmetic expressions grammar

%{
/* lexarith.l lex input for an arithmetic expression evaluator */
#include "y.tab.h"
#include <stdlib.h>   /* for atoi */
#define YYSTYPE int
extern YYSTYPE yylval;
%}
digit   [0-9]

%%
{digit}+     {yylval = atoi(yytext); return number; }
(" "|\t)+    ;
\n           {return(0);}   /* recognize Enter key as EOF */
.            {return yytext[0];}

%%
int yywrap(void) { return 1; }   /* 1 tells the scanner there is no more input */
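The contract between this scanner and the parser is simple: each call returns a token code, and for a number the value is passed on the side (through yylval in the real scanner). A hand-written C sketch of the same four rules is shown below. The names next_token and NUMBER and the code 258 are assumptions made for the example; in a real build the token code comes from y.tab.h.

```c
#include <ctype.h>

/* Hypothetical token code; YACC assigns the real value in y.tab.h. */
enum { NUMBER = 258 };

/* Returns the next token from *input and advances *input past it.
   For NUMBER, the value is stored in *lval (the role yylval plays).
   Returns 0 at a newline or at the end of the string. */
int next_token(const char **input, int *lval)
{
    const char *p = *input;
    int tok;

    while (*p == ' ' || *p == '\t')         /* blanks and tabs are skipped */
        p++;

    if (isdigit((unsigned char)*p)) {       /* one or more digits */
        *lval = 0;
        while (isdigit((unsigned char)*p))
            *lval = *lval * 10 + (*p++ - '0');
        tok = NUMBER;
    } else if (*p == '\n' || *p == '\0') {  /* end of the expression */
        tok = 0;
    } else {
        tok = (unsigned char)*p++;          /* any other character is its own token */
    }
    *input = p;
    return tok;
}
```

Scanning "12 + 3\n" with repeated calls yields NUMBER (value 12), then '+', then NUMBER (value 3), then 0.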