Compiler Tools Lex/Yacc – Flex & Bison

Post on 22-Dec-2015


Compiler Tools

Lex/Yacc – Flex & Bison

Compiler Front End (from Engineering a Compiler)

Scanner (Lexical Analyzer)

• Maps stream of characters into words, the basic unit of syntax: x = x + y ; becomes <id,x> <eq,=> <id,x> <plus_op,+> <id,y> <sc,;>
• The actual words are its lexeme
• Its part of speech (or syntactic category) is called its token type
• Scanner discards white space & (often) comments
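The mapping from characters to <token type, lexeme> pairs can be sketched by hand in C. This is only an illustration of what a scanner does, not how flex implements it; the names TokenType, Token and next_token are assumptions made up for this sketch, and it only knows the four token types used in the example above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token types for this sketch; the names follow the slide's example. */
typedef enum { TOK_ID, TOK_EQ, TOK_PLUS, TOK_SC, TOK_EOF, TOK_ERR } TokenType;

typedef struct {
    TokenType type;      /* part of speech */
    char lexeme[32];     /* the actual word (truncated if longer than 31 chars) */
} Token;

/* Reads the next token starting at *src and advances *src past it. */
Token next_token(const char **src)
{
    Token t = { TOK_EOF, "" };
    const char *p = *src;

    while (isspace((unsigned char)*p))
        p++;                                  /* discard white space */

    if (*p == '\0') {                         /* end of input */
        *src = p;
        return t;
    }

    if (isalpha((unsigned char)*p)) {         /* identifier: letter (letter|digit)* */
        size_t n = 0;
        while (isalnum((unsigned char)*p) && n < sizeof t.lexeme - 1)
            t.lexeme[n++] = *p++;
        t.lexeme[n] = '\0';
        t.type = TOK_ID;
    } else {                                  /* single-character tokens */
        t.lexeme[0] = *p;
        t.lexeme[1] = '\0';
        switch (*p++) {
        case '=': t.type = TOK_EQ;   break;
        case '+': t.type = TOK_PLUS; break;
        case ';': t.type = TOK_SC;   break;
        default:  t.type = TOK_ERR;  break;
        }
    }
    *src = p;
    return t;
}
```

Calling next_token repeatedly on "x = x + y ;" yields the id, eq, id, plus_op, id, sc sequence from the slide, followed by TOK_EOF.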

[Diagram: source code → Scanner → tokens → Parser → intermediate representation (IR); both Scanner and Parser report errors]

Speed is an issue in scanning

use a specialized recognizer

The Front End (from Engineering a Compiler)

Parser

• Checks stream of classified words (parts of speech) for grammatical correctness
• Determines if code is syntactically well-formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code

Parsing is harder than scanning. Better to put more rules in scanner (whitespace etc).


The Big Picture

• Language syntax is specified with parts of speech, not words

• Syntax checking matches parts of speech against a grammar

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op → +
7.    | -

S = goal

T = { number, id, +, - }

N = { goal, expr, term, op }

P = { 1, 2, 3, 4, 5, 6, 7}

No words here!

Parts of speech, not words!

Why study lexical analysis?

• We want to avoid writing scanners by hand
• Finite automata are used in other applications: grep, website filtering, various “find” commands

Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies

[Diagram: specifications → Scanner Generator → tables or code → Scanner; the Scanner turns source code into parts of speech & words]

Specifications written as “regular expressions”

Represent words as indices into a global table

Finite Automata

Formally a finite automaton is a five-tuple (S, Σ, δ, s0, SF) where

• S is the set of states, including the error state se. S must be finite.

• Σ is the alphabet or character set used by the recognizer. Typically the union of edge labels (transitions between states).

• δ(s,c) is a function that encodes transitions (i.e., on character c in Σ, the recognizer moves from state s to another state in S).

• s0 is the designated start state

• SF is the set of final states, drawn with double circles in the transition diagram

Finite Automata

Finite automata to recognize fee and fie:

S = {s0, s1, s2, s3, s4, s5, se}

Σ = {f, e, i}

δ(s,c) = set of transitions shown in the transition diagram

s0 = s0

SF = {s3, s5}

Set of words accepted by a finite automaton F forms a language L(F). Can also be described by regular expressions.

[Transition diagram: s0 -f-> s1; s1 -e-> s2 -e-> s3; s1 -i-> s4 -e-> s5; s3 and s5 are final states]
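A table-driven recognizer for fee and fie might look like the following C sketch. The transition table is an assumption reconstructed from the edge labels and final states listed on the slide (s0 -f-> s1, s1 -e-> s2 -e-> s3, s1 -i-> s4 -e-> s5); the names column, delta and accepts are made up for this example.

```c
#include <stddef.h>

/* States of the automaton; SE is the error state. */
enum { S0, S1, S2, S3, S4, S5, SE, NUM_STATES };

/* Maps a character to its column in the table; -1 for anything outside {f, e, i}. */
static int column(char c)
{
    switch (c) {
    case 'f': return 0;
    case 'e': return 1;
    case 'i': return 2;
    default:  return -1;
    }
}

/* delta[state][column]: the transition function as a table (assumed from the slide). */
static const int delta[NUM_STATES][3] = {
    /*          f   e   i  */
    /* S0 */ { S1, SE, SE },
    /* S1 */ { SE, S2, S4 },
    /* S2 */ { SE, S3, SE },
    /* S3 */ { SE, SE, SE },
    /* S4 */ { SE, S5, SE },
    /* S5 */ { SE, SE, SE },
    /* SE */ { SE, SE, SE },
};

/* Returns 1 if word is in the language {fee, fie}, 0 otherwise. */
int accepts(const char *word)
{
    int state = S0;                           /* start state */
    for (const char *p = word; *p != '\0'; p++) {
        int col = column(*p);
        state = (col < 0) ? SE : delta[state][col];
    }
    return state == S3 || state == S5;        /* final states */
}
```

Once the recognizer reaches SE it can never leave, which is why every entry in the SE row points back to SE.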

Finite Automata Quick Exercise

Draw a finite automaton that can recognize CU | CSU | CSM | DU (the fee/fie drawing is repeated here for reference)

[Transition diagram for fee and fie, repeated from the previous slide]

Regular Expressions in Lex*

The characters that form regular expressions include:

. matches any single character except newline

* matches zero or more copies of preceding expression

[] a character class that matches any character within the brackets. If first character is ^ will match any character except those within brackets. A dash can be used for character range, e.g., [0-4] is equivalent to [01234]. more in book…

^ matches beginning of line as first character of expression (also negation within [], as listed above).

$ matches end of line as last character of expression

{} indicates how many times previous pattern is allowed to match, e.g., A{1,3} matches one to three occurrences of A.

\ used to escape metacharacters, e.g., \* is literal asterisk, \" is a literal quote, \{ is literal open brace, etc.

* from lex & yacc by Levine, Mason & Brown

Regular Expressions, continued

+ matches one or more occurrences of preceding expression, e.g., [0-9]+ matches "1" "11" or "1234" but not empty string

? matches zero or one occurrence of preceding expression, e.g., -?[0-9]+ matches signed number with optional leading minus sign

| matches either preceding or following expression, e.g., cow|pig|sheep matches any of the three words

"…" interprets everything inside quotation marks literally

/ matches preceding expression only if followed by following expression, e.g., 0/1 matches "0" in "01" but not in "02". Material in pattern following the / is not "consumed"

() groups a series of regular expressions into a new regular expression, e.g., (01) becomes character sequence 01. Useful when building up complex patterns with *, + and |.

Regular Expression Examples

digit: [0-9]

int with at least 1 digit: [0-9]+

int that can have 0 digits: [0-9]*

What about float?

[0-9]*\.[0-9]+ // literal ., at least 1 digit after . – what about 0 or 2?

([0-9]+)|([0-9]*\.[0-9]+) // combine int and float, notice use of (), what about unary -? (no space around |: in lex, unquoted white space ends the pattern)

-?(([0-9]+)|([0-9]*\.[0-9]+))
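To try the final pattern outside of lex, one option is the POSIX <regex.h> interface, sketched below. This is an assumption about tooling, not part of the slides: POSIX extended regular expressions share this syntax with lex for the operators used here, but the ^ and $ anchors are added so the whole string must match, whereas lex matches the longest prefix of its input. The function name is_number is made up for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole string is a signed int or float per the slide's pattern, else 0. */
int is_number(const char *s)
{
    regex_t re;
    /* Same pattern as on the slide, anchored at both ends. */
    const char *pattern = "^-?(([0-9]+)|([0-9]*\\.[0-9]+))$";
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                             /* pattern failed to compile */

    ok = (regexec(&re, s, 0, NULL, 0) == 0);  /* 0 means "matched" */
    regfree(&re);
    return ok;
}
```

With this pattern "42", "-3.14" and ".5" are accepted, while "3." is rejected because at least one digit must follow the decimal point.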

More Regular Expression Examples

What’s a regular expression for matching quotes?

\".*\" won’t work for lines like "mine" and "yours" because lex matches the largest possible pattern.

\"[^"\n]*["\n] will work by excluding " (forces lex to stop as soon as " is reached). The \n keeps a quoted string from exceeding one line.

Flex – Fast Lexical Analyzer

[Diagram: regular expressions & C-code rules → FLEX → lex.yy.c (contains yylex()) → compile → executable scanner (program to recognize patterns in text) that analyzes and executes input]

Here’s where we’ll put the regular expressions to good use!

Flex input file

3 sections:

definitions
%%
rules
%%
user code

Definition Section Examples

name definition

DIGIT [0-9]
ID [a-z][a-z0-9]*

A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to: ([0-9])+"."([0-9])*

C Code

Can include C-code in definitions

%{
/* This is a comment inside the definition */
#include <math.h> // may need headers
%}

Rules

The rules section of the flex input contains a series of rules of the form: pattern action

In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the output (with the %{ %}'s removed). The %{ %}'s must appear unindented on lines by themselves.

Example: Simple Pascal-like recognizer

Definitions section:

/* scanner for a toy Pascal-like language */
%{
/* need for the call to atof() below */
#include <math.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*

• %{ and %}: remember these are on a line by themselves, unindented!
• The lines between %{ and %} are inserted as-is into the resulting code
• DIGIT and ID are definitions that can be used in the rules section

Example continued

Rules section:

%%

{DIGIT}+ { printf("An integer: %s (%d)\n", yytext, atoi(yytext ));}

{DIGIT}+"."{DIGIT}* {printf("A float: %s (%g)\n", yytext, atof(yytext));}

if|then|begin|end|procedure|function {printf("A keyword: %s\n", yytext);}

{ID} { printf( "An identifier: %s\n", yytext ); }

"+"|"-"|"*"|"/" { printf( "An operator: %s\n", yytext ); }

"{"[^}\n]*"}" /* eat up one-line comments */

[ \t\n]+ /* eat up whitespace */

. { printf( "Unrecognized character: %s\n", yytext ); }

Each rule is a pattern followed by an action; yytext is the text that matched the pattern (a char*)

Example continued

User code (required for flex, in library for lex)

%%
int main(int argc, char **argv)
{
    ++argv, --argc; /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" ); /* yyin: lex input file */
    else
        yyin = stdin;
    yylex(); /* yylex: lexer function produced by lex */
}

Flex exercise #1

1. Download pascal.l

2. From a command prompt (Start->Run->cmd):

flex -opascal.c -L pascal.l

NOTE: without the -o option the output file will be called lex.yy.c. The -L option suppresses #line directives that cause problems with some compilers (e.g. DevC++)

3. Compile and execute pascal.c (batch on Blackboard)

gcc -opascal.exe -Lc:\progra~1\gnuwin32\lib pascal.c -lfl -ly

4. Execute program. Type in digits, ids, keywords etc. End program with Ctrl-Z

Flex exercise #2

• Copy words.l (from lex & yacc)
• Use flex then compile and execute
• What does it do?
• Extend the example with 1 new part of speech.
• Recognize lexemes R0-R9 as register names
• Recognize complex numbers, including for example -3+4i, +5-6i, +7i, 8i, -12i, but not 3++4i (hint: print newline before displaying your complex number, lexer may display 3+ and then recognize +4i)

Lex techniques

Hardcoding lists of words in the rules is not very effective. Scanners often use a symbol table instead. There is an example in lex & yacc, not covered in class, but see me if you’re interested.
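As a rough illustration of the idea, the keyword alternatives from the Pascal-like example could live in a table that an action consults after matching a generic identifier pattern. The sketch below is not the symbol table from lex & yacc; it is a simpler static lookup using the standard bsearch, and the names keywords, compare_words and is_keyword are assumptions made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Keyword table; must stay sorted alphabetically for bsearch to work. */
static const char *const keywords[] = {
    "begin", "end", "function", "if", "procedure", "then"
};

/* Comparison callback: key is a C string, elem points at a table entry. */
static int compare_words(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if word is in the keyword table, 0 otherwise. */
int is_keyword(const char *word)
{
    return bsearch(word, keywords,
                   sizeof keywords / sizeof keywords[0],
                   sizeof keywords[0],
                   compare_words) != NULL;
}
```

With a helper like this, a single {ID} rule could call is_keyword(yytext) in its action and report either a keyword or an identifier, so adding a keyword means editing the table rather than the pattern.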

And now…

Let’s continue with chapter 4!

Bison – like Yacc (yet another compiler compiler)

[Diagram: context-free grammar in BNF form, LALR(1)* → Bison → Bison parser (C program) that groups tokens according to grammar rules]

Bison parser provides yyparse

You must provide:
• the lexical analyzer (e.g., flex)
• an error-handling routine named yyerror
• a main routine that calls yyparse

*LALR: Look-Ahead LR (left-to-right scan of the input, rightmost derivation)

Bison Parser

Same sections as flex (yacc came first): definitions, rules, C-Code

Bison Parser – Definition Section

Tokens used in grammar, values used on parser stack, may include C code within %{ %}

Single quoted characters can be used as tokens without declaring them, e.g., '+', '=' etc.

List tokens, Bison will create a header with defines (when run with -d)

%token NAME NUMBER

YYSTYPE determines the data type of the values returned by the lexer. If lexer returns different types depending on what is read, include a union:

%union {
    char cval;
    char *sval;
    int ival;
}

Types declared in union can be used to specify types for tokens and also for non-terminals

%token <ival> NUMBER
%type <sval> bibKey

Bison Parser – Rule Section

Use : between lhs and rhs, place ; at end.

statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;

expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
    | NUMBER '-' NUMBER { $$ = $1 - $3; }
    | NUMBER { $$ = $1; }
    ;

Unlike flex, bison doesn’t care about line boundaries, so add white space for readability

Symbol on lhs of first rule is start symbol, can override with %start declaration in definition section

$1, $3 refer to RHS values. $$ sets value of LHS.

In expression, $$ = $1 + $3 means it sets the value of lhs (expression) to NUMBER ($1) + NUMBER ($3)

More on Symbol Values and Actions

Symbols in bison have values. The YYSTYPE typedef contains the value types. Default for all values is int

A rule's action is executed when the parser reduces that rule (will have recognized both NUMBER symbols, lexer should have returned a value via yylval).

expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
    | NUMBER '-' NUMBER { $$ = $1 - $3; }
    ;

More on Symbol Values and Actions

Example to return int value:

[0-9]+ { yylval = atoi(yytext); return NUMBER; }

• yylval = atoi(yytext) sets value for use in actions. This one just returns the numeric value of the string stored in yytext
• return NUMBER returns recognized token

In prior examples we just returned tokens, not values

Bison Parser – C Section

At a minimum, provide yyerror and main routines

void yyerror(char *errmsg)
{
    fprintf(stderr, "%s\n", errmsg);
}

int main(void)
{
    yyparse();
    return 0;
}

Bison Intro Exercise

Download simpleCalc.y and simpleCalc.l

Create calculator program:

bison -d simpleCalc.y
flex -L -osimpleCalc.c simpleCalc.l
gcc -c simpleCalc.c
gcc -c simpleCalc.tab.c
gcc -Lc:\progra~1\gnuwin32\lib simpleCalc.o simpleCalc.tab.o -osimpleCalc.exe -lfl -ly

As a convenience, you can use the batch file mbison.bat instead of typing all the above: mbison simpleCalc

Test with valid sentences (e.g., 3+6-4) and invalid sentences.

Understanding simpleCalc

simpleCalc.l:

%{
#include "simpleCalc.tab.h"
extern int yylval;
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
%%
/*---------------------------------------*/
/* 5. Other C code that we need. */
void yyerror(char *errmsg)
{
    fprintf(stderr, "%s\n", errmsg);
}

int main(void)
{
    yyparse();
    return 0;
}

simpleCalc.tab.h:

#ifndef YYTOKENTYPE
# define YYTOKENTYPE
/* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */
enum yytokentype { NAME = 258, NUMBER = 259 };
#endif
/* Tokens. */
#define NAME 258
#define NUMBER 259

Explanation: When the lexer recognizes a number [0-9]+ it returns the token NUMBER and sets yylval to the corresponding integer value. When the lexer sees a newline (carriage return) it returns 0. If it sees a space or tab it ignores it. When it sees any other character it returns that character (the first character in the yytext buffer). If yyparse recognizes it, good! Otherwise the parser can generate an error.
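The contract between lexer and parser described above can also be written out by hand. The following C sketch mimics the four rules of simpleCalc.l without flex; it reads from a string rather than yyin, and the names calc_lex, calc_lval and calc_set_input are stand-ins invented for this example (the real generated function is yylex and the real variable is yylval).

```c
#include <ctype.h>

#define NUMBER 259                     /* same token code as in simpleCalc.tab.h */

int calc_lval;                         /* stands in for yylval */
static const char *calc_input = "";    /* stands in for yyin (a string, not a FILE) */

/* Points the hand-written lexer at a new input string. */
void calc_set_input(const char *text)
{
    calc_input = text;
}

/* Returns the next token code: NUMBER, a literal character, or 0 at end of line. */
int calc_lex(void)
{
    while (*calc_input == ' ' || *calc_input == '\t')
        calc_input++;                          /* rule 2: ignore white space */

    if (*calc_input == '\0' || *calc_input == '\n')
        return 0;                              /* rule 3: logical EOF */

    if (isdigit((unsigned char)*calc_input)) { /* rule 1: [0-9]+ */
        calc_lval = 0;
        while (isdigit((unsigned char)*calc_input))
            calc_lval = calc_lval * 10 + (*calc_input++ - '0');
        return NUMBER;
    }

    return (unsigned char)*calc_input++;       /* rule 4: any other character */
}
```

Token codes for named tokens start above 255 precisely so that single characters such as '+' can be returned as their own codes without clashing.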

Understanding simpleCalc, continued

%token NAME NUMBER
%%
statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;

expression: expression '+' NUMBER { $$ = $1 + $3; }
    | expression '-' NUMBER { $$ = $1 - $3; }
    | NUMBER { $$ = $1; }
    ;

Explanation: When you execute simpleCalc and type an expression such as 1+2, the main program calls yyparse. This calls lex to recognize 1 as a NUMBER (puts 1 in yylval), calls lex which returns +, calls lex to recognize 2 as a NUMBER. By this point the first NUMBER has already been reduced to an expression (third alternative), so the parser will recognize expression + NUMBER and “reduce” this rule, meaning it does the action {$$ = $1 + $3}. It then recognizes expression as a statement, so it does the printf action.
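The effect of these rules can be imitated with a small hand-written evaluator. Note that this swaps in a different technique: Bison generates a table-driven LALR(1) parser, while the sketch below is a plain loop that happens to accept the same left-recursive expression grammar and performs the same actions. The names skip_ws, parse_number and eval_expression are assumptions made up for this example.

```c
#include <ctype.h>

/* Skips blanks and tabs, like the lexer's white-space rule. */
static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Parses a NUMBER at *pp; on success stores its value, advances *pp, returns 1. */
static int parse_number(const char **pp, int *value)
{
    const char *p = skip_ws(*pp);
    if (!isdigit((unsigned char)*p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)*p))
        *value = *value * 10 + (*p++ - '0');
    *pp = p;
    return 1;
}

/* Returns 1 and stores the result if text is a valid expression, 0 on a syntax error. */
int eval_expression(const char *text, int *result)
{
    const char *p = text;
    int acc, rhs;

    if (!parse_number(&p, &acc))               /* expression: NUMBER { $$ = $1; } */
        return 0;

    for (;;) {
        p = skip_ws(p);
        if (*p != '+' && *p != '-')
            break;
        char op = *p++;
        if (!parse_number(&p, &rhs))           /* operator must be followed by NUMBER */
            return 0;
        acc = (op == '+') ? acc + rhs          /* { $$ = $1 + $3; } */
                          : acc - rhs;         /* { $$ = $1 - $3; } */
    }

    if (*p != '\0' && *p != '\n')              /* anything left over is an error */
        return 0;
    *result = acc;
    return 1;
}
```

The running value acc plays the role of the expression on the left of each reduction, which is why 3+6-4 is evaluated left to right as (3+6)-4.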

Even more detail (if you’re curious)

Running flex creates simpleCalc.c. This creates the following case statement (I added the printf statements):

case 1:
YY_RULE_SETUP
printf("returning number value %d\n", atoi(yytext));
{ yylval = atoi(yytext); return NUMBER; }
    YY_BREAK
case 2:
YY_RULE_SETUP
printf("ignoring white space\n");
; /* ignore white space */
    YY_BREAK
case 3:
YY_RULE_SETUP
printf("recognized eof\n");
return 0; /* logical EOF */
    YY_BREAK
case 4:
YY_RULE_SETUP
printf("returning other character %c\n", yytext[0]);
return yytext[0];
    YY_BREAK

Continuing more detail

Running bison creates simpleCalc.tab.c

switch (yyn)
  {
    case 3:
#line 4 "simpleCalc.y"
    { printf("= %d\n", (yyvsp[0])); ;}
    break;

    case 4:
#line 7 "simpleCalc.y"
    { (yyval) = (yyvsp[-2]) + (yyvsp[0]); ;}
    break;

    case 5:
#line 8 "simpleCalc.y"
    { (yyval) = (yyvsp[-2]) - (yyvsp[0]); ;}
    break;

    case 6:
#line 9 "simpleCalc.y"
    { (yyval) = (yyvsp[0]); ;}
    break;

NOTE: I added extra printf statements to each case, which is what you can see in the trace.

Notice use of the stack pointer (yyvsp) for $ values
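How $1 and $3 turn into yyvsp[-2] and yyvsp[0] can be seen with a toy value stack. This is a simplified sketch, not Bison's actual code: a real parser also keeps a parallel state stack and checks for overflow, and the names ValueStack, vs_init, vs_shift and vs_reduce_add are invented for this example.

```c
/* Toy value stack; top is the index of the top value, -1 when empty. */
typedef struct {
    int values[16];      /* no overflow checking in this sketch */
    int top;
} ValueStack;

void vs_init(ValueStack *s)
{
    s->top = -1;
}

/* Shift: push the semantic value of the symbol just read. */
void vs_shift(ValueStack *s, int value)
{
    s->values[++s->top] = value;
}

/* Reduce by expression: expression '+' NUMBER { $$ = $1 + $3; } */
void vs_reduce_add(ValueStack *s)
{
    int *vsp = &s->values[s->top];   /* plays the role of yyvsp */
    int val = vsp[-2] + vsp[0];      /* $1 is vsp[-2], $2 is vsp[-1], $3 is vsp[0] */

    s->top -= 3;                     /* pop the three right-hand-side symbols */
    vs_shift(s, val);                /* push $$ as the value of the left-hand side */
}
```

Because the top of the stack holds the last symbol of the rule, a rule with n right-hand-side symbols finds $k at offset k - n from the stack pointer, which gives -2 and 0 for $1 and $3 here.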

Continuing more detail

In exercise 2 you define a union. This gets translated to code within simpleCalc.tab.h:

#if ! defined (YYSTYPE) && ! defined (YYSTYPE_IS_DECLARED)

#line 1 "simpleCalcEx2.y"

typedef union YYSTYPE {

float fval;

int ival;

} YYSTYPE;

extern YYSTYPE yylval;

This is what makes your yylval return part of the union
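A small C sketch of how a lexer action might fill such a union is shown below. It is only an illustration under assumptions: SemValue mirrors the YYSTYPE union above, the token codes are hypothetical, and classify_number is a made-up helper that decides between the two members by looking for a decimal point.

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the generated YYSTYPE union: one storage slot, two interpretations. */
typedef union {
    float fval;
    int ival;
} SemValue;

/* Hypothetical token codes for this sketch (real ones come from the .tab.h file). */
enum { NUMBER = 258, FNUMBER = 259 };

/* Stores the value of text in the matching union member and returns its token code. */
int classify_number(const char *text, SemValue *out)
{
    if (strchr(text, '.') != NULL) {
        out->fval = (float)atof(text);   /* floating point literal */
        return FNUMBER;
    }
    out->ival = atoi(text);              /* integer literal */
    return NUMBER;
}
```

Only the member that was written is meaningful afterwards, which is why the grammar has to declare the member for each token and non-terminal with %token <...> and %type <...>.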

Continuing more detail

Symbols you define in bison’s CFG are added to a symbol table:

static const char *const yytname[] ={ "$end", "error", "$undefined", "NUMBER", "FNUMBER", "NAME", "'='",

"'+'", "'*'", "'('", "')'", "$accept", "statement", "expression", "term",

"factor", 0};

Continuing more detail

New rules make use of union:

switch (yyn)
  {
    case 3:
#line 15 "simpleCalcEx2.y"
    { printf("= %f\n", (yyvsp[0].fval)); ;}
    break;

    case 4:
#line 18 "simpleCalcEx2.y"
    { (yyval.fval) = (yyvsp[-2].fval) + (yyvsp[0].fval); ;}
    break;

    case 5:
#line 19 "simpleCalcEx2.y"
    { (yyval.fval) = (yyvsp[0].fval); ;}
    break;

expression is defined as <fval>, so is NUMBER

Bison Exercise #1

Change simpleCalc to handle + and * with correct precedence using the grammar with terms and factors presented in chapter 4 of text:

Expr -> Expr + Term
      | Term
Term -> Term * Factor
      | Factor
Factor -> (Expr)
        | NUMBER

(changed id to NUMBER for simplicity)

Bison Exercise #2

Change simpleCalc.l to accept floating point values OR integers.

• Remove extern int yylval; (yylval is no longer simply an int)
• Modify simpleCalc.tab.h if you change the name of your file.
• Use atof for floating point value
• You will create a union in simpleCalc.y. Use the name of that union in simpleCalc.l, for example yylval.ival = atoi(yytext); would be used to set a named union of ival to an integer value.

Change simpleCalc.y to accept floating point values.

• Create a union, example:

%union {
    float fval;
    int ival;
}

• Add %token statements for every token and %type statements for your non-terminals, for example:

%token <ival> NUMBER
%type <fval> expression

• Update factor to accept NUMBER or a floating point type of number (e.g., FNUMBER)
• The printf in statement needs to print a floating point value (printf("= %f\n", $1);)

Bison Exercise #3

Update simpleCalc to accept statements like @myVar = 3.4*4

Output will be: myVar = 13.6

Purpose:

• adding another type to union (char*). I called this sval.
• using a C-function as part of lexer to preprocess yytext before setting yylval.

Steps in simpleCalc.l:

• Add prototype for a function named extract_name. The parameter to this function is a char* (you will pass in yytext). You can either return a char* or just modify the parameter, since it’s an array. Prototype is in declaration section.
• Add function extract_name to C section. This function will just remove the @ from the front of the variable name. HINT: remember that C strings end in '\0'.
• You can modify this string in place, but for more extensive processing you might need to create your own C strings. You can use malloc, strdup and free in such a case.
• When you have recognized a variable (@ followed by upper or lower case letters, in our simple example), you will set yylval.sval = extract_name(yytext);

Steps in simpleCalc.y:

• Be sure you still have NAME = expression in your grammar, and add an action so it prints both the variable and the expression result.
• Declare NAME as a token of type <sval> (or whatever name you used in your union)

Bison Exercise #4

Modify simpleCalc.l so that it accepts input from a file. The last slide contains a main method that will read from a file.

Create a small input file with a single line of input, something like: @myVar = 8+3*2.5+6

Summary of steps (from online manual)

The actual language-design process using Bison, from grammar specification to a working compiler or interpreter, has these parts:

1. Formally specify the grammar in a form recognized by Bison (i.e., machine-readable BNF). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements.

2. Write a lexical analyzer to process input and pass tokens to the parser.

3. Write a controlling function (main) that calls the Bison-produced parser.

4. Write error-reporting routines.

Using files with Bison

The lexer that feeds the Bison parser reads its input from the file yyin (stdin by default). The following code can be used to take an optional command-line argument:

int main(int argc, char **argv)
{
    FILE *file;
    if (argc == 2) {
        file = fopen(argv[1], "r");
        if (!file) {
            fprintf(stderr, "Couldn't open %s\n", argv[1]);
            exit(1);
        }
        yyin = file;
    }