CS252: Systems Programming Ninghui Li Topic 4: Regular Expressions and Lexical Analysis

Upload: devon-twiford

Post on 14-Dec-2015

CS252: Systems Programming

Ninghui Li

Topic 4: Regular Expressions and Lexical Analysis

slide 2

Compiler Frontend Steps

Lexical analyzer/scanner

convert sequence of characters to sequence of tokens

(inc 13) becomes 4 tokens: (, inc, 13, and )

Parser/syntactic analysis

analyze a sequence of tokens to create/determine the grammatical structure
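The scanner step can be illustrated with a small hand-written tokenizer. This is only a sketch: the token kinds, the `struct token` layout, and the `tokenize` function are hypothetical names chosen for illustration, not the lab's definitions (the lab generates its scanner from fiz.l instead).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not the lab's definitions. */
enum token_kind { TOK_OPENPAR, TOK_CLOSEPAR, TOK_WORD, TOK_NUMBER };

struct token {
    enum token_kind kind;
    char text[32];              /* the matched characters (the lexeme) */
};

/* Split `input` into at most `max` tokens.
 * Returns the number of tokens found, or -1 on an unrecognized character,
 * an over-long lexeme, or too many tokens. */
int tokenize(const char *input, struct token *out, int max)
{
    int count = 0;
    const char *p = input;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {   /* whitespace only separates tokens */
            p++;
            continue;
        }
        if (count == max)
            return -1;

        struct token *t = &out[count];
        size_t len = 0;

        if (*p == '(' || *p == ')') {
            t->kind = (*p == '(') ? TOK_OPENPAR : TOK_CLOSEPAR;
            len = 1;
        } else if (isdigit((unsigned char)*p)) {
            t->kind = TOK_NUMBER;
            while (isdigit((unsigned char)p[len]))
                len++;
        } else if (isalpha((unsigned char)*p)) {
            t->kind = TOK_WORD;
            while (isalpha((unsigned char)p[len]))
                len++;
        } else {
            return -1;
        }

        if (len >= sizeof t->text)
            return -1;
        memcpy(t->text, p, len);
        t->text[len] = '\0';
        p += len;
        count++;
    }
    return count;
}
```

Calling `tokenize("(inc 13)", toks, 8)` with an array of eight `struct token` fills in four tokens: an open parenthesis, the word inc, the number 13, and a close parenthesis.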

slide 3

Brief Description of the Lab

Part 1: Implement FIZ without user-defined functions (50%), due Feb 9

Part 2: Implement user-defined functions (50%), due Feb 16

• Part 2 is significantly harder than Part 1. Do not wait until the last week.

slide 4

Using Lex/Flex with YACC/Bison

slide 5

Files Provided: fiz.l

"inc"  { return INC; }
"("    { return OPENPAR; }
")"    { return CLOSEPAR; }

0|[1-9][0-9]*  {
    yylval.number_val = atoi(yytext);
    return NUMBER;
}

[ \t\n]  { /* Discard spaces, tabs, and new lines */ }

.  { printf("Syntax error. Did not recognize %s\n", yytext); }

slide 6

Files Provided: fiz.y

/*******************************************************
 * Section 1: Definition of tokens and non-terminals   *
 *******************************************************/

%token <number_val> NUMBER
%token INC OPENPAR CLOSEPAR
%type <node_val> expr

%union {
    char *string_val;
    int number_val;
    struct TREE_NODE *node_val;
}

%token <number_val> NUMBER: the NUMBER token carries a value, stored in the number_val field.

%token INC OPENPAR CLOSEPAR: these three tokens have no value.

%type <node_val> expr: a parsed expr has a pointer to a node in an Abstract Syntax Tree associated with it.

%union: this defines the union of values that can be associated with a token or non-terminal when parsing.
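As a rough illustration of how the union is used, the sketch below mirrors the %union in plain C and shows the NUMBER action from fiz.l filling in the number_val field. The names `union semantic_value` and `number_value_from_text` are hypothetical; in the real build, Bison generates the union type and the scanner writes to yylval directly.

```c
#include <stdlib.h>

struct TREE_NODE;   /* defined elsewhere in the lab code */

/* Mirrors the %union from fiz.y (hypothetical type name for this sketch). */
union semantic_value {
    char *string_val;
    int number_val;
    struct TREE_NODE *node_val;
};

/* Same conversion as the NUMBER action in fiz.l:
 * the matched digits are turned into an int and stored in number_val. */
union semantic_value number_value_from_text(const char *text)
{
    union semantic_value v;
    v.number_val = atoi(text);
    return v;
}
```

Only one field of the union is meaningful at a time, which is why each token or non-terminal is declared with the field it uses (`<number_val>` for NUMBER, `<node_val>` for expr).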

slide 7

Files Provided: fiz.y

/**************************************************
 * Section 3: Grammar production rules            *
 **************************************************/

goal: statements;

statements: statement | statement statements;

statement: expr {
    err_value = 0;
    resolve($1, NULL);
    if (err_value == 0) {
        printf("%d\n", eval($1, NULL));
    }
    prompt();
};

The code marked in red on the original slide is currently unnecessary. It is needed once user-defined functions are implemented.

The code marked in green evaluates the expression. $1 refers to the AST node associated with the 1st element in the grammar rule, namely expr.

slide 8

Abstract Syntax Tree

An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in that it does not represent every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by a single node with three branches.

slide 9

Abstract Syntax Tree: An Example

IFZ_NODE
    ARG_NAME (strValue = "y")
    ARG_NAME (strValue = "x")
    FUNC_CALL (name = "add")
        INC_NODE
            ARG_NAME (strValue = "x")
        DEC_NODE
            ARG_NAME (strValue = "y")

The above is an AST for (ifz y x (add (inc x) (dec y))), the body of the function (add x y).

Consider how evaluating (add 4 1) would work.
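One way to think about the evaluation is to transcribe the tree into a recursive C function. This is a sketch under assumed semantics that the slides do not spell out: (ifz c a b) yields a when c is 0 and b otherwise, (inc e) is e + 1, and (dec e) is e - 1. The name `fiz_add` is hypothetical, and y is assumed to be non-negative.

```c
/* Direct transcription of (ifz y x (add (inc x) (dec y))).
 * Assumed semantics: ifz picks its second branch when the first is 0,
 * otherwise its third branch. Assumes y >= 0, else the recursion never ends. */
int fiz_add(int x, int y)
{
    if (y == 0)                     /* IFZ_NODE tests its first branch, y */
        return x;                   /* second branch: ARG_NAME x */
    return fiz_add(x + 1, y - 1);   /* third branch: FUNC_CALL add on INC_NODE, DEC_NODE */
}
```

Under these assumptions, (add 4 1) first sees y = 1, which is not zero, so it evaluates (add 5 0); that call sees y = 0 and returns x, which is 5.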

slide 10

Grammar for expr

expr: OPENPAR INC expr CLOSEPAR {
    struct TREE_NODE *node = (struct TREE_NODE *) malloc(sizeof(struct TREE_NODE));
    node->type = INC_NODE;
    node->first_arg = $3;
    $$ = node;
}

The above production rule (grammar rule) parses (inc <expr>).

It creates a node in the abstract syntax tree, sets its type to INC_NODE, and stores the tree node for <expr> in first_arg, since this is the first (and only) argument of (inc <expr>).

$3 refers to the value associated with the 3rd element in the grammar rule, i.e., expr in the body.

$$ refers to the value associated with expr on the left-hand side.

slide 11

Continuing grammar for expr

| NUMBER {
    struct TREE_NODE *node = (struct TREE_NODE *) malloc(sizeof(struct TREE_NODE));
    node->type = NUMBER_NODE;
    node->intValue = $1;
    $$ = node;
};

The above production rule (grammar rule) parses a number into an expr.

It creates a node in the abstract syntax tree, sets its type to NUMBER_NODE, and stores the integer value in the intValue field.

$1 refers to the value associated with the 1st element in the grammar rule, i.e., NUMBER in the body.

$$ refers to the value associated with expr on the left-hand side.

slide 12

What Happens During Parsing?

Input (inc (inc 1))

Becomes tokens: OPENPAR INC OPENPAR INC NUMBER CLOSEPAR CLOSEPAR

This is parsed into statement in the following steps:

statement: expr

expr: OPENPAR INC expr CLOSEPAR

expr: OPENPAR INC NUMBER CLOSEPAR

INC_NODE
    INC_NODE
        NUMBER_NODE (intValue = 1)
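The tree for (inc (inc 1)) can also be built by hand in C, following the same steps as the grammar actions. This is a sketch: the struct is a reduced TREE_NODE with only the fields shown on the slides, and `make_number`, `make_inc`, `eval_sketch`, and `free_tree` are hypothetical helpers (the lab's own eval takes a second argument and handles more node types).

```c
#include <assert.h>
#include <stdlib.h>

/* Node kinds named on the slides; the numeric values are arbitrary here. */
enum node_type { NUMBER_NODE, INC_NODE };

/* Reduced TREE_NODE with only the fields the slides show. */
struct TREE_NODE {
    enum node_type type;
    int intValue;                   /* used by NUMBER_NODE */
    struct TREE_NODE *first_arg;    /* used by INC_NODE */
};

/* Same steps as the action of the NUMBER rule. */
struct TREE_NODE *make_number(int value)
{
    struct TREE_NODE *node = malloc(sizeof(struct TREE_NODE));
    assert(node != NULL);
    node->type = NUMBER_NODE;
    node->intValue = value;
    node->first_arg = NULL;
    return node;
}

/* Same steps as the action of the (inc <expr>) rule. */
struct TREE_NODE *make_inc(struct TREE_NODE *arg)
{
    struct TREE_NODE *node = malloc(sizeof(struct TREE_NODE));
    assert(node != NULL);
    node->type = INC_NODE;
    node->intValue = 0;
    node->first_arg = arg;
    return node;
}

/* Hypothetical evaluator for these two node kinds only. */
int eval_sketch(const struct TREE_NODE *node)
{
    switch (node->type) {
    case NUMBER_NODE:
        return node->intValue;
    case INC_NODE:
        return eval_sketch(node->first_arg) + 1;
    }
    return 0;   /* not reached for the two kinds above */
}

/* Release a tree built by the helpers above. */
void free_tree(struct TREE_NODE *node)
{
    if (node == NULL)
        return;
    free_tree(node->first_arg);
    free(node);
}
```

The parser builds the innermost node first, so the equivalent call is `make_inc(make_inc(make_number(1)))`; evaluating that tree with `eval_sketch` gives 3.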

slide 13

Regular Expressions: Tool for Lexical Analyzer

Regular expression: A notation to specify a pattern that matches a set of strings

A regular expression can be:

a matches the single character a

R1|R2 matches anything that matches either R1 or R2

(R) matches the same thing as R

[abcde] any of the five letters listed there, i.e., a|b|c|d|e

[0-9] any digit

slide 14

Regular Expressions

R1R2 matches a string s if s is the concatenation s1s2, where s1 matches R1 and s2 matches R2

E.g., [abcde][0-9] matches one of the letters a through e followed by one digit, such as a0 or e9

R* repeating the regular expression R zero or more times

E.g., [0-9]* matches the empty string and any digit sequence

R+ repeating R one or more times

Equivalent to the regular expression R R*
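These operators can be tried out from C with the POSIX `<regex.h>` interface. This is a sketch, not part of the lab: POSIX extended regular expressions are a different engine from Flex, although the constructs used here (character classes, concatenation, `*`, `+`) behave the same way. The helper `matches_whole` is a hypothetical name; it wraps the pattern in `^(` and `)$` so that the entire string must match.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the whole string `s` matches the POSIX extended regular
 * expression `pattern`, 0 if it does not, and -1 if the pattern is too
 * long for the buffer or fails to compile. */
int matches_whole(const char *pattern, const char *s)
{
    char anchored[256];
    regex_t re;

    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int result = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return result;
}
```

With this helper, `[abcde][0-9]` accepts "c7" but not "f7", `[0-9]*` accepts the empty string, and `[0-9]+` rejects the empty string while accepting the same non-empty digit strings as `[0-9][0-9]*`.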

slide 15

RE Syntax in Lex/Flex

http://flex.sourceforge.net/manual/Patterns.html

‘x’ match the character 'x' 

‘.’ any character (byte) except newline

‘[xyz]’ a character class; in this case, the pattern matches either an 'x', a 'y', or a 'z'

‘[abj-oZ]’ a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'

‘[^A-Z]’ a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.

slide 16

RE Syntax in Lex/Flex

http://flex.sourceforge.net/manual/Patterns.html

‘[^A-Z\n]’ any character EXCEPT an uppercase letter or a newline

‘[a-z]{-}[aeiou]’ the lowercase consonants 

‘r*’ zero or more r's, where r is any regular expression

‘r+’ one or more r's 

‘r?’ zero or one r's (that is, “an optional r”)

‘r{2,5}’ anywhere from two to five r's

‘r{2,}’ two or more r's

‘r{4}’ exactly 4 r's

slide 17

RE Syntax in Lex/Flex

http://flex.sourceforge.net/manual/Patterns.html

‘{name}’ the expansion of the ‘name’ definition

‘"[xyz]\"foo"’ the literal string: ‘[xyz]"foo’

‘(r)’ match an ‘r’; parentheses are used to override precedence

‘rs’ the regular expression ‘r’ followed by the regular expression ‘s’; called concatenation

‘r|s’ either an ‘r’ or an ‘s’

‘^r’ an ‘r’, but only at the beginning of a line

‘r$’ an ‘r’, but only at the end of a line

slide 18

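Examples

Regular expression for a non-negative integer:

Is [0-9]* correct?

Only if allowing leading zeros such as 00123 is okay. It also matches the empty string, so [0-9]+ is needed to require at least one digit.

0|[1-9][0-9]* is better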

Regular expression for an identifier:

Rule 1: An identifier consists of letters and digits.

Rule 2:  First character of any identifier must be a letter.

How to write the regular expression?

[a-zA-Z][a-zA-Z0-9]*
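To see what these two patterns accept, they can be written out as hand-coded recognizers in C. This is a sketch with hypothetical function names; it assumes the default "C" locale, where `isalpha` accepts exactly a-z and A-Z. A Flex-generated scanner does the equivalent work with a table-driven automaton.

```c
#include <ctype.h>

/* Recognizer for 0|[1-9][0-9]* : returns 1 if all of `s` matches, else 0. */
int is_integer_literal(const char *s)
{
    if (s[0] == '0')
        return s[1] == '\0';        /* "0" alone; no leading zeros allowed */
    if (s[0] < '1' || s[0] > '9')
        return 0;                   /* must start with 1-9; also rejects "" */
    for (s++; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/* Recognizer for [a-zA-Z][a-zA-Z0-9]* : returns 1 if all of `s` matches, else 0. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))
        return 0;                   /* Rule 2: first character must be a letter */
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))
            return 0;               /* Rule 1: only letters and digits */
    return 1;
}
```

Here `is_integer_literal` accepts "0" and "123" but rejects "00123" and the empty string, and `is_identifier` accepts "x1" but rejects "1x".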

slide 19

More Questions on RE

How would you write a regular expression that matches comments, assuming that a comment is anything between ; and the end of the line?

slide 20

Review

Able to write simple regular expressions to match strings.

Given a regular expression, able to tell which strings it matches and which it does not.