
Upload: bryce-underwood

Post on 05-Jan-2016


by Neng-Fa Zhou

Lexical Analysis

Why separate lexical and syntax analyses?
– simpler design
– efficiency
– portability


Tokens, Patterns, Lexemes

– Tokens
• Terminal symbols in the grammar

– Patterns
• Description of a class of tokens

– Lexemes
• Words in the source program
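To make the three terms concrete, here is a minimal Python sketch. The token names (NUM, ID, OP), their patterns, and the sample source line are assumptions made up for illustration; they do not come from the slides.

```python
import re

# token name -> pattern describing that class of tokens (illustrative only)
PATTERNS = {
    "NUM": r"[0-9]+",
    "ID": r"[A-Za-z][A-Za-z0-9]*",
    "OP": r"[+\-*/=]",
}

def classify(lexeme):
    """Return the token whose pattern describes the whole lexeme, or None."""
    for token, pattern in PATTERNS.items():
        if re.fullmatch(pattern, lexeme):
            return token
    return None

# The lexemes are the actual words of the source program.
source = "count = count + 42"
for lexeme in source.split():
    print(f"lexeme {lexeme!r:10} token {classify(lexeme)}")
```

Here "count" and "42" are lexemes, ID and NUM are the tokens they belong to, and the regular expressions are the patterns.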


Languages

– Fixed and finite alphabet (vocabulary)
– Finite length sentences
– Possibly infinite number of sentences

Examples
– Natural numbers {1,2,3,...,10,11,...}
– Strings over {a,b}: a^n b a^n

Terms for parts of a string
– prefix, suffix, substring, proper ....


Operations on Languages


Examples

L = {A,B,...,Z,a,b,...,z}
D = {0,1,...,9}

L ∪ D : the set of letters and digits
LD : a letter followed by a digit
L^4 : four-letter strings
L* : all strings of letters, including ε (the empty string)
L(L ∪ D)* : strings of letters and digits beginning with a letter
D+ : strings of one or more digits
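The operations used in these examples (union, concatenation, powers, closure) can be tried out on small finite sets. The sketch below is a rough Python illustration; the helper names are hypothetical, the two-element sets stand in for the full letter and digit sets, and the closure is cut off at a maximum length because L* itself is infinite.

```python
def concat(A, B):
    """Concatenation AB: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}

def power(A, n):
    """A^n: A concatenated with itself n times; A^0 contains only the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

def closure_up_to(A, max_power):
    """Finite approximation of A*: the union of A^0 through A^max_power."""
    result = set()
    for n in range(max_power + 1):
        result |= power(A, n)
    return result

L = {"a", "b"}   # small stand-in for the set of letters
D = {"0", "1"}   # small stand-in for the set of digits

print(sorted(L | D))                      # union of letters and digits
print(sorted(concat(L, D)))               # LD: a letter followed by a digit
print(len(power(L, 4)))                   # number of four-letter strings
identifiers = concat(L, closure_up_to(L | D, 2))
print("a01" in identifiers, "0a" in identifiers)
```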


Regular Expression (RE)

ε is a RE
a symbol in Σ is a RE
Let r and s be REs.

– (r) | (s) : or
– (r)(s) : concatenation
– (r)* : zero or more instances
– (r)+ : one or more instances
– (r)? : zero or one instance


Precedence of Operators

high    r* r+ r?
        rs
low     r|s

all left associative

Examples
Σ = {a,b}
1. a|b
2. (a|b)(a|b)
3. a*
4. (a|b)*
5. a | a*b
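The five example expressions can be checked with Python's re module, which gives these operators the same relative precedence. This is only an illustrative sketch: the helper name accepts and the test strings are assumptions, and re.fullmatch is used so that the whole string must be in the language.

```python
import re

def accepts(pattern, text):
    """True if the entire text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None

examples = {
    "a|b": ["a", "b", "ab"],
    "(a|b)(a|b)": ["ab", "bb", "a"],
    "a*": ["", "aaa", "ab"],
    "(a|b)*": ["", "abba", "abc"],
    # Read as a|((a*)b): closure first, then concatenation, then alternation.
    "a|a*b": ["a", "aaab", "aa"],
}

for pattern, texts in examples.items():
    for text in texts:
        print(f"{pattern!r:14} {text!r:8} {accepts(pattern, text)}")
```

The last expression shows the precedence at work: "a" is accepted by a|a*b but would be rejected by (a|a*)b.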


Algebraic Properties of RE


Regular Definitions

d1 → r1
d2 → r2
....
dn → rn

ri is a RE over Σ ∪ {d1,d2,...,di-1}
not recursive
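One way to see why the definitions must not be recursive is to expand them in order, substituting each earlier name into later right-hand sides. The Python sketch below does this for Lex-style {name} references; the definition names and the expand helper are hypothetical and only meant as an illustration.

```python
import re

# Ordered definitions; a right-hand side may only use names defined above it.
DEFINITIONS = [
    ("digit", r"[0-9]"),
    ("letter", r"[A-Za-z]"),
    ("id", r"{letter}({letter}|{digit})*"),
    ("int", r"{digit}+"),
]

def expand(definitions):
    """Expand the definitions in order into plain regular expressions."""
    expanded = {}
    for name, body in definitions:
        def substitute(match):
            ref = match.group(1)
            if ref not in expanded:
                raise ValueError(f"{name} uses {ref}, which is not defined earlier")
            return "(?:" + expanded[ref] + ")"
        expanded[name] = re.sub(r"\{([A-Za-z_]+)\}", substitute, body)
    return expanded

regexes = expand(DEFINITIONS)
print(regexes["id"])
print(bool(re.fullmatch(regexes["id"], "x25")))
```

A definition that refers to itself, or to a later name, is rejected because the name is not yet in the expanded table when it is needed.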

Example-1

%{
int num_lines = 0, num_chars = 0;
%}
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
    return 0;
}

/* Return 1 at end of input so that yylex() stops scanning. */
int yywrap(void) { return 1; }


Example-2

D       [0-9]
INT     {D}{D}*
%%
{INT}("."{INT}((e|E)("+"|-)?{INT})?)?   {printf("valid %s\n",yytext);}
.                                       {printf("unrecognized %s\n",yytext);}
%%
int main(int argc, char *argv[])
{
    ++argv, --argc;     /* skip over the program name */
    if (argc>0) yyin = fopen(argv[0],"r"); else yyin = stdin;
    yylex();
    return 0;
}

/* Return 1 at end of input so that yylex() stops scanning. */
int yywrap(void) { return 1; }

java.util.regex

import java.util.regex.*;

class Number {
    public static void main(String[] args){
        String regExNum = "\\d+(\\.\\d+((e|E)(\\+|-)?\\d+)?)?";
        if (Pattern.matches(regExNum,args[0]))
            System.out.println("valid");
        else
            System.out.println("invalid");
    }
}

String Pattern Matching in Perl

print "Input a string :";
$_ = <STDIN>;
chomp($_);
if (/^[0-9]+(\.[0-9]+((e|E)(\+|-)?[0-9]+)?)?$/){
    print "valid\n";
} else {
    print "invalid\n";
}


Finite Automata

Nondeterministic finite automaton (NFA)

NFA = (S,T,s0,F)

– S: a set of states
– T: a transition mapping
– s0: the start state
– F: final states or accepting states
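The four components map naturally onto plain Python data. The automaton below (strings over {a, b} that end in "a") is an assumed example, not the one pictured on the slides, and it has no ε-transitions, so the simulation only has to track the set of states the NFA could currently be in.

```python
# NFA = (S, T, s0, F); this particular automaton is illustrative.
S = {0, 1}
T = {                       # transition mapping: (state, symbol) -> set of states
    (0, "a"): {0, 1},
    (0, "b"): {0},
}
s0 = 0
F = {1}

def nfa_accepts(text):
    """Follow every possible transition at once and accept if any path ends in F."""
    current = {s0}
    for symbol in text:
        current = set().union(*(T.get((s, symbol), set()) for s in current))
    return bool(current & F)

print(nfa_accepts("abba"), nfa_accepts("ab"))
```

Because T maps a state and symbol to a set of states, state 0 can stay in 0 or move to 1 on "a"; that choice is what makes the automaton nondeterministic.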


Example


Deterministic Finite Automata (DFA)

T: a transition function
There is only one arc going out from each node on each symbol.


Simulating a DFA

s = s0;
c = nextchar;
while (c != eof) {
    s = move(s,c);
    c = nextchar;
}
if (s is in F)
    return "yes";
else
    return "no";
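The same loop can be written as runnable Python. The transition table below is an assumed DFA for strings over {a, b} that end in "a"; the names MOVE, START, and FINAL are hypothetical, and symbols outside the alphabet are not handled (they raise KeyError).

```python
# DFA for strings over {a, b} ending in "a" (illustrative table).
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 0,
}
START = 0
FINAL = {1}

def simulate(text):
    """Run the DFA over text and report whether it ends in a final state."""
    s = START
    for c in text:              # the loop ends where the pseudocode sees eof
        s = MOVE[(s, c)]        # s = move(s, c)
    return "yes" if s in FINAL else "no"

print(simulate("abba"), simulate("ab"))
```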


From RE to NFA

– a in Σ

– s|t


From RE to NFA (cont.)

– st

– s*


Example

(a|b)*a
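The construction rules for a single symbol, s|t, st, and s* can be sketched as small Python functions and then applied to (a|b)*a. This is an illustrative sketch, not the slides' own code: the Fragment class and function names are hypothetical, and concatenation links the two parts with an ε-edge instead of merging states as the textbook construction does, which accepts the same language.

```python
from itertools import count

EPS = None                      # label used for ε-transitions
_ids = count()                  # source of fresh state numbers

class Fragment:
    """An NFA fragment with one start state, one accepting state, and edges."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def symbol(a):                  # a in the alphabet
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])

def union(x, y):                # s|t
    s, f = next(_ids), next(_ids)
    edges = x.edges + y.edges + [
        (s, EPS, x.start), (s, EPS, y.start),
        (x.accept, EPS, f), (y.accept, EPS, f),
    ]
    return Fragment(s, f, edges)

def concat(x, y):               # st, joined by an ε-edge (a simplification)
    return Fragment(x.start, y.accept,
                    x.edges + y.edges + [(x.accept, EPS, y.start)])

def star(x):                    # s*
    s, f = next(_ids), next(_ids)
    edges = x.edges + [
        (s, EPS, x.start), (s, EPS, f),
        (x.accept, EPS, x.start), (x.accept, EPS, f),
    ]
    return Fragment(s, f, edges)

def eps_closure(states, edges):
    """All states reachable from the given states by ε-edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for p, label, q in edges:
            if p == u and label is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(nfa, text):
    current = eps_closure({nfa.start}, nfa.edges)
    for c in text:
        moved = {q for p, label, q in nfa.edges if p in current and label == c}
        current = eps_closure(moved, nfa.edges)
    return nfa.accept in current

nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))   # (a|b)*a
print(accepts(nfa, "abba"), accepts(nfa, "ab"))
```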


Building Lexical Analyzer

RE → NFA → DFA → Emulator

RE → NFA: Algorithm 3.23 (Thompson's construction)
NFA → DFA: Algorithm 3.32 (Subset construction)


Conversion of an NFA into a DFA

Intuition
– move(s,a) is a function in a DFA
– move(s,a) is a mapping in an NFA

A state reachable from s0 in the DFA on an input string corresponds to a set of states in NFA that are reachable on the same string.


Computation of ε-Closure

ε-Closure(T): Set of NFA states reachable from some NFA state s in T by ε-transitions alone.


From an NFA to a DFA (The subset construction)
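A compact Python sketch of the subset construction follows. The NFA used here (with ε-transitions, for strings over {a, b} ending in "a") and all names are assumptions for illustration; each DFA state is a frozenset of NFA states, obtained as the ε-closure of a move.

```python
EPS = None      # label for ε-transitions

# Illustrative NFA for (a|b)*a: (state, label) -> set of next states.
NFA_EDGES = {
    (0, EPS): {1},
    (1, "a"): {1},
    (1, "b"): {1},
    (1, EPS): {2},
    (2, "a"): {3},
}
NFA_START, NFA_FINAL, ALPHABET = 0, {3}, "ab"

def eps_closure(states):
    """States reachable from the given NFA states by ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA_EDGES.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """NFA states reachable from any of the given states on one symbol."""
    return {t for s in states for t in NFA_EDGES.get((s, symbol), ())}

def subset_construction():
    start = eps_closure({NFA_START})
    dtran, unmarked, seen = {}, [start], {start}
    while unmarked:
        T = unmarked.pop()
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            dtran[(T, a)] = U
            if U not in seen:       # a new DFA state still to be processed
                seen.add(U)
                unmarked.append(U)
    final = {T for T in seen if T & NFA_FINAL}
    return start, dtran, final

def dfa_accepts(text):
    state = start
    for c in text:
        state = dtran[(state, c)]
    return state in final

start, dtran, final = subset_construction()
for (T, a), U in dtran.items():
    print(sorted(T), a, "->", sorted(U))
```

A DFA state is accepting when its set contains at least one accepting NFA state.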


Example

NFA

DFA


Algorithm 3.39

Π := {F, S-F};
do begin
    for each group G in Π do begin
        partition G into subgroups such that two states s and t
        of G are in the same subgroup iff for all input symbols a,
        s and t have transitions on a to states in the same group;

        replace G in Πnew by the set of all subgroups formed;
    end
    if (Πnew = Π) return;
    Π := Πnew;
end;


Example

        a       b
AC      B       AC
B       B       D
D       B       E
E       B       AC
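The partition refinement can be sketched in Python as below. The five-state input DFA (states A to E, accepting state E) is an assumption: it is the usual textbook DFA for (a|b)*abb, chosen because minimizing it merges A and C as in the table above. The function name minimize and its parameters are hypothetical.

```python
def minimize(states, alphabet, move, final):
    """Refine {F, S-F} until no group can be split; return the final partition."""
    partition = [g for g in (frozenset(final), frozenset(states - final)) if g]
    while True:
        def group_of(s):
            # Index of the group of the current partition that contains s.
            return next(i for i, g in enumerate(partition) if s in g)

        new_partition = []
        for G in partition:
            subgroups = {}
            for s in G:
                # s and t stay together iff every symbol leads to the same group.
                key = tuple(group_of(move[(s, a)]) for a in alphabet)
                subgroups.setdefault(key, set()).add(s)
            new_partition.extend(frozenset(g) for g in subgroups.values())
        # Groups are only ever split, so an unchanged count means no change.
        if len(new_partition) == len(partition):
            return set(new_partition)
        partition = new_partition

STATES = set("ABCDE")
MOVE = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
groups = minimize(STATES, "ab", MOVE, {"E"})
print(sorted("".join(sorted(g)) for g in groups))
```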

Construct a DFA Directly from a Regular Expression

Implementation Issues

Input buffering
– Read in characters one by one
  • Unable to look ahead
  • Inefficient
– Read in a whole string and store it in memory
  • Requires a big buffer
– Buffer pairs


Buffer Pairs


Use Sentinels


Lexical Analyzer


Lex

A tool for automatically generating lexical analyzers


Lex Specifications

declarations
%%
translation rules
%%
auxiliary procedures

translation rules:
p1 {action1}
p2 {action2}
...
pn {actionn}


Lex Regular Expressions


yylex()

yylex(){
    switch (pattern_match()){
        case 1: {action1}
        case 2: {action2}
        ...
        case n: {actionn}
    }
}
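What pattern_match() has to decide can be sketched in Python: try every rule at the current position, take the longest match, and on a tie prefer the rule listed first. The rules, token names, and the scan function below are assumptions for illustration, and unlike Lex (which echoes unmatched characters by default) this sketch raises an error when no rule matches.

```python
import re

# (pattern, action) pairs in the order they would appear in the rules section.
RULES = [
    (re.compile(r"if|then"), lambda text: ("KEYWORD", text)),
    (re.compile(r"[a-z][a-z0-9]*"), lambda text: ("ID", text)),
    (re.compile(r"[0-9]+"), lambda text: ("INT", text)),
    (re.compile(r"[ \t\n]+"), lambda text: None),       # skip white space
]

def scan(source):
    """Repeatedly pick the longest match and run the action of the winning rule."""
    results, pos = [], 0
    while pos < len(source):
        best_len, best_action = 0, None
        for pattern, action in RULES:
            m = pattern.match(source, pos)
            # Strictly longer wins, so on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_len, best_action = len(m.group()), action
        if best_action is None:
            raise ValueError(f"no rule matches at position {pos}")
        value = best_action(source[pos:pos + best_len])
        if value is not None:
            results.append(value)
        pos += best_len
    return results

print(scan("if x1 then 42"))
print(scan("ifx"))
```

The second call shows why keywords can be listed before identifiers: "if" alone ties and goes to the keyword rule, while "ifx" is a longer identifier match.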


Example

%{
#include <stdlib.h>     /* atoi, atof */
%}
DIGIT   [0-9]
ID      [a-z][a-z0-9]*
%%
{DIGIT}+                {printf("An integer:%s(%d)\n",yytext,atoi(yytext));}
{DIGIT}+"."{DIGIT}*     {printf("A float: %s (%g)\n",yytext,atof(yytext));}
if|then|begin|end|procedure|function    {printf("A keyword: %s\n",yytext);}
{ID}                    {printf("An identifier %s\n",yytext);}
"+"|"-"|"*"|"/"         {printf("An operator %s\n",yytext);}
"{"[^}\n]*"}"           {/* eat up one-line comments */}
[ \t\n]+                {/* eat up white space */}
.                       {printf("Unrecognized character: %s\n", yytext);}
%%
int main(int argc, char *argv[])
{
    ++argv, --argc;     /* skip over the program name */
    if (argc>0) yyin = fopen(argv[0],"r"); else yyin = stdin;
    yylex();
    return 0;
}