
Upload: avice-fleming

Post on 05-Jan-2016


241-437 Compilers: lex analysis/2 1

Compiler Structures

• Objective
  – what is lexical analysis?
  – look at a lexical analyzer for a simple 'expressions' language

241-437, Semester 1, 2011-2012

2. Lexical Analysis


Overview

1. Why Lexical Analysis?

2. Using a Lexical Analyzer

3. Implementing a Lexical Analyzer

4. Regular Expressions (REs)

5. The Expressions Language

6. exprTokens.c

7. From REs to Code Automatically


In this lecture

[Figure: compiler phases. Source Program → Front End (Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Int. Code Generator) → Intermediate Code → Back End (Code Optimizer → Target Code Generator) → Target Lang. Prog.]


1. Why Lexical Analysis?

• Stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants)

• Simplifies the design of the rest of the compiler
  – the code uses tokens, not strings or characters

• Can be implemented efficiently
  – by hand or automatically

• Improves portability
  – non-standard symbols / foreign characters are translated here, so do not affect the rest of the compiler


2. Using a Lexical Analyzer

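[Figure: the Source Program feeds the Lexical Analyzer (using chars), which serves the Syntax Analyzer (using tokens).]

1. Get next token (Syntax Analyzer → Lexical Analyzer)

2. Get chars to make a token (Lexical Analyzer → Source Program)

3. Token, token value (Lexical Analyzer → Syntax Analyzer)

The Lexical Analyzer reports lexical errors; the Syntax Analyzer reports syntax errors.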


A Source Program is Chars

Consider the program fragment:

if (i==j);
    z=1;
else;
    z=0;
endif;

The lexical analyzer reads it in as a string of characters:

if_(i==j);\n\tz=1;\nelse;\n\tz=0;\nendif;

Lexical analysis divides the string into tokens.


Tokens and Token Values

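[Figure: the Lexical Analyzer gets chars from the input and the Syntax Analyzer gets tokens from it, one at a time. Each item is a <token, token value> pair.]

Input chars:

"y = 31 + 28*foo"

Output tokens:

<id, "y"> <=, > <int, 31> <+, > <int, 28> <*, > <id, "foo">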
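One way to picture a <token, token value> pair in C is a small struct. This is only a sketch: the names TokKind, Tok and showTok are hypothetical and are not taken from exprTokens.c, which stores its token in global variables instead.

```c
#include <stdio.h>

/* Hypothetical token kinds for the example "y = 31 + 28*foo". */
typedef enum { TK_ID, TK_INT, TK_ASSIGN, TK_PLUS, TK_TIMES } TokKind;

/* A token paired with its (optional) token value. */
typedef struct {
    TokKind kind;
    const char *text;   /* used when kind == TK_ID */
    int number;         /* used when kind == TK_INT */
} Tok;

/* Write the <token, value> form of t into buf; returns buf. */
char *showTok(const Tok *t, char *buf, size_t size)
{
    switch (t->kind) {
    case TK_ID:     snprintf(buf, size, "<id, \"%s\">", t->text); break;
    case TK_INT:    snprintf(buf, size, "<int, %d>", t->number);  break;
    case TK_ASSIGN: snprintf(buf, size, "<=, >");                 break;
    case TK_PLUS:   snprintf(buf, size, "<+, >");                 break;
    case TK_TIMES:  snprintf(buf, size, "<*, >");                 break;
    }
    return buf;
}
```

Operators carry no value, so only the id and int cases read the value fields.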


Tokens, Lexemes, and Patterns

• A token is a lexical type
  – e.g. id, int

• A lexeme is a token value
  – e.g. "abc", 123

• A pattern says how to make a token from chars
  – e.g. id = letter followed by letters and digits
         int = non-empty sequence of digits
  – a pattern is defined using regular expressions (REs)


3. Implementing a Lexical Analyzer

Issues:

• Lookahead
  – how to group chars into tokens

• Ignoring whitespace and comments.

• Separating variables from keywords
  – e.g. "if", "else"

• (Automatically) translating REs into a lexical analyzer.


Lookahead

• A token is created by reading in characters, and grouping them together.

• It is not always possible to decide if a token is finished without looking ahead at the next char.

• For example:
  – Is "i" a variable, or the first character of "if"?
  – Is "=" an assignment or the beginning of "=="?
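The "=" versus "==" case can be sketched with one character of lookahead. The sketch assumes the input is already held in a string (the scanner later in these notes reads stdin and uses ungetc() instead), and the names EqOp and scanEq are hypothetical.

```c
#include <stddef.h>

typedef enum { OP_ASSIGN, OP_EQUALS, OP_NONE } EqOp;

/* Assumes s[*pos] is the char to examine. On a match, *pos is moved
   past the operator; the char after a single '=' is only looked at,
   never consumed. */
EqOp scanEq(const char *s, size_t *pos)
{
    if (s[*pos] != '=')
        return OP_NONE;
    if (s[*pos + 1] == '=') {   /* one char of lookahead */
        *pos += 2;
        return OP_EQUALS;
    }
    *pos += 1;
    return OP_ASSIGN;
}
```

Reading s[*pos + 1] is safe here because s[*pos] is '=', so the next position is at worst the string terminator.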


4. Regular Expressions (REs)

• REs are an algebraic way of specifying how to recognise input
  – 'algebraic' means that the recognition pattern is defined using RE operands and operators

Covered in more detail in 240-304 "maths for CoE"


4.1. REs in grep

• grep searches input lines, a line at a time.

• If the line contains a string that matches grep's RE (pattern), then the line is output.

[Figure: input lines (e.g. from a file) → grep "RE" → output matching lines (e.g. to a file)]

Example input lines:

hello andy
my name is andy
my bye byhe


Examples

grep "and"

  input:
    hello andy
    my name is andy
    my bye byhe

  output:
    hello andy
    my name is andy

grep -E "an|my"     ("|" means "or")

  input:
    hello andy
    my name is andy
    my bye byhe

  output:
    hello andy
    my name is andy
    my bye byhe


grep "hel*"     ("*" means "0 or more")

  input:
    hello andy
    my name is andy
    my bye byhe

  output:
    hello andy
    my bye byhe


4.2. The RE Language

• A RE defines a pattern which recognises (matches) a set of strings
  – e.g. a RE can be defined that recognises the strings { aa, aba, abba, abbba, abbbba, … }

• These recognisable strings are sometimes called the RE’s language.


RE Operands

• There are 4 basic kinds of operands:
  – characters (e.g. 'a', '1', '(')
  – the symbol ε (means an empty string '')
  – the symbol {} (means the empty set)
  – variables, which can be assigned a RE
      • variable = RE


RE Operators

• There are three basic operators:
  – union '|'
  – concatenation
  – closure *


Union

• S | T
  – this RE can use the S or T RE to match strings

• Example REs:
    a | b        matches strings {a, b}
    a | b | c    matches strings {a, b, c}


Concatenation

• S T
  – this RE will use the S RE followed by the T RE to match against strings

• Example REs:
    a b          matches the string {ab}
    w | (a b)    matches the strings {w, ab}


• What strings are matched by the RE
    (a | ab) (c | bc)

• Equivalent to:
    {a, ab} followed by {c, bc}
    => {ac, abc, abc, abbc}
    => {ac, abc, abbc}
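The same calculation can be sketched in C: join every string from the first set with every string from the second, keeping each distinct result once. The function name concatSets and the fixed MAX_STRS / MAX_LEN limits are assumptions made for this illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_STRS 16
#define MAX_LEN  16

/* Concatenate every string in s with every string in t, storing each
   distinct result once in out. Returns the number of results.
   Assumes at most MAX_STRS results, each shorter than MAX_LEN. */
int concatSets(const char *s[], int ns, const char *t[], int nt,
               char out[][MAX_LEN])
{
    int count = 0;
    for (int i = 0; i < ns; i++) {
        for (int j = 0; j < nt; j++) {
            char buf[MAX_LEN];
            snprintf(buf, sizeof buf, "%s%s", s[i], t[j]);

            int seen = 0;                 /* is buf already in out? */
            for (int k = 0; k < count; k++)
                if (strcmp(out[k], buf) == 0)
                    seen = 1;
            if (!seen && count < MAX_STRS)
                strcpy(out[count++], buf);
        }
    }
    return count;
}
```

With {a, ab} and {c, bc} as inputs, "abc" is produced twice (a+bc and ab+c) but stored once, which is why four concatenations give only three strings.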


Closure

• S*
  – this RE can use the S RE 0 or more times to match against strings

• Example RE:
    a*    matches the strings:
          { ε, a, aa, aaa, aaaa, aaaaa, ... }
          (ε is the empty string)


4.3. REs for C Identifiers

• We define two RE variables, letter and digit:
    letter = A | B | C | D ... Z | a | b | c | d .... z
    digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

• id is defined using letter and digit:
    id = letter ( letter | digit )*


• Strings matched by id include:
    ab345    w    h5g

• Strings not matched:
    2    $abc    ****
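A hand-written check for the id pattern might look like the sketch below. The function name matchesId is hypothetical, and the code assumes the default "C" locale, where isalpha() and isalnum() accept only the ASCII letters and digits used in the RE.

```c
#include <ctype.h>

/* Returns 1 if the whole string s matches
   id = letter ( letter | digit )*, else 0. */
int matchesId(const char *s)
{
    if (!isalpha((unsigned char)s[0]))   /* must start with a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}
```

The first test mirrors the leading "letter" in the RE, and the loop mirrors the closure ( letter | digit )*, which may run zero times.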


4.4. RE Summary

Expression    Meaning
ε             Empty pattern
a             Any pattern represented by 'a'
ab            Strings with pattern 'a' followed by 'b'
a|b           Strings consisting of pattern 'a' or 'b'
a*            Zero or more occurrences of patterns in 'a'
a+            One or more occurrences of patterns in 'a'
a^3           Patterns in 'a' repeated exactly 3 times
a?            (a | ε); optional single pattern from 'a'
.             Any single character


More Operators

• See the regular expressions "cheat-sheet" at the course website in the "Useful Info" subdirectory:
  – over 80 operators!!


Wild Card Symbol: '.'

• The '.' stands for any character except the newline
  – e.g. grep 'a..b.$' chapter1.txt
         grep 't.*t.*t' /usr/share/dict/words

    (/usr/share/dict/words is the UNIX/Linux 'dictionary')


grep "a..b." /usr/share/dict/words

  input (/usr/share/dict/words):
    A
    A's
    AOL
    AOL's
    :
    :

  output:
    adobe
    alibi
    ameba


4.5. REs for Integers and Floats

• We redefine digit:
    digit = 0|1|2|3|4|5|6|7|8|9
  or
    digit = [0-9]

• int and float:
    int = {digit}+
    float = {digit}+ "." {digit}+


• Integers and floats with exponents:
    number = {digit}+ ('.' {digit}+ )? ( 'E'('+'|'-')? {digit}+ )?
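The number RE can be turned into a hand-coded matcher by handling each part in order: the mandatory digits, then the optional fraction, then the optional exponent. This is a sketch only; the names skipDigits and matchesNumber are hypothetical.

```c
#include <ctype.h>

/* Skip a run of digits starting at s; return a pointer just past it. */
static const char *skipDigits(const char *s)
{
    while (isdigit((unsigned char)*s))
        s++;
    return s;
}

/* Returns 1 if the whole string s matches
   digit+ ('.' digit+)? ('E' ('+'|'-')? digit+)?, else 0. */
int matchesNumber(const char *s)
{
    const char *p = skipDigits(s);
    if (p == s)                      /* digit+ needs at least one digit */
        return 0;

    if (*p == '.') {                 /* optional fraction */
        const char *q = skipDigits(p + 1);
        if (q == p + 1)              /* no digit after the '.' */
            return 0;
        p = q;
    }

    if (*p == 'E') {                 /* optional exponent */
        const char *q = p + 1;
        if (*q == '+' || *q == '-')
            q++;
        const char *r = skipDigits(q);
        if (r == q)                  /* no digit in the exponent */
            return 0;
        p = r;
    }
    return *p == '\0';               /* nothing may be left over */
}
```

Each "?" in the RE becomes an if, and each "+" becomes a digit run that must advance by at least one character.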


4.6 More on REs

See RE summary on the course website: regular_expressions_cheat_sheet.pdf

I have the standard RE book:
  – Mastering Regular Expressions
    Jeffrey E. F. Friedl
    O'Reilly & Associates


There are many websites that explain REs:

http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html

http://www.zytrax.com/tech/web/regex.htm

http://www.regular-expressions.info


5. The Expressions Language

• In my expressions language, a program is a series of expressions and assignments.

• Example:

// test2.txt example

let x56 = 2
let bing_BONG = (27 * 2) - x56
5 * (67 / 3)


5.1. REs for the Language

• alpha = a | b | c | ... | z | A | B | ... | Z
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• alphanum = alpha | digit

• id = alpha (alphanum | '_' )*
• int = digit+


• keywords = "let" | "SCANEOF"

• punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'

• Ignore:
  – whitespace (but not newlines)
  – comments ("//" to the end of the line)
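The two "ignore" rules can be sketched as a helper that steps over whitespace and "//" comments but stops at a newline, since '\n' is a punctuation token in this language. The sketch assumes the input is held in a string, and the name skipIgnored is hypothetical; it is not a function from exprTokens.c.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the index of the next char in s, at or after pos, that the
   scanner should not ignore. Whitespace and "//" comments are
   skipped; a newline is never skipped, because it is a token. */
size_t skipIgnored(const char *s, size_t pos)
{
    for (;;) {
        if (s[pos] != '\n' && isspace((unsigned char)s[pos])) {
            pos++;                                /* whitespace */
        } else if (s[pos] == '/' && s[pos + 1] == '/') {
            while (s[pos] != '\0' && s[pos] != '\n')
                pos++;                            /* rest of the comment */
        } else {
            return pos;
        }
    }
}
```

A single '/' is left alone so that it can still be scanned as the division operator.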


5.2. From REs to Tokens

• Using the REs as a guide, we create tokens and token values. How?

• In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.


Tokens and Token Values

Token       Token Value
ID          "var" and the id string
INT         "num" and the value
LPAREN      '('
RPAREN      ')'
PLUSOP      '+'
MINUSOP     '-'
MULTOP      '*'
DIVOP       '/'
ASSIGNOP    '='
NEWLINE     '\n'
LET         "let"
SCANEOF     eof character


6. exprTokens.c

• exprTokens.c is a lexical analyzer for the expressions language.

• It reads in an expressions program on stdin, and prints out the tokens (and their values).


6.1. Usage

> gcc -Wall -o exprTokens exprTokens.c

(or a Windows C compiler: lcc-win32, http://www.cs.virginia.edu/~lcc-win32/)

> ./exprTokens < test2.txt
 1:
 2:
 3:
 4: 'let' var(x56) '=' num(2)
 5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
 6:
 7: num(5) '*' '(' num(67) '/' num(3) ')'
 8: 'eof'
>


6.2. Code

// constants for tokens and their values

#define NUMKEYS 2

typedef enum token_types {
  LET, ID, INT, LPAREN, RPAREN, NEWLINE, ASSIGNOP,
  PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
} Token;

char *tokSyms[] = {"let", "var", "num", "(", ")", "\n", "=",
                   "+", "-", "*", "/", "eof"};

char *keywords[NUMKEYS] = {"let", "SCANEOF"};
Token keywordToks[NUMKEYS] = {LET, SCANEOF};


Callgraph for exprTokens.c

[Figure: call graph of the functions in exprTokens.c; an arrow means "calls".]


main() and its globals

Token currToken;
int lineNum = 1;   // num lines read in

int main(void)
{
  printf("%2d: ", lineNum);
  do {
    nextToken();
    printToken();
  } while (currToken != SCANEOF);

  return 0;
}


Printing the Tokens

#define MAX_IDLEN 30

char tokString[MAX_IDLEN];
int currTokValue;   // used when token is an integer

void printToken(void)
{
  if (currToken == ID)             // an ID, variable name
    printf("%s(%s) ", tokSyms[currToken], tokString);
  else if (currToken == INT)       // a number
    printf("%s(%d) ", tokSyms[currToken], currTokValue);   // show value
  else if (currToken == NEWLINE)
    printf("%s%2d: ", tokSyms[currToken], lineNum);   // print newline token
  else
    printf("'%s' ", tokSyms[currToken]);   // other toks
} // end of printToken()


Getting a Token

void nextToken(void)
{
  currToken = scanner();
}


scanner() Overview

Token scanner(void)
// converts chars into a token
{
  int inCh;
  clearTokStr();

  if (feof(stdin))
    return SCANEOF;

  while ((inCh = getchar()) != EOF) {   /* EOF is ^D */
    if (inCh == '\n') {
      lineNum++;
      return NEWLINE;
    }
    else if (isspace(inCh))   // do nothing
      continue;
    else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
      // read in chars to make id token
      // return ID or keyword
    }
    else if (isdigit(inCh)) {   // INT = DIGIT+
      // read in chars to make int token
      // change token to int
      return INT;
    }
    else if (inCh == '(')       // punctuation starts here
      return LPAREN;
    else if ...                 // more tests of inCh ...
    else if (inCh == '=')
      return ASSIGNOP;
    else
      lexicalErr(inCh);
  }
  return SCANEOF;
} // end of scanner()


Processing an ID

(in scanner())

  :
else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
  extendTokStr(inCh);
  for (inCh = getchar();
       (isalnum(inCh) || inCh == '_');
       inCh = getchar())
    extendTokStr(inCh);
  ungetc(inCh, stdin);
  return checkKeyword();
}
  :


Token String Functions

int tokStrLen = 0;   // length of tokString (declaration assumed; not shown on the slides)

void clearTokStr(void)
// reset the token string to be empty
{
  tokString[0] = '\0';
  tokStrLen = 0;
} // end of clearTokStr()

void extendTokStr(char ch)
// add ch to the end of the token string
{
  if (tokStrLen == (MAX_IDLEN-1))
    printf("Token string too long for %c\n", ch);
  else {
    tokString[tokStrLen] = ch;
    tokStrLen++;
    tokString[tokStrLen] = '\0';   // terminate string
  }
} // end of extendTokStr()


Checking for a Keyword

Token checkKeyword(void)
{
  int i;
  for (i = 0; i < NUMKEYS; i++) {
    if (!strcmp(tokString, keywords[i]))
      return keywordToks[i];
  }
  return ID;
} // end of checkKeyword()


Processing an INT

(in scanner())

  :
else if (isdigit(inCh)) {   // INT = DIGIT+
  extendTokStr(inCh);
  for (inCh = getchar(); isdigit(inCh); inCh = getchar())
    extendTokStr(inCh);
  ungetc(inCh, stdin);
  currTokValue = atoi(tokString);   // token --> int
  return INT;
}
  :


Reporting an Error

void lexicalErr(char ch)
{
  printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
  exit(1);
}

No recovery attempted.


6.3. Some Good News

• Most programming languages use very similar lexical analyzers
  – e.g. the same kind of IDs, INTs, punctuation, and keywords

• Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.


7. From REs to Code Automatically

1. Write the REs for the language.

2. Convert to Non-deterministic Finite Automata (NFA).

3. Convert to Deterministic Finite Automata (DFA).

4. Convert to a table that can be 'plugged' into an 'empty' lexical analyzer.

• There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
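To give a feel for stage 4, here is a hand-built sketch of a transition table for the id and int REs from section 5.1, driven by a tiny "empty" analyzer loop. It is not the output of lex or any other tool; the state names, charClass() and runDfa() are hypothetical, and the table was written by hand for illustration.

```c
#include <ctype.h>

enum { S_START, S_ID, S_INT, S_ERR, NUM_STATES };            /* DFA states */
enum { C_ALPHA, C_DIGIT, C_UNDER, C_OTHER, NUM_CLASSES };    /* char classes */

/* nextState[state][class]: the transition table. */
static const int nextState[NUM_STATES][NUM_CLASSES] = {
    /*            alpha  digit  '_'    other */
    /* S_START */ { S_ID,  S_INT, S_ERR, S_ERR },
    /* S_ID    */ { S_ID,  S_ID,  S_ID,  S_ERR },
    /* S_INT   */ { S_ERR, S_INT, S_ERR, S_ERR },
    /* S_ERR   */ { S_ERR, S_ERR, S_ERR, S_ERR },
};

/* Map a char to its column in the table. */
static int charClass(int ch)
{
    if (isalpha(ch)) return C_ALPHA;
    if (isdigit(ch)) return C_DIGIT;
    if (ch == '_')   return C_UNDER;
    return C_OTHER;
}

/* Run the DFA over the whole string; returns the final state.
   S_ID and S_INT are accepting, anything else is a rejection. */
int runDfa(const char *s)
{
    int state = S_START;
    for (; *s != '\0'; s++)
        state = nextState[state][charClass((unsigned char)*s)];
    return state;
}
```

The loop in runDfa() knows nothing about ids or ints. All of the language-specific knowledge sits in the table, which is why swapping in a different table gives an analyzer for different REs.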