TRANSCRIPT
241-437 Compilers: lex analysis/2 1
Compiler Structures
• Objective
  – what is lexical analysis?
  – look at a lexical analyzer for a simple 'expressions' language
241-437, Semester 1, 2011-2012
2. Lexical Analysis
Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. Regular Expressions (REs)
5. The Expressions Language
6. exprTokens.c
7. From REs to Code Automatically
In this lecture
[Diagram: the compiler pipeline. Front end: Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Int. Code Generator → Intermediate Code. Back end: Code Optimizer → Target Code Generator → Target Lang. Prog. This lecture covers the Lexical Analyzer.]
1. Why Lexical Analysis?
• Stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants)
• Simplifies the design of the rest of the compiler
  – the code uses tokens, not strings or characters
• Can be implemented efficiently
  – by hand or automatically
• Improves portability
  – non-standard symbols / foreign characters are translated here, so do not affect the rest of the compiler
2. Using a Lexical Analyzer
[Diagram: the Syntax Analyzer (using tokens) asks the Lexical Analyzer (using chars) to (1) get the next token; the Lexical Analyzer (2) gets chars from the Source Program to make a token, and (3) returns the token and token value. The Lexical Analyzer reports lexical errors; the Syntax Analyzer reports syntax errors.]
A Source Program is Chars
Consider the program fragment:
if (i==j);
    z=1;
else;
    z=0;
endif;
The lexical analyzer reads it in as a string of characters:
if_(i==j);\n\tz=1;\nelse;\n\tz=0;\nendif;
Lexical analysis divides the string into tokens.
Tokens and Token Values
[Diagram: the Syntax Analyzer gets tokens (one at a time) from the Lexical Analyzer, which gets chars from the input. For the input string:

    "y = 31 + 28*foo"

the Lexical Analyzer produces the token stream:

    <id, "y"> <=, > <int, 31> <+, > <int, 28> <*, > <id, "foo">

Each token may carry a token value (e.g. the id string, the int's value).]
Tokens, Lexemes, and Patterns
• A token is a lexical type
  – e.g. id, int
• A lexeme is a token value
  – e.g. "abc", 123
• A pattern says how to make a token from chars
  – e.g. id = letter followed by letters and digits
         int = non-empty sequence of digits
  – a pattern is defined using regular expressions (REs)
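In C, a token together with its lexeme is often represented as a tagged struct. A minimal sketch (TokenRec and its fields are invented names here; the lecture's exprTokens.c instead uses the globals currToken, tokString, and currTokValue):

```c
#include <string.h>

/* A token paired with its value -- sketch only. */
typedef enum { TOK_ID, TOK_INT } TokenType;

typedef struct {
    TokenType type;          /* the token (lexical type)     */
    union {
        char name[31];       /* the lexeme, for TOK_ID       */
        int  value;          /* the value, for TOK_INT       */
    } val;
} TokenRec;
```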
3. Implementing a Lexical Analyzer
Issues:
• Lookahead
  – how to group chars into tokens
• Ignoring whitespace and comments
• Separating variables from keywords
  – e.g. "if", "else"
• (Automatically) translating REs into a lexical analyzer
Lookahead
• A token is created by reading in characters, and grouping them together.
• It is not always possible to decide if a token is finished without looking ahead at the next char.
• For example:
  – Is "i" a variable, or the first character of "if"?
  – Is "=" an assignment or the beginning of "=="?
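The standard solution is one character of lookahead: read the next char, and push it back with ungetc() if it does not extend the token. A minimal sketch for the '=' vs "==" case (readEqualsToken is an invented name; it is called after the first '=' has been consumed):

```c
#include <stdio.h>

/* Return 1 if the next char makes "==", else push it back
   and return 0 (plain '='). Sketch only.                  */
int readEqualsToken(FILE *in)
{
    int next = fgetc(in);
    if (next == '=')
        return 1;           /* matched "==" */
    ungetc(next, in);       /* not ours: one char of pushback */
    return 0;               /* matched plain '=' */
}
```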
4. Regular Expressions (REs)
• REs are an algebraic way of specifying how to recognise input
  – 'algebraic' means that the recognition pattern is defined using RE operands and operators

[Side note: covered in more detail in 240-304 "maths for CoE"]
4.1. REs in grep
• grep searches input lines, a line at a time.
• If the line contains a string that matches grep's RE (pattern), then the line is output.

[Diagram: input lines (e.g. from a file) → grep "RE" → output matching lines (e.g. to a file). Example input: hello andy / my name is andy / my bye byhe]
continued
Examples
grep "and"
  input:  hello andy / my name is andy / my bye byhe
  output: hello andy / my name is andy

grep -E "an|my"          ("|" means "or")
  input:  hello andy / my name is andy / my bye byhe
  output: hello andy / my name is andy / my bye byhe

continued
grep "hel*"              ("*" means "0 or more")
  input:  hello andy / my name is andy / my bye byhe
  output: hello andy / my bye byhe
4.2. The RE Language
• A RE defines a pattern which recognises (matches) a set of strings
  – e.g. a RE can be defined that recognises the strings { aa, aba, abba, abbba, abbbba, … }
• These recognisable strings are sometimes called the RE's language.
RE Operands
• There are 4 basic kinds of operands:
  – characters (e.g. 'a', '1', '(')
  – the symbol ε (means the empty string "")
  – the symbol {} (means the empty set)
  – variables, which can be assigned a RE
    • variable = RE
RE Operators
• There are three basic operators:
  – union '|'
  – concatenation
  – closure '*'
Union
• S | T
  – this RE can use the S or T RE to match strings
• Example REs:
  a | b      matches the strings {a, b}
  a | b | c  matches the strings {a, b, c}
Concatenation
• S T
  – this RE will use the S RE followed by the T RE to match against strings
• Example REs:
  a b        matches the string {ab}
  w | (a b)  matches the strings {w, ab}
• What strings are matched by the RE (a | ab)(c | bc)?
• Equivalent to: {a, ab} followed by {c, bc}
  => {ac, abc, abc, abbc}
  => {ac, abc, abbc}      (duplicates removed)
Closure
• S*
  – this RE can use the S RE 0 or more times to match against strings
• Example RE:
  a* matches the strings: {ε, a, aa, aaa, aaaa, aaaaa, ... }
  (ε is the empty string)
4.3. REs for C Identifiers
• We define two RE variables, letter and digit:
  letter = A | B | C | D ... Z | a | b | c | d ... z
  digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• id is defined using letter and digit:
  id = letter ( letter | digit )*
continued
• Strings matched by id include: ab345, w, h5g
• Strings not matched: 2, $abc, ****
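The id pattern letter (letter | digit)* can be checked directly with the ctype functions. A sketch (matchesId is an invented name, not from the lecture code):

```c
#include <ctype.h>

/* Return 1 if s matches id = letter (letter | digit)*, else 0. */
int matchesId(const char *s)
{
    if (!isalpha((unsigned char)*s))      /* must start with a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))  /* rest: letters or digits  */
            return 0;
    return 1;
}
```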
4.4. RE Summary

Expression   Meaning
ε            Empty pattern
a            Any pattern represented by 'a'
ab           Strings with pattern 'a' followed by 'b'
a|b          Strings consisting of pattern 'a' or 'b'
a*           Zero or more occurrences of patterns in 'a'
a+           One or more occurrences of patterns in 'a'
a3           Patterns in 'a' repeated exactly 3 times
a?           (a | ε); optional single pattern from 'a'
.            Any single character
More Operators
• See the regular expressions "cheat-sheet" at the course website in the "Useful Info" subdirectory:
  – over 80 operators!!
Wild Card Symbol: '.'
• The '.' stands for any character except the newline
  – e.g. grep 'a..b.$' chapter1.txt
  – e.g. grep 't.*t.*t' /usr/share/dict/words
    (the UNIX/Linux 'dictionary')
grep "a..b." /usr/share/dict/words

Matching words include:
  adobe
  alibi
  ameba
4.5. REs for Integers and Floats
• We redefine digit:
  digit = 0|1|2|3|4|5|6|7|8|9
  or digit = [0-9]
• int and float:
  int = {digit}+
  float = {digit}+ "." {digit}+
• Integers and floats with exponents:
  number = {digit}+ ('.' {digit}+)? ( 'E' ('+'|'-')? {digit}+ )?
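The number RE can be transcribed almost directly into C: one helper per {digit}+ group, with the '?' parts as plain if-statements. A sketch (matchesNumber and digits are invented names; the lecture's scanner handles only ints):

```c
#include <ctype.h>

/* Consume digit+; return pointer past the digits, or NULL if none. */
static const char *digits(const char *s)
{
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) s++;
    return s;
}

/* number = digit+ ('.' digit+)? ('E' ('+'|'-')? digit+)?  */
int matchesNumber(const char *s)
{
    if ((s = digits(s)) == 0) return 0;       /* digit+            */
    if (*s == '.') {                          /* optional fraction */
        if ((s = digits(s + 1)) == 0) return 0;
    }
    if (*s == 'E') {                          /* optional exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if ((s = digits(s)) == 0) return 0;
    }
    return *s == '\0';                        /* must use whole string */
}
```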
4.6 More on REs
See RE summary on the course website:
  regular_expressions_cheat_sheet.pdf

I have the standard RE book:
– Mastering Regular Expressions
  Jeffrey E. F. Friedl
  O'Reilly & Associates
continued
There are many websites that explain REs:
http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html
http://www.zytrax.com/tech/web/regex.htm
http://www.regular-expressions.info
5. The Expressions Language
• In my expressions language, a program is a series of expressions and assignments.
• Example:
// test2.txt example
let x56 = 2
let bing_BONG = (27 * 2) - x56
5 * (67 / 3)
5.1. REs for the Language
• alpha = a | b | c | ... | z | A | B | ... | Z
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• alphanum = alpha | digit
• id = alpha (alphanum | '_')*
• int = digit+
• keywords = "let" | "SCANEOF"
• punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
• Ignore:
  – whitespace (but not newlines)
  – comments ("//" to the end of the line)
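Skipping a "//" comment is a one-loop job: discard characters up to the newline, but leave the newline itself for the scanner, since NEWLINE is a token in this language. A sketch (skipLineComment is an invented name, called once "//" has been recognised):

```c
#include <stdio.h>

/* Discard the rest of a "//" comment, but push the newline
   back so the scanner can still return a NEWLINE token.    */
void skipLineComment(FILE *in)
{
    int ch;
    while ((ch = fgetc(in)) != EOF && ch != '\n')
        ;                        /* discard comment chars */
    if (ch == '\n')
        ungetc(ch, in);          /* newline is not part of the comment */
}
```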
5.2. From REs to Tokens
• Using the REs as a guide, we create tokens and token values. How?
• In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.
Tokens and Token Values
Token      Token Value
ID         "var" and the id string
INT        "num" and the value
LPAREN     '('
RPAREN     ')'
PLUSOP     '+'
MINUSOP    '-'
MULTOP     '*'
DIVOP      '/'
ASSIGNOP   '='
NEWLINE    '\n'
LET        "let"
SCANEOF    eof character
6. exprTokens.c
• exprTokens.c is a lexical analyzer for the expressions language.
• It reads in an expressions program on stdin, and prints out the tokens (and their values).
6.1. Usage
> gcc -Wall -o exprTokens exprTokens.c
> ./exprTokens < test2.txt
 1:
 2:
 3:
 4: 'let' var(x56) '=' num(2)
 5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
 6:
 7: num(5) '*' '(' num(67) '/' num(3) ')'
 8: 'eof'
>
or a Windows C compiler: lcc-win32, http://www.cs.virginia.edu/~lcc-win32/
6.2. Code

// constants for tokens and their values
#define NUMKEYS 2

typedef enum token_types { LET, ID, INT, LPAREN, RPAREN, NEWLINE,
    ASSIGNOP, PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF } Token;

char *tokSyms[] = {"let", "var", "num", "(", ")", "\n", "=",
                   "+", "-", "*", "/", "eof"};

char *keywords[NUMKEYS] = {"let", "SCANEOF"};
Token keywordToks[NUMKEYS] = {LET, SCANEOF};
Call graph for exprTokens.c

[Diagram: main() calls nextToken() and printToken(); nextToken() calls scanner(); scanner() calls clearTokStr(), extendTokStr(), checkKeyword(), and lexicalErr().]
main() and its globals
Token currToken;
int lineNum = 1;   // num lines read in

int main(void)
{
    printf("%2d: ", lineNum);
    do {
        nextToken();
        printToken();
    } while (currToken != SCANEOF);
    return 0;
}
Printing the Tokens

#define MAX_IDLEN 30
char tokString[MAX_IDLEN];
int currTokValue;   // used when token is an integer

void printToken(void)
{
    if (currToken == ID)          // an ID, variable name
        printf("%s(%s) ", tokSyms[currToken], tokString);
    else if (currToken == INT)    // a number
        printf("%s(%d) ", tokSyms[currToken], currTokValue);   // show value
    else if (currToken == NEWLINE)
        printf("%s%2d: ", tokSyms[currToken], lineNum);   // print newline token
    else
        printf("'%s' ", tokSyms[currToken]);              // other toks
}  // end of printToken()
Getting a Token
void nextToken(void)
{ currToken = scanner(); }
scanner() Overview

Token scanner(void)   // converts chars into a token
{
    int inCh;
    clearTokStr();
    if (feof(stdin))
        return SCANEOF;
    while ((inCh = getchar()) != EOF) {   /* EOF is ^D */
        if (inCh == '\n') {
            lineNum++;
            return NEWLINE;
        }
        else if (isspace(inCh))   // do nothing
            continue;
        else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
            // read in chars to make id token
            // return ID or keyword
        }
        else if (isdigit(inCh)) {   // INT = DIGIT+
            // read in chars to make int token
            // change token to int
            return INT;
        }
        else if (inCh == '(')       // punctuation
            return LPAREN;
        else if ...                 // more tests of inCh
        else if (inCh == '=')
            return ASSIGNOP;
        else
            lexicalErr(inCh);
    }
    return SCANEOF;
}  // end of scanner()
Processing an ID
else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
    extendTokStr(inCh);
    for (inCh = getchar(); (isalnum(inCh) || inCh == '_');
                           inCh = getchar())
        extendTokStr(inCh);
    ungetc(inCh, stdin);
    return checkKeyword();
}

(in scanner())
Token String Functions

void clearTokStr(void)
// reset the token string to be empty
{
    tokString[0] = '\0';
    tokStrLen = 0;
}  // end of clearTokStr()

void extendTokStr(char ch)
// add ch to the end of the token string
{
    if (tokStrLen == (MAX_IDLEN-1))
        printf("Token string too long for %c\n", ch);
    else {
        tokString[tokStrLen] = ch;
        tokStrLen++;
        tokString[tokStrLen] = '\0';   // terminate string
    }
}  // end of extendTokStr()
Checking for a Keyword
Token checkKeyword(void)
{
int i;
for(i=0; i<NUMKEYS; i++) {
if(!strcmp(tokString, keywords[i]))
return keywordToks[i];
}
return ID;
} // end of checkKeyword()
Processing an INT
else if (isdigit(inCh)) {   // INT = DIGIT+
    extendTokStr(inCh);
    for (inCh = getchar(); isdigit(inCh); inCh = getchar())
        extendTokStr(inCh);
    ungetc(inCh, stdin);
    currTokValue = atoi(tokString);   // token --> int
    return INT;
}

(in scanner())
Reporting an Error
void lexicalErr(char ch)
{
    printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
    exit(1);
}
No recovery attempted.
6.3. Some Good News
• Most programming languages use very similar lexical analyzers
  – e.g. the same kind of IDs, INTs, punctuation, and keywords
• Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.
7. From REs to Code Automatically
1. Write the REs for the language.
2. Convert to Non-deterministic Finite Automata (NFA).
3. Convert to Deterministic Finite Automata (DFA)
4. Convert to a table that can be 'plugged' into an 'empty' lexical analyzer.
• There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
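The idea behind stage 4 can be shown with a tiny hand-built example: a DFA for int = digit+ encoded as a transition table, walked by a generic driver loop. All names here are illustrative; tools like lex generate much larger tables in the same spirit:

```c
#include <ctype.h>

/* States: START (nothing read), IN_INT (seen digit+, accepting),
   DEAD (no possible match). Input classes: 0 = digit, 1 = other. */
enum { START, IN_INT, DEAD };

static const int delta[3][2] = {
    /* START  */ { IN_INT, DEAD },
    /* IN_INT */ { IN_INT, DEAD },
    /* DEAD   */ { DEAD,   DEAD },
};

/* Generic driver: walk the table over s; accept iff we end in IN_INT. */
int acceptsInt(const char *s)
{
    int state = START;
    for (; *s != '\0'; s++) {
        int cls = isdigit((unsigned char)*s) ? 0 : 1;
        state = delta[state][cls];
    }
    return state == IN_INT;
}
```

To recognise a different token set, only the table (and the accepting states) changes; the driver loop stays the same, which is why the table can be 'plugged' into an 'empty' analyzer.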