TRANSCRIPT
241-437 Compilers: lex analysis/2 1
Compiler Structures
• Objective
  – what is lexical analysis?
  – look at a lexical analyzer for a simple 'expressions' language
241-437, Semester 1, 2011-2012
2. Lexical Analysis
Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. Regular Expressions (REs)
5. The Expressions Language
6. exprTokens.c
7. From REs to Code Automatically
In this lecture
[Diagram: the compiler pipeline. Front end: Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Int. Code Generator → Intermediate Code. Back end: Code Optimizer → Target Code Generator → Target Lang. Prog. This lecture covers the Lexical Analyzer.]
1. Why Lexical Analysis?
• Stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants)
• Simplifies the design of the rest of the compiler
  – the code uses tokens, not strings or characters
• Can be implemented efficiently
  – by hand or automatically
• Improves portability
  – non-standard symbols / foreign characters are translated here, so do not affect the rest of the compiler
2. Using a Lexical Analyzer
[Diagram: the Syntax Analyzer (using tokens) asks the Lexical Analyzer (using chars) to (1) get the next token; the Lexical Analyzer (2) gets chars from the Source Program to make a token, and (3) returns the token and token value. The Lexical Analyzer reports lexical errors; the Syntax Analyzer reports syntax errors.]
A Source Program is Chars
Consider the program fragment:
if (i==j);
    z=1;
else;
    z=0;
endif;
The lexical analyzer reads it in as a string of characters:
if_(i==j);\n\tz=1;\nelse;\n\tz=0;\nendif;
Lexical analysis divides the string into tokens.
Tokens and Token Values
[Diagram: the Syntax Analyzer gets tokens (one at a time) from the Lexical Analyzer, which gets chars from the input. For the input string:

    "y = 31 + 28*foo"

the Lexical Analyzer produces the token stream:

    <id, "y"> <=, > <int, 31> <+, > <int, 28> <*, > <id, "foo">

Each token may carry a token value (e.g. the id string, the int's value).]
Tokens, Lexemes, and Patterns
• A token is a lexical type
  – e.g. id, int
• A lexeme is a token value
  – e.g. "abc", 123
• A pattern says how to make a token from chars
  – e.g. id = letter followed by letters and digits
         int = non-empty sequence of digits
  – a pattern is defined using regular expressions (REs)
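In C, a token together with its lexeme is often represented as a tagged struct. A minimal sketch (TokenRec and its fields are invented names here; the lecture's exprTokens.c instead uses the globals currToken, tokString, and currTokValue):

```c
#include <string.h>

/* A token paired with its value -- sketch only. */
typedef enum { TOK_ID, TOK_INT } TokenType;

typedef struct {
    TokenType type;          /* the token (lexical type)     */
    union {
        char name[31];       /* the lexeme, for TOK_ID       */
        int  value;          /* the value, for TOK_INT       */
    } val;
} TokenRec;
```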
3. Implementing a Lexical Analyzer
Issues:
• Lookahead
  – how to group chars into tokens
• Ignoring whitespace and comments
• Separating variables from keywords
  – e.g. "if", "else"
• (Automatically) translating REs into a lexical analyzer
Lookahead
• A token is created by reading in characters, and grouping them together.
• It is not always possible to decide if a token is finished without looking ahead at the next char.
• For example:
  – Is "i" a variable, or the first character of "if"?
  – Is "=" an assignment or the beginning of "=="?
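The standard solution is one character of lookahead: read the next char, and push it back with ungetc() if it does not extend the token. A minimal sketch for the '=' vs "==" case (readEqualsToken is an invented name; it is called after the first '=' has been consumed):

```c
#include <stdio.h>

/* Return 1 if the next char makes "==", else push it back
   and return 0 (plain '='). Sketch only.                  */
int readEqualsToken(FILE *in)
{
    int next = fgetc(in);
    if (next == '=')
        return 1;           /* matched "==" */
    ungetc(next, in);       /* not ours: one char of pushback */
    return 0;               /* matched plain '=' */
}
```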
4. Regular Expressions (REs)
• REs are an algebraic way of specifying how to recognise input
  – 'algebraic' means that the recognition pattern is defined using RE operands and operators

[Side note: covered in more detail in 240-304 "maths for CoE"]
4.1. REs in grep
• grep searches input lines, a line at a time.
• If the line contains a string that matches grep's RE (pattern), then the line is output.

[Diagram: input lines (e.g. from a file) → grep "RE" → output matching lines (e.g. to a file). Example input: hello andy / my name is andy / my bye byhe]
continued
Examples
grep "and"
  input:  hello andy / my name is andy / my bye byhe
  output: hello andy / my name is andy

grep -E "an|my"          ("|" means "or")
  input:  hello andy / my name is andy / my bye byhe
  output: hello andy / my name is andy / my bye byhe

continued
grep "hel*"              ("*" means "0 or more")
  input:  hello andy / my name is andy / my bye byhe
  output: hello andy / my bye byhe
4.2. The RE Language
• A RE defines a pattern which recognises (matches) a set of strings
  – e.g. a RE can be defined that recognises the strings { aa, aba, abba, abbba, abbbba, … }
• These recognisable strings are sometimes called the RE's language.
RE Operands
• There are 4 basic kinds of operands:
  – characters (e.g. 'a', '1', '(')
  – the symbol ε (means the empty string "")
  – the symbol {} (means the empty set)
  – variables, which can be assigned a RE
    • variable = RE
RE Operators
• There are three basic operators:
  – union '|'
  – concatenation
  – closure '*'
Union
• S | T
  – this RE can use the S or T RE to match strings
• Example REs:
  a | b      matches the strings {a, b}
  a | b | c  matches the strings {a, b, c}
Concatenation
• S T
  – this RE will use the S RE followed by the T RE to match against strings
• Example REs:
  a b        matches the string {ab}
  w | (a b)  matches the strings {w, ab}
• What strings are matched by the RE (a | ab)(c | bc)?
• Equivalent to: {a, ab} followed by {c, bc}
  => {ac, abc, abc, abbc}
  => {ac, abc, abbc}      (duplicates removed)
Closure
• S*
  – this RE can use the S RE 0 or more times to match against strings
• Example RE:
  a* matches the strings: {ε, a, aa, aaa, aaaa, aaaaa, ... }
  (ε is the empty string)
4.3. REs for C Identifiers
• We define two RE variables, letter and digit:
  letter = A | B | C | D ... Z | a | b | c | d ... z
  digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• id is defined using letter and digit:
  id = letter ( letter | digit )*
continued
• Strings matched by id include: ab345, w, h5g
• Strings not matched: 2, $abc, ****
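The id pattern letter (letter | digit)* can be checked directly with the ctype functions. A sketch (matchesId is an invented name, not from the lecture code):

```c
#include <ctype.h>

/* Return 1 if s matches id = letter (letter | digit)*, else 0. */
int matchesId(const char *s)
{
    if (!isalpha((unsigned char)*s))      /* must start with a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))  /* rest: letters or digits  */
            return 0;
    return 1;
}
```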
4.4. RE Summary

Expression   Meaning
ε            Empty pattern
a            Any pattern represented by 'a'
ab           Strings with pattern 'a' followed by 'b'
a|b          Strings consisting of pattern 'a' or 'b'
a*           Zero or more occurrences of patterns in 'a'
a+           One or more occurrences of patterns in 'a'
a3           Patterns in 'a' repeated exactly 3 times
a?           (a | ε); optional single pattern from 'a'
.            Any single character
More Operators
• See the regular expressions "cheat-sheet" at the course website in the "Useful Info" subdirectory:
  – over 80 operators!!
Wild Card Symbol: '.'
• The '.' stands for any character except the newline
  – e.g. grep 'a..b.$' chapter1.txt
  – e.g. grep 't.*t.*t' /usr/share/dict/words
    (the UNIX/Linux 'dictionary')
grep "a..b." /usr/share/dict/words

Matching words include:
  adobe
  alibi
  ameba
4.5. REs for Integers and Floats
• We redefine digit:
  digit = 0|1|2|3|4|5|6|7|8|9
  or digit = [0-9]
• int and float:
  int = {digit}+
  float = {digit}+ "." {digit}+
• Integers and floats with exponents:
  number = {digit}+ ('.' {digit}+)? ( 'E' ('+'|'-')? {digit}+ )?
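The number RE can be transcribed almost directly into C: one helper per {digit}+ group, with the '?' parts as plain if-statements. A sketch (matchesNumber and digits are invented names; the lecture's scanner handles only ints):

```c
#include <ctype.h>

/* Consume digit+; return pointer past the digits, or NULL if none. */
static const char *digits(const char *s)
{
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) s++;
    return s;
}

/* number = digit+ ('.' digit+)? ('E' ('+'|'-')? digit+)?  */
int matchesNumber(const char *s)
{
    if ((s = digits(s)) == 0) return 0;       /* digit+            */
    if (*s == '.') {                          /* optional fraction */
        if ((s = digits(s + 1)) == 0) return 0;
    }
    if (*s == 'E') {                          /* optional exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if ((s = digits(s)) == 0) return 0;
    }
    return *s == '\0';                        /* must use whole string */
}
```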
4.6 More on REs
See RE summary on the course website:
  regular_expressions_cheat_sheet.pdf

I have the standard RE book:
– Mastering Regular Expressions
  Jeffrey E. F. Friedl
  O'Reilly & Associates
continued
There are many websites that explain REs:
http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html
http://www.zytrax.com/tech/web/regex.htm
http://www.regular-expressions.info
5. The Expressions Language
• In my expressions language, a program is a series of expressions and assignments.
• Example:
// test2.txt example
let x56 = 2
let bing_BONG = (27 * 2) - x56
5 * (67 / 3)
5.1. REs for the Language
• alpha = a | b | c | ... | z | A | B | ... | Z
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• alphanum = alpha | digit
• id = alpha (alphanum | '_')*
• int = digit+
• keywords = "let" | "SCANEOF"
• punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
• Ignore:
  – whitespace (but not newlines)
  – comments ("//" to the end of the line)
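Skipping a "//" comment is a one-loop job: discard characters up to the newline, but leave the newline itself for the scanner, since NEWLINE is a token in this language. A sketch (skipLineComment is an invented name, called once "//" has been recognised):

```c
#include <stdio.h>

/* Discard the rest of a "//" comment, but push the newline
   back so the scanner can still return a NEWLINE token.    */
void skipLineComment(FILE *in)
{
    int ch;
    while ((ch = fgetc(in)) != EOF && ch != '\n')
        ;                        /* discard comment chars */
    if (ch == '\n')
        ungetc(ch, in);          /* newline is not part of the comment */
}
```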
5.2. From REs to Tokens
• Using the REs as a guide, we create tokens and token values. How?
• In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.
Tokens and Token Values
Token      Token Value
ID         "var" and the id string
INT        "num" and the value
LPAREN     '('
RPAREN     ')'
PLUSOP     '+'
MINUSOP    '-'
MULTOP     '*'
DIVOP      '/'
ASSIGNOP   '='
NEWLINE    '\n'
LET        "let"
SCANEOF    eof character
6. exprTokens.c
• exprTokens.c is a lexical analyzer for the expressions language.
• It reads in an expressions program on stdin, and prints out the tokens (and their values).
6.1. Usage
> gcc -Wall -o exprTokens exprTokens.c
> ./exprTokens < test2.txt
 1:
 2:
 3:
 4: 'let' var(x56) '=' num(2)
 5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
 6:
 7: num(5) '*' '(' num(67) '/' num(3) ')'
 8: 'eof'
>
or a Windows C compiler: lcc-win32, http://www.cs.virginia.edu/~lcc-win32/
6.2. Code

// constants for tokens and their values
#define NUMKEYS 2

typedef enum token_types { LET, ID, INT, LPAREN, RPAREN, NEWLINE,
    ASSIGNOP, PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF } Token;

char *tokSyms[] = {"let", "var", "num", "(", ")", "\n", "=",
                   "+", "-", "*", "/", "eof"};

char *keywords[NUMKEYS] = {"let", "SCANEOF"};
Token keywordToks[NUMKEYS] = {LET, SCANEOF};
Call graph for exprTokens.c

[Diagram: main() calls nextToken() and printToken(); nextToken() calls scanner(); scanner() calls clearTokStr(), extendTokStr(), checkKeyword(), and lexicalErr().]
main() and its globals
Token currToken;
int lineNum = 1;   // num lines read in

int main(void)
{
    printf("%2d: ", lineNum);
    do {
        nextToken();
        printToken();
    } while (currToken != SCANEOF);
    return 0;
}
Printing the Tokens

#define MAX_IDLEN 30
char tokString[MAX_IDLEN];
int currTokValue;   // used when token is an integer

void printToken(void)
{
    if (currToken == ID)          // an ID, variable name
        printf("%s(%s) ", tokSyms[currToken], tokString);
    else if (currToken == INT)    // a number
        printf("%s(%d) ", tokSyms[currToken], currTokValue);   // show value
    else if (currToken == NEWLINE)
        printf("%s%2d: ", tokSyms[currToken], lineNum);   // print newline token
    else
        printf("'%s' ", tokSyms[currToken]);              // other toks
}  // end of printToken()
Getting a Token
void nextToken(void)
{ currToken = scanner(); }
scanner() Overview

Token scanner(void)   // converts chars into a token
{
    int inCh;
    clearTokStr();
    if (feof(stdin))
        return SCANEOF;
    while ((inCh = getchar()) != EOF) {   /* EOF is ^D */
        if (inCh == '\n') {
            lineNum++;
            return NEWLINE;
        }
        else if (isspace(inCh))   // do nothing
            continue;
        else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
            // read in chars to make id token
            // return ID or keyword
        }
        else if (isdigit(inCh)) {   // INT = DIGIT+
            // read in chars to make int token
            // change token to int
            return INT;
        }
        else if (inCh == '(')       // punctuation
            return LPAREN;
        else if ...                 // more tests of inCh
        else if (inCh == '=')
            return ASSIGNOP;
        else
            lexicalErr(inCh);
    }
    return SCANEOF;
}  // end of scanner()
Processing an ID
else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
    extendTokStr(inCh);
    for (inCh = getchar(); (isalnum(inCh) || inCh == '_');
                           inCh = getchar())
        extendTokStr(inCh);
    ungetc(inCh, stdin);
    return checkKeyword();
}

(in scanner())
Token String Functions

void clearTokStr(void)
// reset the token string to be empty
{
    tokString[0] = '\0';
    tokStrLen = 0;
}  // end of clearTokStr()

void extendTokStr(char ch)
// add ch to the end of the token string
{
    if (tokStrLen == (MAX_IDLEN-1))
        printf("Token string too long for %c\n", ch);
    else {
        tokString[tokStrLen] = ch;
        tokStrLen++;
        tokString[tokStrLen] = '\0';   // terminate string
    }
}  // end of extendTokStr()
Checking for a Keyword
Token checkKeyword(void)
{
int i;
for(i=0; i<NUMKEYS; i++) {
if(!strcmp(tokString, keywords[i]))
return keywordToks[i];
}
return ID;
} // end of checkKeyword()
Processing an INT
else if (isdigit(inCh)) {   // INT = DIGIT+
    extendTokStr(inCh);
    for (inCh = getchar(); isdigit(inCh); inCh = getchar())
        extendTokStr(inCh);
    ungetc(inCh, stdin);
    currTokValue = atoi(tokString);   // token --> int
    return INT;
}

(in scanner())
Reporting an Error
void lexicalErr(char ch)
{
    printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
    exit(1);
}
No recovery attempted.
6.3. Some Good News
• Most programming languages use very similar lexical analyzers
  – e.g. the same kind of IDs, INTs, punctuation, and keywords
• Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.
7. From REs to Code Automatically
1. Write the REs for the language.
2. Convert to Non-deterministic Finite Automata (NFA).
3. Convert to Deterministic Finite Automata (DFA)
4. Convert to a table that can be 'plugged' into an 'empty' lexical analyzer.
• There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
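The idea behind stage 4 can be shown with a tiny hand-built example: a DFA for int = digit+ encoded as a transition table, walked by a generic driver loop. All names here are illustrative; tools like lex generate much larger tables in the same spirit:

```c
#include <ctype.h>

/* States: START (nothing read), IN_INT (seen digit+, accepting),
   DEAD (no possible match). Input classes: 0 = digit, 1 = other. */
enum { START, IN_INT, DEAD };

static const int delta[3][2] = {
    /* START  */ { IN_INT, DEAD },
    /* IN_INT */ { IN_INT, DEAD },
    /* DEAD   */ { DEAD,   DEAD },
};

/* Generic driver: walk the table over s; accept iff we end in IN_INT. */
int acceptsInt(const char *s)
{
    int state = START;
    for (; *s != '\0'; s++) {
        int cls = isdigit((unsigned char)*s) ? 0 : 1;
        state = delta[state][cls];
    }
    return state == IN_INT;
}
```

To recognise a different token set, only the table (and the accepting states) changes; the driver loop stays the same, which is why the table can be 'plugged' into an 'empty' analyzer.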