COS 320 Compilers
David Walker
Outline
• Last Week
  – Introduction to ML
• Today
  – Lexical Analysis
  – Reading: Chapter 2 of Appel
The Front End
• Lexical Analysis: Create sequence of tokens from characters
• Syntax Analysis: Create abstract syntax tree from sequence of tokens
• Type Checking: Check program for well-formedness constraints
[Diagram: stream of characters -> Lexer -> stream of tokens -> Parser -> abstract syntax -> Type Checker]
Lexical Analysis
• Lexical Analysis: Breaks stream of ASCII characters (source) into tokens
• Token: An atomic unit of program syntax
  – i.e., a word as opposed to a sentence
• Tokens and their types:

| Type | Characters Recognized | Token |
|---|---|---|
| ID | foo, x, listcount | ID(foo), ID(x), ... |
| REAL | 10.45, 3.14, -2.1 | REAL(10.45), REAL(3.14), ... |
| SEMI | ; | SEMI |
| LPAREN | ( | LPAREN |
| NUM | 50, 100 | NUM(50), NUM(100) |
| IF | if | IF |
Lexical Analysis Example
x = ( y + 4.0 ) ;
Lexical analysis produces:
ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
Lexer Implementation
• Implementation Options:
  1. Write a Lexer from scratch
     – Boring, error-prone and too much work
  2. Use a Lexer Generator
     – Quick and easy. Good for lazy compiler writers.
[Diagram: Lexer Specification -> lexer generator -> Lexer; stream of characters -> Lexer -> stream of tokens]
• How do we specify the lexer?
  – Develop another language
  – We’ll use a language involving regular expressions to specify tokens
• What is a lexer generator?
  – Another compiler ....
Some Definitions
• We will want to define the language of legal tokens our lexer can recognize
  – Alphabet: a collection of symbols (ASCII is an alphabet)
  – String: a finite sequence of symbols taken from our alphabet
  – Language of legal tokens: a set of strings
    • Language of ML keywords: set of all strings which are ML keywords (FINITE)
    • Language of ML tokens: set of all strings which map to ML tokens (INFINITE)
• A language can also be a more general set of strings:
  – e.g., ML Language: set of all strings representing correct ML programs (INFINITE).
Regular Expressions: Construction
• Base Cases:
  – For each symbol a in the alphabet, a is an RE denoting the set {a}
  – Epsilon (ε) denotes {""}, the set containing only the empty string
• Inductive Cases (M and N are REs)
  – Alternation (M | N) denotes strings in M or N
    • (a | b) == {a, b}
  – Concatenation (M N) denotes strings in M concatenated with strings in N
    • (a | b) (a | c) == {aa, ac, ba, bc}
  – Kleene closure (M*) denotes strings formed by any number of repetitions of strings in M
    • (a | b)* == {ε, a, b, aa, ab, ba, bb, ...}
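The three inductive cases can be tried out directly by treating each RE as the set of strings it denotes. This is only a rough Python sketch, not part of the course's ML material: the names `alt`, `concat` and `star` are made up for illustration, and `star` is cut off after `max_reps` repetitions because the real Kleene closure is infinite.

```python
from itertools import product

def alt(m, n):
    """Alternation (M | N): strings in M or in N."""
    return m | n

def concat(m, n):
    """Concatenation (M N): each string of M followed by each string of N."""
    return {x + y for x, y in product(m, n)}

def star(m, max_reps):
    """Kleene closure (M*), truncated to at most max_reps repetitions."""
    result = {""}   # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, m)   # one more repetition
        result |= layer
    return result

a, b, c = {"a"}, {"b"}, {"c"}
print(sorted(concat(alt(a, b), alt(a, c))))
# ['aa', 'ac', 'ba', 'bc']
print(sorted(star(alt(a, b), 2), key=lambda s: (len(s), s)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The empty string `''` in the second result is the ε of the slide.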
Regular Expressions
• Integers begin with an optional minus sign, continue with a sequence of digits
• Regular Expression: (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
  – Note: because of the *, this RE also accepts zero digits (e.g. a lone -); to require at least one digit, repeat the digit alternation once before the starred group
• Writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9), and even worse (a | b | c | ...), gets tedious...
Regular Expressions (REs)
• Common abbreviations:
  – [a-c] == (a | b | c)
  – . == any character except \n
  – \n == newline character
  – a+ == one or more occurrences of a
  – a? == zero or one occurrence of a
• All abbreviations can be defined in terms of the “standard” REs
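A quick way to convince yourself of the last point is to compare each abbreviation with an expansion that uses only alternation, concatenation, star and the empty alternative. The sketch below uses Python's `re` module, whose syntax is close to, but not the same as, ML-Lex's. The `PAIRS` and `SAMPLES` lists are made up for illustration, and agreement on a handful of samples is a spot check, not a proof.

```python
import re

# (abbreviated form, expansion using only |, concatenation, * and an empty alternative)
PAIRS = [
    (r"[a-c]", r"(a|b|c)"),
    (r"a+",    r"aa*"),
    (r"a?",    r"(a|)"),
]

SAMPLES = ["", "a", "b", "c", "d", "aa", "aaa", "ab"]

def same_language_on_samples(short, expanded):
    """True if both patterns accept exactly the same strings from SAMPLES."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(expanded, s))
        for s in SAMPLES
    )

for short, expanded in PAIRS:
    print(short, "vs", expanded, "->", same_language_on_samples(short, expanded))
```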
Ambiguous Token Rule Sets
• A single RE is a completely unambiguous specification of a token.
  – call the association of an RE with a token a “rule”
• To lex an entire programming language, we need many rules
  – but ambiguities arise:
    • multiple REs or sequences of REs match the same string
    • hence many token sequences are possible
Ambiguous Token Rule Sets
• Example:
  – Identifier tokens: [a-z] [a-z0-9]*
  – Sample keyword tokens: if, then, ...
• How do we tokenize:
  – foobar ==> ID(foobar) or ID(foo) ID(bar)
  – if ==> ID(if) or IF
Ambiguous Token Rule Sets
• We resolve ambiguities using two conventions:
  – Longest match: The regular expression that matches the longest string takes precedence.
  – Rule Priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
Ambiguous Token Rule Sets
• Example:
  – Identifier tokens: [a-z] [a-z0-9]*
  – Sample keyword tokens: if, then, ...
• How do we tokenize:
  – foobar ==> ID(foobar) or ID(foo) ID(bar)
    • use longest match to disambiguate
  – if ==> ID(if) or IF
    • keyword rules have higher priority than identifier rule
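The two conventions are easy to prototype. Below is a small Python sketch (not ML-Lex, and not how a generated lexer works internally): `RULES`, `next_token` and `tokenize` are illustrative names, whitespace skipping is an added assumption, and the sketch relies on Python's greedy `match` happening to return the longest match for these particular simple patterns.

```python
import re

# Rules in priority order: keyword rules are written before the identifier rule.
RULES = [
    ("IF",   re.compile(r"if")),
    ("THEN", re.compile(r"then")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
]

def next_token(text, pos):
    """Return (token_name, lexeme) chosen by longest match, then rule priority."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the earlier (higher-priority) rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():       # assumption: whitespace separates tokens
            pos += 1
            continue
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens

print(tokenize("foobar if iffy"))
# [('ID', 'foobar'), ('IF', 'if'), ('ID', 'iffy')]
```

Note that `iffy` becomes a single ID: longest match is applied first, and rule priority only breaks ties between matches of equal length.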
Lexer Implementation
• Implementation Options:
  1. Write Lexer from scratch
     – Boring and error-prone
  2. Use Lexical Analyzer Generator
     – Quick and easy
• ml-lex is a lexical analyzer generator for ML.
• lex and flex are lexical analyzer generators for C.
ML-Lex Specification
• Lexical specification consists of 3 parts:
User Declarations (plain ML types, values, functions)
%%
ML-LEX Definitions (RE abbreviations, special stuff)
%%
Rules (association of REs with tokens) (each token will be represented in plain ML)
User Declarations
• User Declarations:
  – User can define various values that are available to the action fragments.
  – Two values must be defined in this section:
    • type lexresult
      – type of the value returned by each rule action.
    • fun eof ()
      – called by lexer when end of input stream is reached.
ML-LEX Definitions
• ML-LEX Definitions:
  – User can define regular expression abbreviations:
      DIGITS = [0-9]+;
      LETTER = [a-zA-Z];
  – Define multiple lexers to work together. Each is given a unique name.
      %s LEX1 LEX2 LEX3;
Rules
• Rules:
    <lexer_list> regular_expression => (action.code);
• A rule consists of a pattern and an action:
  – Pattern is a regular expression.
  – Action is a fragment of ordinary ML code.
  – Longest match & rule priority used for disambiguation
• Rules may be prefixed with the list of lexers that are allowed to use this rule.
Rules
• Rule actions can use any value defined in the User Declarations section, including
  – type lexresult
    • type of value returned by each rule action
  – val eof : unit -> lexresult
    • called by lexer when end of input stream reached
• special variables:
  – yytext: input substring matched by regular expression
  – yypos: file position of the beginning of matched string
  – continue (): doesn’t return token; recursively calls lexer
A Simple Lexer
datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
type lexresult = token   (* mandatory *)
fun eof () = EOF         (* mandatory *)
(* itos converts the matched digit string to an int *)
fun itos s = case Int.fromString s of SOME x => x | NONE => raise Fail "itos: not an integer"
%%
NUM = [1-9][0-9]*;
ID = [a-zA-Z] ([a-zA-Z] | {NUM})*;
%%
if => (IF);
then => (THEN);
else => (ELSE);
{NUM} => (Num (itos yytext));
{ID} => (Id yytext);
Using Multiple Lexers
• Rules prefixed with a lexer name are matched only when that lexer is executing
• Initial lexer is called INITIAL
• Enter new lexer using:
  – YYBEGIN LEXERNAME;
• Aside: Sometimes useful to process characters, but not return any token from the lexer. Use:
  – continue ();
Using Multiple Lexers
type lexresult = unit   (* mandatory *)
fun eof () = ()         (* mandatory *)
%%
%s COMMENT;
%%
<INITIAL> if => ();
<INITIAL> [a-z]+ => ();
<INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
<COMMENT> "\n" | . => (continue ());
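To see what switching lexers buys us, here is a hand-written Python sketch of the same two-state idea. It is not ML-Lex output: `lex_words` and `is_lower_ascii` are illustrative names, and unlike the specification above (whose actions return unit) it collects the words it sees so there is something to print. Skipping unmatched characters in the INITIAL state is an added assumption.

```python
def is_lower_ascii(ch):
    """True for the characters matched by [a-z]."""
    return "a" <= ch <= "z"

def lex_words(source):
    """Return the [a-z]+ words that appear outside (* ... *) comments."""
    state = "INITIAL"
    words, pos = [], 0
    while pos < len(source):
        if state == "INITIAL":
            if source.startswith("(*", pos):
                state = "COMMENT"        # plays the role of YYBEGIN COMMENT
                pos += 2
            elif is_lower_ascii(source[pos]):
                end = pos
                while end < len(source) and is_lower_ascii(source[end]):
                    end += 1
                words.append(source[pos:end])
                pos = end
            else:
                pos += 1                 # assumption: silently skip other characters
        else:                            # state == "COMMENT"
            if source.startswith("*)", pos):
                state = "INITIAL"        # plays the role of YYBEGIN INITIAL
                pos += 2
            else:
                pos += 1                 # like continue (): consume, return nothing
    return words

print(lex_words("if (* skip then *) foo"))
# ['if', 'foo']
```

As in the specification, nested comments are not handled: the first `*)` always returns to INITIAL.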
A (Marginally) More Exciting Lexer
type lexresult = string   (* mandatory *)
fun eof () = (print "End of file\n"; "EOF")   (* mandatory *)
%%
%s COMMENT;
INT = [1-9][0-9]*;
%%
<INITIAL> if => ("IF");
<INITIAL> then => ("THEN");
<INITIAL> {INT} => ("INT(" ^ yytext ^ ")");
<INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
<COMMENT> "\n" | . => (continue ());
Implementing ML-Lex
• By compiling, of course:
  – convert REs into non-deterministic finite automata
  – convert non-deterministic finite automata into deterministic finite automata
  – convert deterministic finite automata into a blazingly fast table-driven algorithm
• you did almost everything except possibly the last step in your favorite algorithms class
  – need to deal with disambiguation & rule priority
  – need to deal with multiple lexers
Refreshing your memory: RE ==> NDFA ==> DFA
Lex rules:
  if => (Tok.IF);
  [a-z][a-z0-9]* => (Tok.Id);
NDFA (start state 1):
  1 --i--> 2
  2 --f--> 3 (final: Tok.IF)
  1 --a-z--> 4 (final: Tok.Id)
  4 --a-z0-9--> 4
DFA (start state 1):
  1 --i--> 2,4 (final: Tok.Id)
  1 --a-h, j-z--> 4 (final: Tok.Id)
  2,4 --f--> 3,4 (final: Tok.IF; could be Tok.Id; decision made by rule priority)
  2,4 --a-e, g-z, 0-9--> 4
  3,4 --a-z0-9--> 4
  4 --a-z0-9--> 4
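The NDFA-to-DFA step is the usual subset construction. Here is a rough Python sketch for this particular NDFA, with the transitions hard-coded as read off the diagram; `nfa_step` and `subset_construction` are illustrative names, and because this NDFA has no epsilon moves the sketch leaves out the epsilon-closure step that a general implementation needs.

```python
import string

LETTERS = string.ascii_lowercase
ALNUM = LETTERS + string.digits

def nfa_step(state, ch):
    """Set of NDFA states reachable from `state` on character `ch`."""
    targets = set()
    if state == 1 and ch == "i":
        targets.add(2)          # start of the keyword rule
    if state == 1 and ch in LETTERS:
        targets.add(4)          # start of the identifier rule
    if state == 2 and ch == "f":
        targets.add(3)
    if state == 4 and ch in ALNUM:
        targets.add(4)
    return targets

def subset_construction(start):
    """Build DFA transitions whose states are sets of NDFA states."""
    dfa, worklist = {}, [frozenset([start])]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue            # already expanded
        dfa[current] = {}
        for ch in ALNUM:
            target = frozenset(t for s in current for t in nfa_step(s, ch))
            if target:          # an empty set means "no transition"
                dfa[current][ch] = target
                worklist.append(target)
    return dfa

dfa = subset_construction(1)
print(sorted(sorted(s) for s in dfa))
# [[1], [2, 4], [3, 4], [4]]
```

The four sets that come out are exactly the DFA states 1, "2,4", "3,4" and 4 of the slide.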
Table-driven algorithm
• DFA (states conveniently renamed: 1 is S1; 2,4 is S2; 3,4 is S3; 4 is S4):
  S1 --i--> S2 (final: Tok.Id)
  S1 --a-h, j-z--> S4 (final: Tok.Id)
  S2 --f--> S3 (final: Tok.IF)
  S2 --a-e, g-z, 0-9--> S4
  S3 --a-z0-9--> S4
  S4 --a-z0-9--> S4
Table-driven algorithm
• DFA: [Diagram: the DFA over states S1, S2, S3, S4]
• Transition Table:

| Input | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| a | S4 | S4 | S4 | S4 |
| b | S4 | S4 | S4 | S4 |
| ... | ... | ... | ... | ... |
| i | S2 | S4 | S4 | S4 |
| ... | ... | ... | ... | ... |

• Final State Table:

| S1 | S2 | S3 | S4 |
|---|---|---|---|
| - | Tok.Id | Tok.IF | Tok.Id |

• Algorithm:
  – Start in start state
  – Transition from one state to next using transition table
  – Every time you reach a potential final state, remember it + position in stream
  – When no more transitions apply, revert to last final state seen + position
  – Execute associated rule code
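The algorithm is short enough to write out. This is a Python sketch under stated assumptions, not the code ml-lex generates: `TABLE`, `FINAL` and `longest_match` are illustrative names, the tables are nested dictionaries rather than arrays, and instead of executing rule code the function returns the token name together with the position where the token ends.

```python
import string

LETTERS = string.ascii_lowercase
ALNUM = LETTERS + string.digits

# Transition table: TABLE[state][char] -> next state (missing entry = no transition).
TABLE = {
    "S1": {ch: "S4" for ch in LETTERS},
    "S2": {ch: "S4" for ch in ALNUM},
    "S3": {ch: "S4" for ch in ALNUM},
    "S4": {ch: "S4" for ch in ALNUM},
}
TABLE["S1"]["i"] = "S2"
TABLE["S2"]["f"] = "S3"

# Final state table: the rule that fires if the lexer stops in that state.
FINAL = {"S2": "Tok.Id", "S3": "Tok.IF", "S4": "Tok.Id"}

def longest_match(text, start=0):
    """Return (token, end_position) for the longest token at `start`, or None."""
    state, pos = "S1", start
    last_final = None
    while pos < len(text) and text[pos] in TABLE[state]:
        state = TABLE[state][text[pos]]
        pos += 1
        if state in FINAL:
            last_final = (FINAL[state], pos)   # remember final state + position
    return last_final                          # revert to the last final state seen

for sample in ["if", "ifx", "i", "foo1+bar", "9"]:
    print(sample, longest_match(sample))
# if ('Tok.IF', 2)
# ifx ('Tok.Id', 3)
# i ('Tok.Id', 1)
# foo1+bar ('Tok.Id', 4)
# 9 None
```

In this tiny DFA every state other than S1 is final, so the "revert" step never actually backs up; with rules such as a REAL that needs digits after the decimal point it would.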
Dealing with Multiple Lexers
Lex rules:
  <INITIAL> if => (Tok.IF);
  <INITIAL> [a-z][a-z0-9]* => (Tok.Id);
  <INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
  <COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
  <COMMENT> . => (continue ());
[Diagram: two lexer states. INITIAL --(*--> COMMENT and COMMENT --*)--> INITIAL; INITIAL loops on [a-z][a-z0-9] and COMMENT loops on .]
Summary
• A Lexer:
  – input: stream of characters
  – output: stream of tokens
• Writing lexers by hand is boring, so we use a lexer generator: ml-lex
  – lexer generators work by converting REs through automata theory to efficient table-driven algorithms.
• Moral: don’t underestimate your theory classes!
  – great application of cool theory developed in the 70s.
  – we’ll see more cool apps as the course progresses