CSE P501 – Compiler Construction Scanner Regex Automata Hand-Written Scanner Grammars & BNF Next Spring 2014 Jim Hogg - UW - CSE P501 B-1


CSE P501 – Compiler Construction

Scanner

Regex

Automata

Hand-Written Scanner

Grammars & BNF

Next

Spring 2014 Jim Hogg - UW - CSE P501 B-1


Scanner

[Figure: compiler pipeline]

Source → Front End → ‘Middle End’ → Back End → Target

Front End:    chars → Scan → tokens → Parse → AST → Semantics → IR
‘Middle End’: IR → Optimize → IR
Back End:     IR → Select Instructions → IR → Allocate Registers → IR → Emit → Machine Code

AST = Abstract Syntax Tree

IR = Intermediate Representation


Automatic or Hand-Written?

Use a scanner-generator - JFlex
• regex define tokens
• .jflex → JFlex → Scanner (.java)

OR

Write a scanner, in Java, by hand
• Easy and enlightening
• Will see an outline of how, later


Reminder: a token is . . .

Source program:

    class C {
      public int fac(int n) {  // factorial
        int nn;
        if(n < 1)
          nn = 1;
        else
        nn = n * (this.fac(n-1));
        return nn;
      }
    }

Char stream:

    class∙C∙{◊∙∙public∙int∙fac(int∙n)∙{∙∙//∙factorial◊∙∙∙∙int∙nn;◊∙∙∙∙if(n∙<∙1)◊∙∙∙∙∙∙nn∙=∙1;◊∙∙∙∙else◊∙∙∙∙nn∙=∙n∙*∙(this.fac(n-1));◊∙∙∙∙return∙nn;◊∙∙}◊}

Key for Char Stream:

    ◊  newline (\n)
    ∙  space

Token stream:

    CLASS ID:C LBRACE
    PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE
    INT ID:nn SEMI
    IF LPAREN ID:n LT ILIT:1 RPAREN
    ID:nn EQ ILIT:1 SEMI
    ELSE
    ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI
    RETURN ID:nn SEMI
    RBRACE RBRACE


A Token in your Java scanner

    class Token {
      public int kind;        // eg: LPAREN, ID, ILIT
      public int line;        // for debugging/diagnostics
      public int column;      // for debugging/diagnostics
      public String lexeme;   // eg: "x", "Total", "(", "42"
      public int value;       // attribute of ILIT
    }

Obviously this Token is wasteful of memory:
• lexeme is not required for primitive tokens, such as LPAREN, RBRACE, etc
• value is only required for ILIT

But, there's only 1 token alive at any instant during parsing, so no point refining into 3 leaner variants!


Typical Tokens

Operators & Punctuation
• Single chars: + - * = / ( ] ; :
• Double chars: :: <= == !=

Keywords
• if while for goto return switch void …

Identifiers
• A single ID token kind, parameterized by lexeme

Integer constants
• A single ILIT token kind, parameterized by int value

See jflex-1.5.0\examples\java\java.flex for a real example


Token Spotting


if(a<=3)++grades[1]; // what are the tokens? (no spaces)

public int fac(int n) { // what are the tokens? (need spaces?)

Counter-example: fixed-format FORTRAN:

    DO 50 I = 1,99    // DO loop
    DO 50 I = 1.2     // assignment: DO50I = 1.2


Principle of Longest Match

Scanner should pick the longest possible string to make up the next token (“greedy” algorithm)

Example:

    return idx <= iffy;

should be scanned into 5 tokens:

    RETURN ID:idx LEQ ID:iffy SEMI

• <= is one token, not two
• iffy is an ID, not IF followed by ID:fy
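The longest-match rule can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `LongestMatchDemo` and the string form of the tokens are invented for this example, and only the handful of token kinds needed for the slide's input are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; token names follow the slide (RETURN, ID, LEQ, SEMI).
class LongestMatchDemo {
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int p = 0;
        while (p < src.length()) {
            char c = src.charAt(p);
            if (Character.isWhitespace(c)) { p++; continue; }
            if (Character.isLetter(c)) {
                int begin = p;
                // Greedy: keep consuming while the identifier can still be extended.
                while (p < src.length()
                        && (Character.isLetterOrDigit(src.charAt(p)) || src.charAt(p) == '_')) {
                    p++;
                }
                String lexeme = src.substring(begin, p);
                // Keyword vs identifier is decided only once the whole lexeme is known,
                // so "iffy" is never split into IF followed by ID:fy.
                if (lexeme.equals("return")) tokens.add("RETURN");
                else if (lexeme.equals("if")) tokens.add("IF");
                else tokens.add("ID:" + lexeme);
            } else if (c == '<') {
                // Prefer the two-character operator when the next char allows it.
                if (p + 1 < src.length() && src.charAt(p + 1) == '=') { tokens.add("LEQ"); p += 2; }
                else { tokens.add("LT"); p++; }
            } else if (c == ';') {
                tokens.add("SEMI"); p++;
            } else {
                tokens.add("BAD:" + c); p++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("return idx <= iffy;"));
    }
}
```

Calling `scan("return idx <= iffy;")` yields the five tokens listed above.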


Regex

The lexical structure (the tokens) of most programming languages can be specified using regular expressions
• “REs” in Cooper&Torczon
• “regex” is more common

Tokens can be recognized by a deterministic finite automaton (DFA)
• DFA (a Java class) is almost always generated from regex using a software tool, such as JFlex


Regex Cheat Sheet

Pattern    Matches?
a          a
a*         zero or more a’s
a+         one or more a’s
a?         zero or one a
a|b        a or b
ab         a followed by b
[c-f]      one of c or d or e or f
[^0-3]     any one character except 0-3
.          any character, except newline

Precedence: * (highest), concatenation, | (lowest)

Parentheses can be used to group regexes as needed

Notice the meta-characters (shown in red on the original slide)

Escaped characters: \* \+ \? \| \. \t \n


Regex Examples

regex Meaning?

[abc]+

[abc]* (Kleene closure)

[0-9]+

[1-9][0-9]*

[a-zA-Z_][a-zA-Z0-9_]*

(0|1)* 0

(a|b)*aa(a|b)*

Check free online Regex tutorials if you are rusty. Eg: http://regexone.com/

Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/
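One way to check your answers to the table above is to try the patterns in Java. The sketch below uses `java.util.regex` as a stand-in (JFlex has its own regex dialect); the class name `RegexExamplesDemo` and the helper `in` are made up for this example, and the space in `(0|1)* 0` is dropped because Java would treat it as a literal character.

```java
import java.util.regex.Pattern;

// Hypothetical helper: asks whether the whole string s belongs to L(regex).
class RegexExamplesDemo {
    static boolean in(String regex, String s) {
        // Pattern.matches succeeds only if the entire input matches the pattern.
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(in("[abc]+", ""));                       // empty string: needs one or more
        System.out.println(in("[abc]*", ""));                       // Kleene closure allows zero
        System.out.println(in("[1-9][0-9]*", "042"));               // no leading zero allowed
        System.out.println(in("[a-zA-Z_][a-zA-Z0-9_]*", "_tmp1"));  // identifier shape
        System.out.println(in("(0|1)*0", "1010"));                  // binary strings ending in 0
        System.out.println(in("(a|b)*aa(a|b)*", "abab"));           // must contain "aa"
    }
}
```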


regex

Defined over some alphabet Σ
• For programming languages, alphabet is ASCII or Unicode

If re is a regular expression, L(re) is the language (set of strings) generated by re


regex macros

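Possible syntax for numeric constants

    Digit  = [0-9]
    Digits = Digit+
    Number = Digits ( \. Digits )? ( [eE] ( \+ | - )? Digits )?

(The decimal point and plus sign are escaped because they are literal characters here, not meta-characters.)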

How would you describe this set in English?

What are some examples of legal constants (strings) generated by Number?

Tools like JFlex accept these convenient macros
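To experiment with the Number macro, the same definition can be assembled from string constants in Java. This is a sketch using `java.util.regex`, not JFlex macro syntax; the class name `NumberMacroDemo` and the constant names are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo: builds the Number regex from the Digit and Digits "macros".
class NumberMacroDemo {
    static final String DIGIT  = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    // Number = Digits ( \. Digits )? ( [eE] ( \+ | - )? Digits )?
    static final String NUMBER =
            DIGITS + "(\\." + DIGITS + ")?" + "([eE][+-]?" + DIGITS + ")?";

    static boolean isNumber(String s) {
        return Pattern.matches(NUMBER, s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "6.02e23", "1E-9", ".5", "1.", "1e"}) {
            System.out.println(s + " : " + isNumber(s));
        }
    }
}
```

Note what the definition excludes: a fraction needs digits on both sides of the point, and an exponent marker must be followed by at least one digit.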


Automata

Finite automata (state machines) can be used to recognize strings generated by regular expressions

Can build automaton by hand or automagically
• Will not build by hand in this course
• Will use the JFlex tool: given a set of regex, it generates an automaton recognizer (a Java class)


Finite Automata Terminology

Phrase                                Abbreviation
Finite Automaton                      FA
Deterministic Finite Automaton        DFA
Non-deterministic Finite Automaton    NFA
Finite-State Automaton                FSA = {DFA, NFA}


DFA for “cat”

[Figure: four states in a row, joined by arcs labelled c, a, t. The first is the Start State; the last is the Accepting State (double circles)]

regex = cat


DFA for ILIT

[Figure: state 1 (start) goes to state 2 (accepting) on 0-9; state 2 loops to itself on 0-9]

We have labelled the states

regex = [0-9][0-9]* = [0-9]+


DFA for ID

[Figure: state 0 (start) goes to state 1 (accepting) on a-z or A-Z_; state 1 loops to itself on a-z, A-Z_ or 0-9]

regex = [a-zA-Z_][a-zA-Z0-9_]*


DFAs work like this . . .

1. Scan the input text string, character-by-character

2. Follow the arc/edge corresponding to the character just read

3. If there is no arc for the character just read (or the input is exhausted), then, either:

   a. if you are in an accepting state: you're done. Success!

   b. if you are not in an accepting state: you're done. Failure!


DFAs work like this - examples

[Figure: DFA for alphaid. State 0 (start) goes to state 1 (accepting) on a-z; state 1 loops to itself on a-z]

1. Scan "fac(int n);" for the regex, alphaid = [a-z]+ (lower-case alphas)
   We hit "(" and are already in state 1. Success

2. Scan "23;" for regex alphaid
   There is no arc for "2". We are still in state 0. Failure

3. Scan "today" for regex alphaid
   We hit end-of-string and are already in state 1. Success

Note: no need to add arcs to the DFA for all error cases - they are implicit
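The three examples above can be replayed with a tiny hand-coded DFA. This is a minimal sketch, assuming states 0 and 1 as in the figure; the class name `AlphaIdDfa` and the convention of returning the matched length (or -1 on failure) are choices made for this illustration.

```java
// Hypothetical hand-coded DFA for alphaid = [a-z]+
class AlphaIdDfa {
    // Returns the length of the prefix of s accepted by the DFA, or -1 on failure.
    static int run(String s) {
        int state = 0;                      // start state
        int p = 0;
        while (p < s.length()) {
            char c = s.charAt(p);
            if (c >= 'a' && c <= 'z') {     // arcs 0 -> 1 and 1 -> 1, both on a-z
                state = 1;
                p++;
            } else {
                break;                      // no arc for c: error arcs stay implicit
            }
        }
        return state == 1 ? p : -1;         // state 1 is the only accepting state
    }

    public static void main(String[] args) {
        System.out.println(run("fac(int n);"));
        System.out.println(run("23;"));
        System.out.println(run("today"));
    }
}
```

Returning the matched length rather than a boolean lets a caller cut the lexeme out of the input, which is what a scanner needs.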


Thompson’s Construction: Combining DFAs

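[Figure: DFA for: a (one arc labelled a) and DFA for: b (one arc labelled b)]

[Figure: NFA for: ab. The accepting state of the DFA for a is joined to the start state of the DFA for b by an ε arc]

[Figure: NFA for a|b. A new start state has ε arcs to the start states of the DFAs for a and for b; their accepting states each have an ε arc to a new accepting state]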


Combining DFAs, cont’d

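[Figure: DFA for: a and DFA for: b, as before]

[Figure: NFA for: a*. A new start state and a new accepting state are added around the DFA for a, connected by four ε arcs so that the a arc can be skipped entirely or repeated any number of times]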
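The three combining rules (concatenation, alternation, closure) can be sketched as Java methods that add states and ε arcs to a growing NFA. This follows the usual textbook form of Thompson's construction, which is assumed here to match the figures; the class `ThompsonSketch`, its method names and the use of `'\0'` to mark ε arcs are all inventions of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder: each method returns a fragment with one start and one accept state.
class ThompsonSketch {
    static final char EPS = '\0';           // marker for ε arcs (an assumption of this sketch)

    static class Edge {
        final int from; final char label; final int to;
        Edge(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }
    static class Frag {
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    final List<Edge> edges = new ArrayList<>();
    int stateCount = 0;

    int newState() { return stateCount++; }
    void edge(int from, char label, int to) { edges.add(new Edge(from, label, to)); }

    // Automaton for a single symbol: start --c--> accept
    Frag sym(char c) {
        int s = newState(), a = newState();
        edge(s, c, a);
        return new Frag(s, a);
    }

    // fg: join f's accept state to g's start state with one ε arc
    Frag concat(Frag f, Frag g) {
        edge(f.accept, EPS, g.start);
        return new Frag(f.start, g.accept);
    }

    // f|g: new start with ε arcs into both; both accept states get ε arcs to a new accept
    Frag alt(Frag f, Frag g) {
        int s = newState(), a = newState();
        edge(s, EPS, f.start); edge(s, EPS, g.start);
        edge(f.accept, EPS, a); edge(g.accept, EPS, a);
        return new Frag(s, a);
    }

    // f*: ε arcs to enter, to skip, to loop back, and to leave
    Frag star(Frag f) {
        int s = newState(), a = newState();
        edge(s, EPS, f.start); edge(s, EPS, a);
        edge(f.accept, EPS, f.start); edge(f.accept, EPS, a);
        return new Frag(s, a);
    }

    long epsCount() { return edges.stream().filter(e -> e.label == EPS).count(); }
}
```

For example, `t.concat(t.sym('a'), t.star(t.alt(t.sym('b'), t.sym('c'))))` builds an NFA for a(b|c)* with ten states, most of whose arcs are ε arcs; that surplus is what the NFA-to-DFA conversion later removes.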


Exercise

Draw the NFA for: b(at|ag) | bug

[Figure: NFA with three paths, whose arcs are labelled b, a, t; then a, g; and b, u, g]



NFA for a(b|c)*

[Figure: NFA for a(b|c)*, with arcs labelled a, b and c plus ε arcs]

To recognize "acb" successfully, we need to do one of:

• guess the future correctly
• backtrack and retry if we fail to recognize
• somehow execute all possible paths

None of these is attractive! Can we construct an equivalent DFA?


Finite State Automaton (FSA)

A finite set of states
• One marked as initial state
• One or more marked as final states
• States sometimes labeled or numbered

A set of transitions from state to state
• Each labeled with symbol from Σ, or ε

Operate by reading input symbols (usually characters)
• Transition can be taken if labeled with current symbol
• ε-transition can be taken at any time (free bus ride)

Accept when final state reached & no more input
• Scanner uses an FSA as a subroutine – accept longest match from current location each time called, even if more input

Reject if no transition possible, or no more input and not in final state (DFA)


DFA vs NFA

Deterministic Finite Automata (DFA)
• No choice of which transition to take
• In particular, no ε transitions
• No guessing

Non-deterministic Finite Automata (NFA)
• Choice of transition in at least one case
• Accepts if some way to reach final state on given input
• Reject if no possible way to final state
• How to implement in software?
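One answer to "how to implement in software" is to track the set of all states the NFA could be in, which amounts to executing every path at once. The sketch below does this for a(b|c)*. The state numbering 0 to 9 and the arc layout are assumptions of this example (a Thompson-style NFA), as are the class name `NfaSimulation` and its helpers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simulator for one fixed NFA: a(b|c)*
class NfaSimulation {
    static final int EPS = -1;              // label used for ε arcs in this sketch
    // Each row is {from, label, to}.
    static final int[][] ARCS = {
        {0, 'a', 1}, {1, EPS, 2}, {2, EPS, 3}, {2, EPS, 9}, {3, EPS, 4}, {3, EPS, 6},
        {4, 'b', 5}, {6, 'c', 7}, {5, EPS, 8}, {7, EPS, 8}, {8, EPS, 3}, {8, EPS, 9}
    };
    static final int START = 0, ACCEPT = 9;

    // All states reachable from 'states' using only ε arcs (the "free bus rides").
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] arc : ARCS) {
                if (arc[0] == s && arc[1] == EPS && result.add(arc[2])) work.push(arc[2]);
            }
        }
        return result;
    }

    // All states reachable from 'states' by one arc labelled c.
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int[] arc : ARCS) {
            if (arc[1] == c && states.contains(arc[0])) result.add(arc[2]);
        }
        return result;
    }

    static boolean accepts(String input) {
        Set<Integer> current = closure(Set.of(START));
        for (char c : input.toCharArray()) {
            current = closure(move(current, c));
            if (current.isEmpty()) return false;    // no path survives
        }
        return current.contains(ACCEPT);            // some path ended in the final state
    }
}
```

This needs no guessing and no backtracking, but it recomputes state sets for every input character; precomputing those sets once is the idea behind converting the NFA to a DFA.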


DFAs in Scanners

We really want DFA for speed: no backtracking, no guessing, no foretelling the future

Conversion from regex to NFA is easy, right?

But how to turn an NFA into an equivalent DFA?

Turns out to be obvious (once seen) and easy


NFA to DFA

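[Figure: NFA for a(b|c)*, states 0 to 9. Arc from 0 to 1 on a; arc from 4 to 5 on b; arc from 6 to 7 on c; the remaining arcs are ε edges]

Starting with the above NFA, we want to 'collapse' epsilon edges, ending up with a DFA that recognizes, and rejects, the same char strings. Ideally, we will end up with:

[Figure: DFA with states 0 and 1. Arc from 0 to 1 on a; state 1 loops to itself on b and on c]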


NFA to DFA

[Figure: NFA for a(b|c)*, states 0 to 9]

• Begin in the Start state
• For each labelled arc leaving that state, what set of states can I reach, along the labelled arc, or along ε transitions?


NFA to DFA

[Figure: NFA for a(b|c)*, with states now named n0 to n9]

DFA state (set of NFA states)    a                     b                     c
d0 = {0}                         d1 = {1,2,3,4,6,9}    none                  none
d1 = {1,2,3,4,6,9}               none                  d2 = {3,4,5,6,8,9}    d3 = {3,4,6,7,8,9}
d2 = {3,4,5,6,8,9}               none                  d2 = {3,4,5,6,8,9}    d3 = {3,4,6,7,8,9}
d3 = {3,4,6,7,8,9}               none                  d2 = {3,4,5,6,8,9}    d3 = {3,4,6,7,8,9}


NFA to DFA

DFA state    a     b     c
d0           d1    -     -
d1           -     d2    d3
d2           -     d2    d3
d3           -     d2    d3

[Figure: DFA for a(b|c)*. Arc from d0 to d1 on a; from d1 to d2 on b and to d3 on c; d2 loops to itself on b and goes to d3 on c; d3 loops to itself on c and goes to d2 on b]


NFA to DFA - Even Better

[Figure: DFA for a(b|c)*. Arc from d0 to d1 on a; d1 loops to itself on b and on c]

• Can reduce number of states further, to yield above result

• If interested, see books for details

• State minimization is not examined in P501


From NFA to DFA

Subset construction (equivalence class)
• Construct DFA from NFA, where each DFA state represents a set of NFA states

Key idea
• State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input

Algorithm: example of a fixed-point computation

If NFA has n states, DFA has at most 2^n states
• => DFA is finite, can construct in finite # steps
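The subset construction can be sketched as a worklist loop that keeps discovering new sets of NFA states until none appear (the fixed point). The example below runs it on an NFA for a(b|c)*; the arc layout, the class name `SubsetConstruction` and the `"d<i>,<char>"` key format for the transition table are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical subset construction over one fixed NFA: a(b|c)*
class SubsetConstruction {
    static final int EPS = -1;                       // label for ε arcs in this sketch
    static final int[][] ARCS = {                    // rows are {from, label, to}
        {0, 'a', 1}, {1, EPS, 2}, {2, EPS, 3}, {2, EPS, 9}, {3, EPS, 4}, {3, EPS, 6},
        {4, 'b', 5}, {6, 'c', 7}, {5, EPS, 8}, {7, EPS, 8}, {8, EPS, 3}, {8, EPS, 9}
    };
    static final char[] ALPHABET = {'a', 'b', 'c'};

    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] arc : ARCS) {
                if (arc[0] == s && arc[1] == EPS && result.add(arc[2])) work.push(arc[2]);
            }
        }
        return result;
    }

    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int[] arc : ARCS) {
            if (arc[1] == c && states.contains(arc[0])) result.add(arc[2]);
        }
        return result;
    }

    // Returns the DFA states in discovery order (d0, d1, ...); fills 'delta' with
    // entries such as "d1,b" -> 2, meaning: from d1 on b go to d2.
    static List<Set<Integer>> build(Map<String, Integer> delta) {
        List<Set<Integer>> dfaStates = new ArrayList<>();
        dfaStates.add(closure(Set.of(0)));           // d0 = ε-closure of the NFA start state
        for (int i = 0; i < dfaStates.size(); i++) { // list grows until no new sets appear
            for (char c : ALPHABET) {
                Set<Integer> target = closure(move(dfaStates.get(i), c));
                if (target.isEmpty()) continue;      // no arc: the error case stays implicit
                int j = dfaStates.indexOf(target);
                if (j < 0) {
                    dfaStates.add(target);
                    j = dfaStates.size() - 1;
                }
                delta.put("d" + i + "," + c, j);
            }
        }
        return dfaStates;
    }
}
```

With this numbering, `build` discovers four DFA states, where d1 is the set {1, 2, 3, 4, 6, 9}. The loop must terminate because there are only finitely many subsets of NFA states.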


Build DFA for: b(at|ag) | bug from its NFA

[Figure: NFA for b(at|ag) | bug, states 0 to 12, with arcs labelled b, a, t, a, g, b, u, g plus ε arcs]

NFA State         a    b            g    t    u
d0 = 0            -    {1,2,5,9}    -    -    -
d1 = {1,2,5,9}    ?    ?            ?    ?    ?
?                 ?    ?            ?    ?    ?


Build DFA for: b(at|ag) | bug from its NFA

[Figure: NFA for b(at|ag) | bug, states 0 to 12]

NFA State         a           b               g             t            u
d0 = {0}          -           d1={1,2,5,9}    -             -            -
d1 = {1,2,5,9}    d2={3,6}    -               -             -            d3={10}
d2 = {3,6}        -           -               d4={7}        d5={4,12}    -
d3 = {10}         -           -               d6={11,12}    -            -
TBD               ?           ?               ?             ?            ?


Hand-Written Scanner

Idea: show a hand-written DFA for some typical tokens
• Then use to construct hand-written scanner

Setting: Parser calls scanner whenever it wants next token
• JFlex provides next_token
• Scanner stores current position in input

For illustration only. Course project will use JFlex scanner-generator

Note - most commercial compilers use hand-written scanners - generally faster


Scanner DFA Example – Part 1

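[Figure: DFA with start state 0, which loops to itself on whitespace or comments. Arcs out of state 0:
  end of input  →  state 1: Accept EOF
  (             →  state 2: Accept LPAREN
  )             →  state 3: Accept RPAREN
  ;             →  state 4: Accept SEMI]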


Scanner DFA Example – Part 2

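[Figure: DFA fragment for ! and <
  !  →  state 5;  from 5:  =  →  state 6: Accept NEQ;  [other]  →  state 7: Accept NOT
  <  →  state 8;  from 8:  =  →  state 9: Accept LEQ;  [other]  →  state 10: Accept LESS]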


Scanner DFA Example – Part 3

[Figure: DFA fragment for integer literals
  [0-9]  →  state 11;  state 11 loops to itself on [0-9];  [other]  →  state 12: Accept ILIT]


Scanner DFA Example – Part 4

[Figure: DFA fragment for identifiers
  [a-zA-Z]  →  state 13;  state 13 loops to itself on [a-zA-Z0-9_];  [other]  →  state 14: Accept ID or keyword]

Strategies for handling identifiers vs keywords
• Hand-written scanner: look up identifier-like things in table of keywords
• Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords
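The table-lookup strategy for a hand-written scanner might look like the sketch below. The class name `KeywordLookup`, the string form of the token kinds and the particular keyword set are assumptions for illustration; a real MiniJava scanner would list every keyword of the language.

```java
import java.util.Map;

// Hypothetical keyword table: scan an identifier-like lexeme first, classify it afterwards.
class KeywordLookup {
    // Sample keywords only, not the full MiniJava list.
    static final Map<String, String> KEYWORDS = Map.of(
            "class", "CLASS",
            "public", "PUBLIC",
            "int", "INT",
            "if", "IF",
            "else", "ELSE",
            "return", "RETURN");

    static String classify(String lexeme) {
        String kind = KEYWORDS.get(lexeme);
        // Anything that is not in the table is an ordinary identifier.
        return kind != null ? kind : "ID:" + lexeme;
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));
        System.out.println(classify("iffy"));
    }
}
```

The DFA stays small with this approach, since it only needs the single identifier loop; the cost is one hash lookup per identifier-like lexeme.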


Scanner – class, ctor, skipWhite

    public class Scanner {
        private String prog;   // the MiniJava program to be scanned
        private int p;         // index in 'prog' of current char

        public Scanner(String prog) {
            this.prog = prog;
            p = 0;
        }

        private void skipWhite() {
            // stop at end of input rather than indexing past the last char
            while (p < prog.length() && Character.isWhitespace(prog.charAt(p))) {
                p++;
            }
        }


Scanner- id

        private Token id() {
            int pBegin = p;                    // remember begin index of id

            // current char is alphabetic; consume letters, digits and '_',
            // stopping at end of input
            while (p < prog.length()) {
                char c = prog.charAt(p);
                if (!(Character.isAlphabetic(c) || Character.isDigit(c) || c == '_')) break;
                p++;
            }
            return new Token(ID, prog.substring(pBegin, p));
        }


Scanner - iLit

        private Token iLit() {
            int pBegin = p;                    // remember begin index of lexeme
            int val = 0;                       // value accumulated so far

            // step thru chars of number; only digits contribute to the value
            while (p < prog.length() && Character.isDigit(prog.charAt(p))) {
                val = 10 * val + Character.getNumericValue(prog.charAt(p));
                p++;
            }
            String lex = prog.substring(pBegin, p);
            return new Token(ILIT, lex, val);  // kind is ILIT, not ID
        }


Scanner - nextToken

        public Token nextToken() {
            skipWhite();                                   // returns at prog[p]
            if (p >= prog.length()) {
                return new Token(EOF, "");                 // end of input
            }
            char c = prog.charAt(p);                       // current char in 'prog'
            char n = (p + 1 < prog.length())               // next char in 'prog',
                   ? prog.charAt(p + 1) : '\0';            // or '\0' at end of input

            switch (c) {
                case '>':
                    if (n == '=') {
                        p++; p++; return new Token(GEQ, ">=");
                    } else {
                        p++; return new Token(GT, ">");
                    }
                // . . .
                case '+': p++; return new Token(PLUS, "+");
                // . . .
            } // end of switch


Scanner – nextToken, cont’d

            if (Character.isDigit(c)) {
                return this.iLit();
            } else if (Character.isAlphabetic(c)) {
                return this.id();
            } else {
                p++;                           // skip the bad char so the next call makes progress
                return new Token(BAD, "");
            }
        } // end of nextToken

    } // end of class Scanner

An entire hand-written scanner for MiniJava takes ~100 lines of Java


Grammars & BNF

Since the 60s, the syntax of every significant programming language has been specified by a formal grammar
• First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax
• Borrowed from the linguistics community (Noam Chomsky)


Grammar for a Tiny Language

    program    → statement | program statement
    statement  → assignStmt | ifStmt
    assignStmt → id = expr ;
    ifStmt     → if ( expr ) statement
    expr       → id | ilit | expr + expr
    id         → a | b | c | i | j | k | n | x | y | z
    ilit       → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Note: often see ::= used instead of →


Example Derivation

a = 1 ; if ( a + 1 ) b = 2 ;

    program    ::= statement | program statement
    statement  ::= assignStmt | ifStmt
    assignStmt ::= id = expr ;
    ifStmt     ::= if ( expr ) statement
    expr       ::= id | ilit | expr + expr
    id         ::= a | b | c | i | j | k | n | x | y | z
    ilit       ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Abbreviated:

    P    → S | P S
    S    → A | I
    A    → id = E ;
    I    → if ( E ) S
    E    → id | ilit | E + E
    id   → [a-z]
    ilit → [0-9]


Parse Tree - First Few Steps

a = 1 ; if ( a + 1 ) b = 2 ;

[Figure: partial parse tree, using the abbreviated grammar]

    P
    ├── P
    │   └── S
    │       └── A
    │           ├── id
    │           ├── =
    │           ├── E
    │           │   └── ilit
    │           └── ;
    └── S


Parse Tree - Complete

a = 1 ; if ( a + 1 ) b = 2 ;

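[Figure: complete parse tree, using the abbreviated grammar]

    P
    ├── P
    │   └── S
    │       └── A
    │           ├── id
    │           ├── =
    │           ├── E
    │           │   └── ilit
    │           └── ;
    └── S
        └── I
            ├── if
            ├── (
            ├── E
            │   ├── E
            │   │   └── id
            │   ├── +
            │   └── E
            │       └── ilit
            ├── )
            └── S
                └── A
                    ├── id
                    ├── =
                    ├── E
                    │   └── ilit
                    └── ;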


Alternative Notations

There are several syntax notations for productions in common use; all mean the same thing

ifStmt ::= if ( expr ) statement

ifStmt → if ( expr ) statement

<ifStmt> ::= if ( <expr> ) <statement>


Formal Languages & Automata Theory

Alphabet: a finite set of symbols ( eg: [a-zA-Z0-9_] )

String: a finite, possibly empty sequence of symbols from an alphabet

Language: a set, often infinite, of strings

Finite specifications of (possibly infinite) languages
• Grammar – a generator; a system for producing all strings in the language (and no other strings)

A particular language may be specified by many different grammars

A grammar specifies only one language


Productions

The rules of a grammar are called productions

Rules contain
• Nonterminal symbols: grammar variables (program, statement, id, etc)
• Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, (, ), … )

Meaning of: nonterminal → <sequence of terminals and nonterminals>
• In a derivation, an instance of the nonterminal can be replaced by the sequence of terminals and nonterminals on its RHS

Often, there are two or more productions for one nonterminal – use any in different parts of derivation


Two ways to Parse

Parse: re-construct the derivation (syntactic structure) of a program

More prosaically: fill the gap between top and bottom of page with a parse tree:

Start at top; build tree downwards, sweeping left-to-right. This is called a "top-down" parse. What we just did for the "Tiny Language" example

Start at bottom; build little trees that join upwards. Called a "bottom-up" parse. What CUP does for us.
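As a small taste of the top-down approach, here is a recursive-descent recognizer for the Tiny Language. It is a sketch under several assumptions: tokens are separated by whitespace, the two left-recursive rules (program and expr) are rewritten as loops so that a top-down parser can handle them, and the class name `TinyParser` and its methods are invented for this example. It only answers yes or no; it does not build a parse tree.

```java
// Hypothetical recursive-descent recognizer for the Tiny Language grammar.
class TinyParser {
    private final String[] toks;    // whitespace-separated tokens
    private int p = 0;              // index of the current token

    TinyParser(String src) { toks = src.trim().split("\\s+"); }

    private boolean at(String t) { return p < toks.length && toks[p].equals(t); }
    private boolean atId()   { return p < toks.length && toks[p].matches("[abcijknxyz]"); }
    private boolean atIlit() { return p < toks.length && toks[p].matches("[0-9]"); }

    private void expect(String t) {
        if (!at(t)) throw new IllegalStateException("expected " + t + " at token " + p);
        p++;
    }

    // program ::= statement | program statement     (rewritten here as: statement+)
    void program() {
        statement();
        while (p < toks.length) statement();
    }

    // statement ::= assignStmt | ifStmt
    void statement() {
        if (at("if")) ifStmt(); else assignStmt();
    }

    // assignStmt ::= id = expr ;
    void assignStmt() {
        if (!atId()) throw new IllegalStateException("expected id at token " + p);
        p++;
        expect("=");
        expr();
        expect(";");
    }

    // ifStmt ::= if ( expr ) statement
    void ifStmt() {
        expect("if"); expect("("); expr(); expect(")");
        statement();
    }

    // expr ::= id | ilit | expr + expr     (rewritten here as: operand ('+' operand)*)
    void expr() {
        operand();
        while (at("+")) { p++; operand(); }
    }

    void operand() {
        if (!(atId() || atIlit())) throw new IllegalStateException("expected id or ilit at token " + p);
        p++;
    }

    static boolean parses(String src) {
        try {
            new TinyParser(src).program();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

`TinyParser.parses("a = 1 ; if ( a + 1 ) b = 2 ;")` accepts the example program, with each method call corresponding to one nonterminal being expanded from the top of the tree downwards.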


Why Separate Scanner and Parser?

In principle, a single recognizer could work directly from a concrete, character-by-character grammar

In practice this is rarely done: compilers almost always scan chars to tokens, because:

Simplicity & Separation of Concerns
• Scanner hides details from parser (comments, whitespace, input files, etc)
• Parser becomes easier to build; has simpler input - stream-of-tokens

Efficiency
• Scanner can use simpler, fast design
• But still often consumes a surprising amount of the compiler’s total execution time - it touches every char in source program


Project Notes

For MiniJava project
• Use JFlex scanner-generator tool
• Use CUP parser-generator tool
• The two work together

CUP generates a file of token kinds into sym.java (SEMI = 28, LT = 18, etc)

JFlex needs these definitions. To bootstrap this process, inspect the MiniJava grammar and devise your own set of token kinds

See MiniJava page at: http://www.cambridge.org/resources/052182060X/


Next

Homework: paper exercises on regex and FAs

Next week: first part of the compiler assignment – the scanner

Send partner info to Nat if you want project space

Next topic: parsing
• Will do LR parsing first, for the project (CUP)
• Cooper&Torczon chapter 3