Compilation 0368-3133 (Semester A, 2013/14)
Lecture 2: Lexical Analysis
Modern Compiler Design: Chapter 2.1
Noam Rinetzky

Upload: elmer-thomas-lawrence

Post on 02-Jan-2016


Admin

• Slides: http://www.cs.tau.ac.il/~maon/…
– All info: Moodle
• Classroom: Trubowicz 101 (Law school)
– Except 5 Nov 2013
• Mobiles ...


What is a Compiler?

“A compiler is a computer program that transforms source code written in a programming language (source language) into another language (target language).

The most common reason for wanting to transform source code is to create an executable program.”

--Wikipedia


Conceptual Structure of a Compiler

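[Figure: Source text (txt) → Compiler → Executable code (exe). Inside the compiler, the Frontend (Lexical Analysis, Syntax Analysis (Parsing), Semantic Analysis) produces a Semantic Representation, the Intermediate Representation (IR), which the Backend (Code Generation) turns into executable code.]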


[Figure: the same compiler pipeline, now with the labels “words” and “sentences” added.]


What does Lexical Analysis do?

• Language: fully parenthesized expressions

Context free language:
Expr → Num | LP Expr Op Expr RP

Regular languages:
Num → Dig | Dig Num
Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
LP → ‘(’
RP → ‘)’
Op → ‘+’ | ‘*’

Input: ( ( 23 + 7 ) * 19 )
Tokens: LP LP Num Op Num RP Op Num RP

Each token has a Kind and a Value.


What does Lexical Analysis do?

• Partitions the input into a stream of tokens
– Numbers
– Identifiers
– Keywords
– Punctuation
• Tokens are usually represented as (kind, value) pairs
– (Num, 23)
– (Op, ‘*’)
• A token is a “word” in the source language
• A token is “meaningful” to the syntactical analysis
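As an illustration of the (kind, value) representation, here is a minimal hand-written sketch in C for the fully parenthesized expression language. The names `Token`, `TokenKind` and `next_token` are assumptions made for this example; they do not come from the lecture.

```c
#include <ctype.h>

/* Token kinds for the fully parenthesized expression language. */
typedef enum { TOK_LP, TOK_RP, TOK_NUM, TOK_OP, TOK_EOF, TOK_ERROR } TokenKind;

/* A token is a (kind, value) pair: value holds the number for TOK_NUM
 * and the operator character for TOK_OP; it is 0 otherwise. */
typedef struct {
    TokenKind kind;
    long value;
} Token;

/* Reads one token starting at *cursor and advances *cursor past it. */
Token next_token(const char **cursor)
{
    const char *p = *cursor;
    Token tok = { TOK_EOF, 0 };

    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                              /* whitespace is consumed, not returned */

    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (*p == '(') {
        tok.kind = TOK_LP; p++;
    } else if (*p == ')') {
        tok.kind = TOK_RP; p++;
    } else if (*p == '+' || *p == '*') {
        tok.kind = TOK_OP; tok.value = *p; p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUM;
        while (isdigit((unsigned char)*p)) {
            tok.value = tok.value * 10 + (*p - '0');
            p++;
        }
    } else {
        tok.kind = TOK_ERROR; p++;        /* skip the offending character */
    }
    *cursor = p;
    return tok;
}
```

Calling `next_token` repeatedly on `( ( 23 + 7 ) * 19 )` should yield the kinds LP LP Num Op Num RP Op Num RP, followed by an end-of-input token.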


From scanning to parsing

[Figure: the program text ((23 + 7) * x) enters the Lexical Analyzer, which emits the token stream LP LP Num OP Num RP OP Id RP. The Parser, using the grammar E → ... | Id, Id → ‘a’ | ... | ‘z’, either reports a syntax error or, for valid input, builds the Abstract Syntax Tree Op(*) with children Op(+) (over Num(23) and Num(7)) and Id(x).]


Why Lexical Analysis?

• Simplifies the syntax analysis (parsing)
– And language definition
• Modularity
• Reusability
• Efficiency


Lecture goals

• Understand role & place of lexical analysis
• Lexical analysis theory
• Using program generating tools


Lecture Outline

• Role & place of lexical analysis
• What is a token?
• Regular languages
• Lexical analysis
• Error handling
• Automatic creation of lexical analyzers


What is a token? (Intuitively)

• A “word” in the source language
– Anything that should appear in the input to syntax analysis
• Identifiers
• Values
• Language keywords
• Usually represented as a pair of (kind, value)


Example Tokens

Type      Examples
ID        foo, n_14, last
NUM       73, 00, 517, 082
REAL      66.1, .5, 5.5e-10
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )


Example Non Tokens

Type                      Examples
comment                   /* ignored */
preprocessor directive    #include <foo.h>
                          #define NUMS 5.6
macro                     NUMS
whitespace                \t, \n, \b, ‘ ‘


Some basic terminology

• Lexeme (aka symbol) - a series of letters separated from the rest of the program according to a convention (space, semicolon, comma, etc.)
• Pattern - a rule specifying a set of strings. Example: “an identifier is a string that starts with a letter and continues with letters and digits”
– (Usually) a regular expression
• Token - a pair of (pattern, attributes)


Example

void match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0. ;
}

VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN
LBRACE
IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
RETURN REAL(0.0) SEMI
RBRACE
EOF


Example Non Tokens

• Comments, preprocessor directives, macros and whitespace are lexemes that are recognized but get consumed rather than transmitted to the parser
– if
– i/*comment*/f


How can we define tokens?

• Keywords – easy!
– if, then, else, for, while, …
• Identifiers?
• Numerical Values?
• Strings?
• Characterize unbounded sets of values using a bounded description?


Regular languages

• Formal languages
– Σ = finite set of letters
– Word = sequence of letters
– Language = set of words
• Regular languages defined equivalently by
– Regular expressions
– Finite-state automata


Common format for reg-exps

Basic Patterns           Matching
x                        The character x
.                        Any character, usually except a new line
[xyz]                    Any of the characters x, y, z
[^x]                     Any character except x

Repetition Operators
R?                       An R or nothing (= optionally an R)
R*                       Zero or more occurrences of R
R+                       One or more occurrences of R

Composition Operators
R1R2                     An R1 followed by R2
R1|R2                    Either an R1 or R2

Grouping
(R)                      R itself


Examples

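• ab*|cd? = an a followed by zero or more b’s, or a c optionally followed by a d: a, ab, abb, …, c, cd
• (a|b)* = all words over the letters a and b, including the empty word
• (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)* = all sequences of digits, including the empty word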


Escape characters

• What is the expression for one or more + symbols?
– (+)+ won’t work
– (\+)+ will
• A backslash \ before an operator turns it into a standard character
– \*, \?, \+, a\(b\+\*, (a\(b\+\*)+, …
• Double quotes surrounding text do the same for the whole text
– "a(b+*", "a(b+*"+


Shorthands

• Use names for expressions
– letter = a | b | … | z | A | B | … | Z
– letter_ = letter | _
– digit = 0 | 1 | 2 | … | 9
– id = letter_ (letter_ | digit)*
• Use hyphen to denote a range
– letter = a-z | A-Z
– digit = 0-9
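The shorthand `id = letter_ (letter_ | digit)*` translates almost directly into code. The sketch below is only an illustration: the function names are hypothetical and it assumes ASCII input.

```c
#include <stddef.h>

/* letter_ = letter | _   (ASCII letters only) */
static int is_letter_(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* digit = 0-9 */
static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns the length of the longest prefix of s that matches
 * id = letter_ (letter_ | digit)*, or 0 if no prefix matches. */
size_t match_id(const char *s)
{
    size_t n;
    if (!is_letter_(s[0]))
        return 0;                         /* an id must start with letter_ */
    for (n = 1; is_letter_(s[n]) || is_digit(s[n]); n++)
        ;                                 /* the (letter_ | digit)* part */
    return n;
}
```

For example, `match_id("n_14")` consumes the whole string, while `match_id("9lives")` returns 0 because an identifier cannot start with a digit.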


Examples

• if = if
• then = then
• relop = < | > | <= | >= | = | <>
• digit = 0-9
• digits = digit+


Example

• A number is

number = ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
         ( Є | . ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
           ( Є | E ( Є | + | - ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+ ) )

• Using shorthands it can be written as

number = digits ( Є | .digits ( Є | E ( Є | + | - ) digits ) )
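The nesting of the optional parts can be easier to see in code. The following sketch follows the shorthand definition of number literally (an exponent is only allowed after a fraction); `match_digits` and `match_number` are names invented for this illustration.

```c
#include <stddef.h>

/* digits = digit+ : returns the number of leading digits (0 if none). */
static size_t match_digits(const char *s)
{
    size_t n = 0;
    while (s[n] >= '0' && s[n] <= '9')
        n++;
    return n;
}

/* Returns the length of the longest prefix of s that matches number,
 * or 0 if no prefix matches. */
size_t match_number(const char *s)
{
    size_t n = match_digits(s);
    if (n == 0)
        return 0;                         /* the leading digits are mandatory */

    if (s[n] == '.') {
        size_t frac = match_digits(s + n + 1);
        if (frac == 0)
            return n;                     /* "." without digits: stop before it */
        n += 1 + frac;

        if (s[n] == 'E') {
            size_t k = n + 1;
            if (s[k] == '+' || s[k] == '-')
                k++;                      /* optional sign */
            size_t exp_digits = match_digits(s + k);
            if (exp_digits > 0)
                n = k + exp_digits;       /* otherwise keep the match without E */
        }
    }
    return n;
}
```

Under this definition `1E5` matches only the prefix `1`, since the exponent part sits inside the fractional alternative.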


Exercise 1 - Question

• Language of rational numbers in decimal representation (no leading, ending zeros)
– 0
– 123.757
– .933333
– Not 007
– Not 0.30


Exercise 1 - Answer

• Language of rational numbers in decimal representation (no leading, ending zeros)
– Digit = 1|2|…|9
– Digit0 = 0|Digit
– Num = Digit Digit0*
– Frac = Digit0* Digit
– Pos = Num | .Frac | 0.Frac | Num.Frac
– PosOrNeg = (Є|-)Pos
– R = 0 | PosOrNeg


Exercise 2 - Question

• Equal number of opening and closing parentheses: [^n ]^n (n opening brackets followed by n closing brackets) = [], [[]], [[[]]], …


Exercise 2 - Answer

• Equal number of opening and closing parentheses: [^n ]^n (n opening brackets followed by n closing brackets) = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]
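A recognizer for this grammar needs to remember how many brackets are still open, which is what a finite automaton cannot do for unbounded n. A small recursive-descent sketch (function names are hypothetical) makes the recursion in S ::= [] | [S] explicit:

```c
#include <stddef.h>

/* Recognizer for S ::= [] | [S].
 * Returns the number of characters consumed by S starting at s,
 * or 0 if no S starts there. */
static size_t parse_S(const char *s)
{
    if (s[0] != '[')
        return 0;
    if (s[1] == ']')
        return 2;                         /* S ::= [] */
    size_t inner = parse_S(s + 1);        /* S ::= [S] */
    if (inner == 0 || s[1 + inner] != ']')
        return 0;
    return inner + 2;
}

/* Returns 1 if the whole string is in the language, 0 otherwise. */
int is_balanced(const char *s)
{
    size_t n = parse_S(s);
    return n > 0 && s[n] == '\0';
}
```

The depth of the recursion equals n, so the memory used grows with the input, unlike the fixed set of states of an automaton.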


Challenge: Ambiguity

• if = if
• id = letter_ (letter_ | digit)*
• “if” is a valid word in the language of identifiers… so what should it be?
• How about the identifier “iffy”?
• Solution
– Always find longest matching token
– Break ties using order of definitions… first definition wins (=> list rules for keywords before identifiers)
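The two rules (longest match, then first definition wins) can be sketched by trying every rule in definition order and replacing the current best only when a later rule matches strictly more input. This is an illustrative sketch with made-up names, not how a generated scanner works internally.

```c
#include <stddef.h>
#include <string.h>

typedef enum { KIND_NONE, KIND_IF, KIND_ID } Kind;

/* Rule 1: if = if */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static int is_letter_(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* Rule 2: id = letter_ (letter_ | digit)* */
static size_t match_ident(const char *s)
{
    size_t n;
    if (!is_letter_(s[0]))
        return 0;
    for (n = 1; is_letter_(s[n]) || (s[n] >= '0' && s[n] <= '9'); n++)
        ;
    return n;
}

/* Tries the rules in definition order and keeps the longest match.
 * A later rule wins only if it is strictly longer, so ties go to
 * the rule that was defined first. */
Kind classify(const char *s, size_t *length)
{
    size_t (*rules[])(const char *) = { match_if, match_ident };
    Kind kinds[] = { KIND_IF, KIND_ID };
    Kind best = KIND_NONE;

    *length = 0;
    for (size_t i = 0; i < 2; i++) {
        size_t n = rules[i](s);
        if (n > *length) {
            *length = n;
            best = kinds[i];
        }
    }
    return best;
}
```

With this ordering, `if` is classified as the keyword (both rules match two characters, the keyword rule comes first) and `iffy` as an identifier (the identifier rule matches four characters).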


Creating a lexical analyzer

• Given a list of token definitions (pattern name, regex), write a program such that
– Input: String to be analyzed
– Output: List of tokens

• How do we build an analyzer?


A Simplified Scanner for C

Token nextToken()
{
    int c;                        /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;          /* skip whitespace */
    case ';': return SemiColumn;
    case '+':
        c = getchar();            /* look one character ahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default: ungetc(c, stdin); return Plus;
        }
    case '<': …
    case 'w': …
    }
}


But we have a much better way!

• Generate a lexical analyzer automatically from token definitions

• Main idea: Use finite-state automata to match regular expressions
– Regular languages are defined equivalently by regular expressions and by finite-state automata


Reg-exp vs. automata

• Regular expressions are declarative
– Offer compact way to define a regular language by humans
– Don’t offer direct way to check whether a given word is in the language
• Automata are operative
– Define an algorithm for deciding whether a given word is in a regular language
– Not a natural notation for humans


Overview

• Define tokens using regular expressions

• Construct a nondeterministic finite-state automaton (NFA) from regular expression

• Determinize the NFA into a deterministic finite-state automaton (DFA)

• DFA can be directly used to identify tokens


Finite-State Automata

• Deterministic automaton
• M = (Σ, Q, δ, q0, F)
– Σ – alphabet
– Q – finite set of states
– q0 ∈ Q – initial state
– F ⊆ Q – final states
– δ : Q × Σ → Q – transition function
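To make the five components concrete, here is a sketch that simulates one small DFA. The automaton itself (accepting a b* c over the alphabet {a, b, c}) is a hypothetical example chosen for this illustration, not one taken from the slides.

```c
#include <stddef.h>

enum { DEAD = -1 };                       /* stands for a missing transition */

/* Hypothetical DFA over the alphabet {a, b, c} accepting a b* c:
 * Q = {0, 1, 2}, q0 = 0, F = {2}. This function plays the role of δ. */
static int delta(int state, char symbol)
{
    switch (state) {
    case 0:  return symbol == 'a' ? 1 : DEAD;
    case 1:  return symbol == 'b' ? 1 : symbol == 'c' ? 2 : DEAD;
    default: return DEAD;                 /* state 2 has no outgoing transitions */
    }
}

static int is_final(int state)
{
    return state == 2;
}

/* Runs the DFA on word, reading it left to right. */
int dfa_accepts(const char *word)
{
    int state = 0;                        /* start in q0 */
    for (size_t i = 0; word[i] != '\0'; i++) {
        state = delta(state, word[i]);
        if (state == DEAD)
            return 0;                     /* missing transition: reject */
    }
    return is_final(state);               /* accept iff we end in F */
}
```

Because δ returns at most one next state, the simulation keeps track of a single current state and needs time linear in the length of the word.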


Finite-State Automata

• Non-Deterministic automaton
• M = (Σ, Q, δ, q0, F)
– Σ – alphabet
– Q – finite set of states
– q0 ∈ Q – initial state
– F ⊆ Q – final states
– δ : Q × (Σ ∪ {Є}) → 2^Q – transition function
• Possible Є-transitions
• For a word w, M can reach a number of states or get stuck. If some state reached is final, M accepts w.
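An NFA can be simulated by tracking the set of all states it could be in. The sketch below uses a bitmask for the state set and a hypothetical NFA+Є for (a|b)*abb with one Є-transition; the automaton and all names are assumptions made for this example.

```c
#include <stddef.h>

enum { NUM_STATES = 5 };
typedef unsigned StateSet;                /* bit i set <=> state i is in the set */

/* Hypothetical NFA+Є for (a|b)*abb: q0 = 0, F = {4}.
 * Returns the set of states reachable from state on symbol. */
static StateSet step(int state, char symbol)
{
    switch (state) {
    case 1:
        if (symbol == 'a') return (1u << 1) | (1u << 2);  /* two a-transitions */
        if (symbol == 'b') return 1u << 1;
        return 0;
    case 2:  return symbol == 'b' ? 1u << 3 : 0;
    case 3:  return symbol == 'b' ? 1u << 4 : 0;
    default: return 0;
    }
}

/* The only Є-transition in this automaton goes from state 0 to state 1. */
static StateSet epsilon_step(int state)
{
    return state == 0 ? 1u << 1 : 0;
}

/* Adds every state reachable through Є-transitions alone. */
static StateSet epsilon_closure(StateSet set)
{
    StateSet previous;
    do {
        previous = set;
        for (int q = 0; q < NUM_STATES; q++)
            if (set & (1u << q))
                set |= epsilon_step(q);
    } while (set != previous);
    return set;
}

int nfa_accepts(const char *word)
{
    StateSet current = epsilon_closure(1u << 0);
    for (size_t i = 0; word[i] != '\0'; i++) {
        StateSet next = 0;
        for (int q = 0; q < NUM_STATES; q++)
            if (current & (1u << q))
                next |= step(q, word[i]);
        current = epsilon_closure(next);
    }
    return (current & (1u << 4)) != 0;    /* accept if some reached state is final */
}
```

An empty set corresponds to the automaton being stuck; the word is then rejected because no final state is in the set.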


Deterministic Finite automata

[Figure: a deterministic automaton with a start state, transitions labeled a, b, b and c, and an accepting state.]

• An automaton is defined by states and transitions


Accepting Words

• Words are read left-to-right

[Figure: the automaton (start state, transitions labeled a, b, b, c) running on the input tape c b a.]


• Word accepted (input tape: c b a)


Rejecting Words

• Missing transition means non-acceptance

[Figure: the automaton (start state, transitions labeled a, b, b, c) running on the input tape c b b.]


Non-deterministic automata

• Allow multiple transitions from given state labeled by same letter

[Figure: a non-deterministic automaton with a start state and transitions labeled a, a, b, c, c, b.]


NFA run example

• Maintain set of states

[Figure: the non-deterministic automaton running on the input tape c b a.]


• Accept word if any of the states in the set is accepting


NFA+Є automata

• Є transitions can “fire” without reading the input

[Figure: an automaton with a start state, an Є transition, and transitions labeled a, b, c.]


NFA+Є run example

• Now Є transition can non-deterministically take place

[Figure: the NFA+Є automaton running on the input tape c b a.]


• Word accepted


From regular expressions to NFA

• Step 1: assign expression names and obtain pure regular expressions R1…Rm

• Step 2: construct an NFA Mi for each regular expression Ri

• Step 3: combine all Mi into a single NFA

• Ambiguity resolution: prefer longest accepting word


From reg. exp. to automata

• Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
• Proof: by induction on the structure of the regular expression


Basic constructs

R = Є
R = ∅
R = a

[Figure: one small automaton per base case; the automaton for R = a has a transition labeled a.]


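Composition

R = R1 | R2
[Figure: an automaton that offers M1 and M2 as alternatives.]

R = R1R2
[Figure: an automaton that runs M1 followed by M2.]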


Repetition

R = R1*
[Figure: an automaton for R1* built around M1.]


What now?

• Naïve approach: try each automaton separately

• Given a word w:
– Try M1(w)
– Try M2(w)
– …
– Try Mn(w)

• Requires resetting after every attempt


Combine automata

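[Figure: a new start state 0 combines four NFAs through Є-transitions to their start states 1, 3, 7 and 9:
  a:     1 -a-> 2
  abb:   3 -a-> 4 -b-> 5 -b-> 6
  a*b+:  7 -b-> 8, with an a-loop on 7 and a b-loop on 8
  abab:  9 -a-> 10 -b-> 11 -a-> 12 -b-> 13
The accepting states are 2, 6, 8 and 13.]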


Ambiguity resolution

• Recall…
– Longest word
– Tie-breaker based on order of rules when words have same length
• Recipe
– Turn NFA to DFA
– Run until stuck, remember last accepting state, this is the token to be returned


Corresponding DFA

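DFA state (set of NFA states)    on a            on b            accepts
(0 1 3 7 9), start               (2 4 7 10)      (8)
(2 4 7 10)                       (7)             (5 8 11)        a
(7)                              (7)             (8)
(8)                                              (8)             a*b+
(5 8 11)                         (12)            (6 8)           a*b+
(6 8)                                            (8)             abb
(12)                                             (13)
(13)                                                             abab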


Examples

Using the DFA for the patterns a, abb, a*b+ and abab:

• abaa: gets stuck after aba in state 12, backs up to state (5 8 11); pattern is a*b+, token is ab
• abba: stops after second b in (6 8); token is abb because it comes first in spec
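The two runs above can be reproduced with a small table-driven sketch. The table encodes the DFA for the patterns a, abb, a*b+ and abab; the DFA state numbers 0 to 7 and the function name `scan` are choices made for this illustration.

```c
#include <stddef.h>

enum { DEAD = -1 };

/* Rows are DFA states, columns are the inputs a and b.
 * The comments give the set of NFA states each DFA state stands for. */
static const int edges[8][2] = {
    /* 0: (0 1 3 7 9) */ { 1,    3    },
    /* 1: (2 4 7 10)  */ { 2,    4    },
    /* 2: (7)         */ { 2,    3    },
    /* 3: (8)         */ { DEAD, 3    },
    /* 4: (5 8 11)    */ { 6,    5    },
    /* 5: (6 8)       */ { DEAD, 3    },
    /* 6: (12)        */ { DEAD, 7    },
    /* 7: (13)        */ { DEAD, DEAD },
};

/* Pattern reported by each accepting state; NULL if not accepting.
 * State 5 matches both abb and a*b+; abb wins because it is listed first. */
static const char *const accepts[8] = {
    NULL, "a", NULL, "a*b+", "a*b+", "abb", NULL, "abab"
};

/* Scans the longest token at the start of input. Returns the name of the
 * matched pattern (NULL if none) and stores the token length in *length. */
const char *scan(const char *input, size_t *length)
{
    int state = 0;
    const char *last_pattern = NULL;
    size_t pos = 0;

    *length = 0;
    while (input[pos] == 'a' || input[pos] == 'b') {
        state = edges[state][input[pos] - 'a'];
        if (state == DEAD)
            break;                        /* stuck: fall back to the last accepting state */
        pos++;
        if (accepts[state] != NULL) {
            last_pattern = accepts[state];
            *length = pos;                /* remember where the last accepted token ends */
        }
    }
    return last_pattern;
}
```

On `abaa` the scanner reads `aba`, gets stuck, and returns the two-character token for a*b+; on `abba` it returns the three-character token for abb.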


Summary of Construction

• Describes tokens as regular expressions
– and decides which attributes (values) are saved for each token
• Regular expressions turned into a DFA
– describes expressions and specifies which attributes (values) to keep
• Lexical analyzer simulates the run of an automaton with the given transition table on any input string


A Few Remarks

• Turning an NFA to a DFA is expensive, but
– Exponential in the worst case
– In practice, works fine
• The construction is done once per-language
– At Compiler construction time
– Not at compilation time


Implementation by Example

if                                  { return IF; }
[a-z][a-z0-9]*                      { return ID; }
[0-9]+                              { return NUM; }
[0-9]"."[0-9]*|[0-9]*"."[0-9]+      { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
.                                   { error(); }


int edges[][256] = {
               /* …, 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */  {0, …, 0, 0, …, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
/* state 1 */  {13, …, 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4, …, 13, 13},
/* state 2 */  {0, …, 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, …, 0, 0},
/* state 3 */  {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
/* state 4 */  {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
/* state 5 */  {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0},
/* state 6 */  {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 7 */
/* state … */  ...
/* state 13 */ {0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}
};


Pseudo Code for Scanner

Token nextToken()
{
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][*currentPosition];
        advance currentPosition;    /* the character has now been consumed */
        if (isFinal(nextState)) {
            /* remember the longest token seen so far and where it ends */
            lastFinal = nextState;
            inputPositionAtLastFinal = currentPosition;
        }
        currentState = nextState;
    }
    input = inputPositionAtLastFinal;    /* back up to the end of the last accepted token */
    return action[lastFinal];
}


Example

Input: “if --not-a-com”


final   state   input
0       1       if --not-a-com
2       2       if --not-a-com
3       3       if --not-a-com
3       0       if --not-a-com      return IF


final   state   input
0       1        --not-a-com
12      12       --not-a-com
12      0        --not-a-com        found whitespace


final   state   input
0       1       --not-a-com
9       9       --not-a-com
9       10      --not-a-com
9       10      --not-a-com
9       10      --not-a-com
9       0       --not-a-com         error


final   state   input
0       1       -not-a-com
9       9       -not-a-com
9       0       -not-a-com          error


Efficient Scanners

• Efficient state representation
• Input buffering
• Using switch and gotos instead of tables


DFSM from Specification

• Create a non-deterministic automaton (NDFA) from every regular expression
• Merge all the automata using epsilon moves (like the | construction)
• Construct a deterministic finite automaton (DFA)
– State priority
• Minimize the automaton, starting from a partition that separates accepting states by token kinds


NDFA Construction

if                  { return IF; }
[a-z][a-z0-9]*      { return ID; }
[0-9]+              { return NUM; }


DFA Construction


Minimization


Errors in lexical analysis

pi = 3.141.562    →  Illegal token
pi = 3oranges     →  Illegal token
pi = oranges3     →  <ID,"pi">, <EQ>, <ID,"oranges3">


Error Handling

• Many errors cannot be identified at this stage
• Example: “fi (a==f(x))”. Should “fi” be “if”? Or is it a routine name?
– We will discover this later in the analysis
– At this point, we just create an identifier token
• Sometimes the lexeme does not match any pattern
– Easiest: eliminate letters until the beginning of a legitimate lexeme
– Alternatives: eliminate/add/replace one letter, replace order of two adjacent letters, etc.
• Goal: allow the compilation to continue
• Problem: errors that spread all over
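The “easiest” strategy above, often called panic-mode recovery, can be sketched as follows. The toy language (lowercase identifiers, numbers, parentheses, + and *) and both function names are assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: can a legitimate lexeme of the toy language
 * (or whitespace between lexemes) start with the character c? */
static int can_start_lexeme(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
           c == '(' || c == ')' || c == '+' || c == '*' || c == ' ';
}

/* Returns how many characters to discard so that scanning can resume
 * at the beginning of a legitimate lexeme (or at the end of input). */
size_t skip_illegal(const char *input)
{
    size_t n = 0;
    while (input[n] != '\0' && !can_start_lexeme(input[n]))
        n++;
    return n;
}
```

A scanner would typically report one error for the whole skipped run and then continue from `input + skip_illegal(input)`, which keeps a single bad character sequence from producing a cascade of messages.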


Use of Program-Generating Tools

• Automatically generated lexical analyzer
– Specification → Part of compiler
– Compiler-Compiler

[Figure: regular expressions are fed to JFlex, which generates the scanner; the scanner reads the input program and produces a stream of tokens.]


• Input: regular expressions and actions
– Action = Java code
• Output: a scanner program that
– Produces a stream of tokens
– Invokes actions when a pattern is matched


Line Counting Example

• Create a program that counts the number of lines in a given input text file


Creating a Scanner using Flex

%{
#include <stdio.h>
int num_lines = 0;
%}
%%
\n      ++num_lines;
.       ;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}


[Figure: the automaton for the line-counting specification, with the states initial, newline and other; transitions labeled \n lead to newline and transitions labeled ^\n (any other character) lead to other.]


JFlex Spec File

User code: copied directly to Java file (possible source of javac errors down the road)
%%
JFlex directives: macros, state names
– Macros, e.g. DIGIT= [0-9], LETTER= [a-zA-Z]
– State names, e.g. YYINITIAL
%%
Lexical analysis rules:
– Optional state, regular expression, action, e.g. {LETTER}({LETTER}|{DIGIT})*
– How to break input to tokens
– Action when token matched


Creating a Scanner using JFlex

import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}

%eofval{
  System.out.println("line number=" + lineCounter);
  return new Symbol(sym.EOF);
%eofval}

NEWLINE=\n
%%
{NEWLINE} { lineCounter++; /* count each line break */ }
[^\n]     { /* ignore every other character */ }


Catching errors

• What if input doesn’t match any token definition?

• Trick: Add a “catch-all” rule that matches any character and reports an error
– Add after all other rules
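The catch-all trick can be sketched in plain Java with java.util.regex. This is a minimal illustration, not how a JFlex-generated scanner is implemented: the rule list, the token names and the class CatchAllScanner are assumptions made for the example. The scanner picks the longest match, breaks ties in favor of the earlier rule, and therefore reaches the final "any character" rule only when nothing else matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CatchAllScanner {
    // Illustrative rule set; order matters because ties go to the earlier rule.
    static final String[] NAMES = {"WS", "ID", "NUM", "PLUS", "ERROR"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("[ \\t\\n]+"),
        Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("\\+"),
        Pattern.compile("(?s)."),   // catch-all: any single character, listed last
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (int r = 0; r < PATTERNS.length; r++) {
                Matcher m = PATTERNS[r].matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = NAMES[r];
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalStateException("no rule matched at position " + pos);
            }
            if (!bestName.equals("WS")) {   // whitespace is skipped, not reported
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ID(x), PLUS(+), NUM(1), ERROR(#), ID(y)]
        System.out.println(scan("x + 1 # y"));
    }
}
```

Because the catch-all rule matches exactly one character, it never wins against a longer legitimate token, and scanning can continue after the error is reported.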


A JFlex specification of C Scanner

import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}
Letter= [a-zA-Z_]
Digit= [0-9]
%%
"\t"    { }
"\n"    { lineCounter++; }
";"     { return new Symbol(sym.SemiColumn); }
"++"    { return new Symbol(sym.PlusPlus); }
"+="    { return new Symbol(sym.PlusEq); }
"+"     { return new Symbol(sym.Plus); }
"while" { return new Symbol(sym.While); }
{Letter}({Letter}|{Digit})* { return new Symbol(sym.Id, yytext()); }
"<="    { return new Symbol(sym.LessOrEqual); }
"<"     { return new Symbol(sym.LessThan); }


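A Simplified Scanner for C

Token nextToken() {
    int c;                       /* int rather than char, so EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;         /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();           /* one character of lookahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); return Plus;   /* push the lookahead back */
        }
    case '<': …
    case 'w': …
    }
}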


Automatic Construction of Scanners

• Construction is done automatically by common tools
• lex is your friend
– Automatically generates a lexical analyzer from a declaration file
• Advantages: short declaration file, easily checked, easily modified and maintained

[Figure: declaration file -> lex -> lexical analysis; characters -> lexical analysis -> tokens]

Intuitively:
• Lex builds a DFA table
• The analyzer simulates (runs) the DFA on a given input
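To make "simulating the DFA table" concrete, here is a small hand-written sketch for the identifier pattern {LETTER}({LETTER}|{DIGIT})*. The table layout, the state numbering and the class name TableDrivenDfa are assumptions chosen for illustration; tables generated by lex are larger and usually compressed.

```java
public class TableDrivenDfa {
    // Character classes (columns of the table); names are illustrative.
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // States (rows): 0 = start, 1 = inside an identifier (accepting), 2 = dead.
    static final int[][] DELTA = {
        /* start  */ {1, 2, 2},
        /* inside */ {1, 1, 2},
        /* dead   */ {2, 2, 2},
    };
    static final boolean[] ACCEPTING = {false, true, false};

    static int classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return LETTER;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    // Runs the DFA: one table lookup per input character, so O(|w|) time.
    static boolean accepts(String w) {
        int state = 0;
        for (int i = 0; i < w.length(); i++) {
            state = DELTA[state][classify(w.charAt(i))];
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("x1"));  // prints true
        System.out.println(accepts("1x"));  // prints false
    }
}
```

The driver loop never changes; only the table depends on the regular expressions, which is what makes automatic generation practical.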


Lecture Outline

• Role & place of lexical analysis
• What is a token?
• Regular languages
• Lexical analysis
• Error handling
• Automatic creation of lexical analyzers


Start States

• It may be hard to specify regular expressions for certain constructs
– Examples
  • Strings
  • Comments
• Writing automata may be easier
• Can combine both
• Specify partial automata with regular expressions on the edges
– No need to specify all states
– Different actions at different states
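The idea of start states can be sketched by hand in Java for /* ... */ comments. The state names YYINITIAL and COMMENT follow the JFlex naming used in these slides, but the class StartStates and the method stripComments are hypothetical, and a generated scanner would express the same behavior as rules attached to each state rather than as a switch.

```java
public class StartStates {
    enum State { YYINITIAL, COMMENT }

    // Returns the input with /* ... */ comments removed.
    // Each state reacts to the input differently, as rules guarded by start states would.
    static String stripComments(String in) {
        StringBuilder out = new StringBuilder();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < in.length()) {
            switch (state) {
                case YYINITIAL:
                    if (in.startsWith("/*", i)) {   // edge: YYINITIAL -> COMMENT
                        state = State.COMMENT;
                        i += 2;
                    } else {                        // ordinary character: keep it
                        out.append(in.charAt(i));
                        i++;
                    }
                    break;
                case COMMENT:
                    if (in.startsWith("*/", i)) {   // edge: COMMENT -> YYINITIAL
                        state = State.YYINITIAL;
                        i += 2;
                    } else {                        // inside a comment: discard
                        i++;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("a = 1; /* set a */ b = 2;"));
    }
}
```

An unterminated comment simply consumes the rest of the input here; a real scanner would report it as an error at end of file.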


Missing

• Creating a lexical analyzer by hand
• Table compression
• Symbol tables
• Nested comments
• Handling macros


What does Lexical Analysis do?

• Input: program text (file)
• Output: sequence of tokens

• Read input file
• Identify language keywords and standard identifiers
• Handle include files and macros
• Count line numbers
• Remove whitespace
• Report illegal symbols

• [Produce symbol table]


Lexical Analysis – Impl.

• Lexical analyzer
– Turns character stream into token stream
– Tokens defined using regular expressions
– Regular expressions -> NFA -> DFA construction for identifying tokens
– Automated construction of lexical analyzer using lex


Automatic Construction of Scanners

• For most programming languages lexical analyzers can be easily constructed automatically
– Exceptions:
  • Fortran
  • PL/1
• Lex/Flex/JLex/JFlex are useful tools


Next Week

• Syntax analysis (aka parsing)

• Lecture in this room
– Trubowicz 101 (Law school)


NFA vs. DFA

• (a|b)*a(a|b)(a|b)…(a|b), with (a|b) repeated n times after the a

Automaton   SPACE       TIME
NFA         O(|r|)      O(|r|*|w|)
DFA         O(2^|r|)    O(|w|)


From NFA+ε to DFA

• Constructing an NFA+ε requires O(n) states for a reg-exp of length n

• Running an NFA+ε with n states on a string of length m takes O(m·n²) time

• Solution: determinization via subset construction
– Number of states worst-case exponential in n
– Running time O(m)


Subset construction

• For an NFA+ε with states M={s1,…,sk}
• Construct a DFA with one state per set of states of the corresponding NFA
– M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
• Simulate transitions between individual states for every letter

[Figure: the NFA+ε transitions s1 -a-> s2 and s4 -a-> s7 become the DFA transition [s1,s4] -a-> [s2,s7]]


Subset construction

• For an NFA+ε with states M={s1,…,sk}
• Construct a DFA with one state per set of states of the corresponding NFA
– M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
• Extend macro states by states reachable via ε transitions

[Figure: the NFA+ε transition s1 -ε-> s4 extends the DFA macro state [s1,s2] to [s1,s2,s4]]
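The two steps of the subset construction (simulate every letter on all states of a macro state, then extend the result by ε-reachable states) can be sketched in Java as follows. This is a minimal worklist version under stated assumptions: states are integers, the character '\0' is used as a hypothetical marker for ε edges, and the class and method names are invented for the example. It only builds the macro states that are reachable from the start state.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SubsetConstruction {
    static final char EPS = '\0';   // assumption: marks ε edges in this sketch

    // NFA+ε transitions: state -> (symbol -> set of target states)
    final Map<Integer, Map<Character, Set<Integer>>> trans = new HashMap<>();

    void addEdge(int from, char symbol, int to) {
        trans.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(symbol, k -> new TreeSet<>())
             .add(to);
    }

    Set<Integer> step(int state, char symbol) {
        Map<Character, Set<Integer>> edges = trans.get(state);
        if (edges == null || !edges.containsKey(symbol)) return Collections.emptySet();
        return edges.get(symbol);
    }

    // Extend a macro state by every state reachable via ε transitions.
    Set<Integer> epsClosure(Set<Integer> states) {
        Set<Integer> closure = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : step(s, EPS)) {
                if (closure.add(t)) work.push(t);
            }
        }
        return closure;
    }

    // Returns the DFA table: macro state -> (letter -> macro state).
    Map<Set<Integer>, Map<Character, Set<Integer>>> determinize(int start, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> startSet = epsClosure(Collections.singleton(start));
        dfa.put(startSet, new HashMap<>());
        work.add(startSet);
        while (!work.isEmpty()) {
            Set<Integer> macro = work.remove();
            for (char a : alphabet) {
                Set<Integer> moved = new TreeSet<>();
                for (int s : macro) moved.addAll(step(s, a));   // simulate letter a
                Set<Integer> target = epsClosure(moved);        // add ε-reachable states
                dfa.get(macro).put(a, target);
                if (!dfa.containsKey(target)) {                 // new macro state found
                    dfa.put(target, new HashMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        // Edges taken from the two figures: s1 -a-> s2, s4 -a-> s7, s1 -ε-> s4.
        SubsetConstruction nfa = new SubsetConstruction();
        nfa.addEdge(1, 'a', 2);
        nfa.addEdge(4, 'a', 7);
        nfa.addEdge(1, EPS, 4);
        System.out.println(nfa.determinize(1, new char[] {'a'}));
    }
}
```

With these three edges and start state s1, the start macro state is [s1,s4], and its a-transition leads to [s2,s7]. The empty set [] also appears, as the dead state reached when no NFA state has a matching edge.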