lexical analysis - southerncs.southern.edu/halterman/courses/spring2008/415/slides/...lexical...



Post on 23-Sep-2020


Page 1: Lexical Analysis - Southerncs.southern.edu/halterman/Courses/Spring2008/415/Slides/...Lexical Analyzers Also called lexers or scanners Recognize tokens that make up the source language

Lexical Analysis

Chapter 3

Compiler Construction Lexical Analysis 1


Lexical Analyzers

• Also called lexers or scanners

• Recognize tokens that make up the source language

• Examples: lex, AWK, Perl

• Have applications in areas other than traditional language processing

– Defect detection in circuit boards


Role of the Lexical Analyzer

• First phase of compilation

• Input to parser is the output of the lexer

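[Figure: the lexer reads the source code and hands a token to the parser on each "get next token" request; the lexer and the parser both use the symbol table]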


Activity of Lexical Analyzer

• Extracts tokens from the source code and sends them to the parser

• May discard some “tokens” (like white space)

• May record some statistics

• Likely records line numbers for useful error messages (count newline characters)


Why Separate Lexing from Parsing?

• Divide and conquer strategy

– Simpler design of both parts

– The scanner can filter out comments and whitespace; the parser can be made simpler since it never sees them and need not be built to handle them

• Can improve efficiency

– Specialized buffering techniques

• Improves compiler portability

– Separate lexical analyzer can handle special characters (e.g., ASCII vs. Unicode)

– Character anomalies can be isolated to the scanner


Terminology

Token vs. pattern vs. lexeme

• Token: represents a set of strings in the source language

• Pattern: a rule associating the set of strings with a token

• Lexeme: a sequence of characters in the source code that matches a pattern to produce a token


Examples

Token        Lexeme          Pattern
if           if              if
identifier   pi, x2, total   (letter)(letter | digit)∗
number       23, 5, 44       (digit)(digit)∗
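The token, lexeme, and pattern relationship in the table can be sketched in a few lines of Python. This is only an illustration: the `PATTERNS` list and the `classify` helper are names made up here, and the patterns are the table's patterns rewritten in Python `re` syntax, assuming ASCII letters and digits.

```python
import re

# The table's patterns written as Python regular expressions.
# Order matters: the keyword "if" is tried before the general identifier pattern.
PATTERNS = [
    ("if", r"if"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9][0-9]*"),
]

def classify(lexeme):
    """Return the token whose pattern matches the whole lexeme, or None."""
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None

for lexeme in ["if", "pi", "x2", "total", "23", "5", "44"]:
    print(lexeme, "->", classify(lexeme))
```

Listing the keyword first is one simple way to keep `if` from being reported as an identifier; real scanners may resolve this differently.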


Tokens

• Terminal symbols in the grammar of the language

• A token is treated as a unit

• The book gives some examples of difficulties in parsing some (ancient) languages (see Pages 86, 87)


Token Attributes

While both 22 and 5 are lexemes that match the pattern for number, it is important for the code to know which value to use

li $a0 22
sw $a0 4($sp)


Token Attributes

• The actual lexeme responsible for a token may be stored as an attribute of the token

• The lexeme may be stored in a symbol table, and a pointer to this lexeme may be stored as an attribute of the token


Lexical Errors

• The lexical analyzer has limited ability to detect errors

• Consider the C/C++ code:

fi ( a == f(x) )

• Programmer meant if?

• Call to method fi()?


Lexical Errors

• Lexical errors are possible

– The prefix of the remaining input does not match a pattern for any token

• What to do?

– Bail out entirely
– Discard input until recovery is possible
– Try to fix
  ∗ Add missing character
  ∗ Transpose adjacent characters
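The "discard input until recovery is possible" option can be sketched as follows. This is a minimal illustration, not a production scanner: the token set (numbers, identifiers, whitespace) and the `scan` function are assumptions made for the example.

```python
import re

# A hypothetical token set: numbers, identifiers, and whitespace.
TOKEN = re.compile(
    r"(?P<number>[0-9]+)|(?P<identifier>[A-Za-z][A-Za-z0-9]*)|(?P<ws>[ \t\n]+)"
)

def scan(text):
    """Return (tokens, errors); errors holds (position, character) pairs."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # No pattern matches a prefix of the remaining input: a lexical error.
            errors.append((pos, text[pos]))
            pos += 1  # recovery: discard one character and try again
            continue
        if m.lastgroup != "ws":  # whitespace is matched but not passed on
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors

print(scan("x1 = 42 # y"))
```

Discarding a single character at a time is the simplest recovery policy; it lets the scanner report several errors in one pass instead of stopping at the first.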


Token Specification

• Strings

• Languages

• Regular expressions

• Regular definitions


Strings

• Alphabet: a finite set of symbols

– Normally characters of some character set
– E.g., ASCII, Unicode
– Σ is used to represent an alphabet

• String: a finite sequence of symbols from some alphabet

– If s is a string, then |s| is its length
– The empty string is symbolized by ε


String Operations

Concatenation

• x = hi, y = bye −→ xy = hibye

• sε = s = εs

$$s^i = \begin{cases} \varepsilon, & \text{if } i = 0 \\ s^{i-1}s, & \text{if } i > 0 \end{cases}$$
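The recursive definition of string exponentiation translates directly into code. The sketch below uses a hypothetical helper named `power` and represents ε as Python's empty string.

```python
def power(s, i):
    """Return s^i: the empty string for i = 0, otherwise s^(i-1) followed by s."""
    if i == 0:
        return ""
    return power(s, i - 1) + s

print(power("ab", 3))   # three copies of "ab" concatenated
print("hi" + "bye")     # concatenation of x = hi and y = bye
```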


Parts of a String

• Prefix

• Suffix

• Substring

• Proper prefix, suffix, or substring

• Subsequence


Language

• A language is a set of strings over some alphabet

L ⊆ Σ∗

• Examples:

– ∅ is a language
– {ε} is a language
– The set of all legal Java programs
– The set of all correct English sentences


Operations on Languages

Of most concern for lexical analysis

• Union

• Concatenation

• Closure


Union

The union of languages L and M

L ∪ M = {s | s ∈ L or s ∈ M}


Concatenation

The concatenation of languages L and M

LM = {st | s ∈ L and t ∈ M}


Kleene Closure

The Kleene closure of language L

$$L^* = \bigcup_{i=0}^{\infty} L^i$$

Zero or more concatenations


Positive Closure

The positive closure of language L

$$L^+ = \bigcup_{i=1}^{\infty} L^i$$

One or more concatenations


Example

• Let L = {A,B,C, . . . ,Z,a,b,c, . . . ,z}

• Let D = {0,1,2, . . . ,9}

L ∪ D

LD

L⁴

L∗

L(L ∪ D)∗

D+
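To get a feel for these sets, the operations can be tried on Python sets of strings. This is a sketch under stated assumptions: the helper names (`union`, `concat`, `power`, `closure`) are invented here, and because L∗ and D+ are infinite, `closure` only builds the finite approximation up to a chosen power.

```python
import string

def union(L, M):
    """L ∪ M"""
    return L | M

def concat(L, M):
    """LM = {st | s in L and t in M}"""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, with L^0 = {ε} (ε is the empty string here)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure(L, max_power, positive=False):
    """Finite approximation of L* (or L+): the union of L^i up to max_power."""
    result = set()
    for i in range(1 if positive else 0, max_power + 1):
        result |= power(L, i)
    return result

L = set(string.ascii_letters)   # {A, ..., Z, a, ..., z}
D = set(string.digits)          # {0, ..., 9}

print(len(union(L, D)))         # letters and digits
print(len(concat(L, D)))        # a letter followed by a digit
print(sorted(closure(D, 2, positive=True))[:5])
```

L⁴ is left out of the demonstration on purpose: it already contains 52⁴ strings, which is a reminder of how quickly these sets grow.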


Regular Expressions

• A convenient way to represent languages that can be processed by lexical analyzers

• Notation is slightly different from the set notation presented for languages

• A regular expression is built from simpler regular expressions using a set of defining rules

• A regular expression represents strings that are members of some regular set


Rules for Defining Regular Expressions

• The regular expression r denotes the language L(r)

• ε is a regular expression that denotes {ε}, the set containing the empty string

• If a is a symbol in the alphabet, then a is a regular expression that denotes {a}, the set containing the string a

• How to distinguish among these notations


Combining Regular Expressions

• Let r and s be regular expressions that denote the languages L(r) and L(s) respectively

(r)|(s) is a regular expression denoting L(r) ∪ L(s)
(r)(s) is a regular expression denoting L(r)L(s)
(r)∗ is a regular expression denoting (L(r))∗
(r) is a regular expression denoting L(r)

• The language denoted by a regular expression is called a regular set


More Formally

a ∈ Σ

E and F are regular expressions

L(∅) = ∅

L(ε) = {ε}

L(a) = {a}

L(EF) = {st | s ∈ L(E) and t ∈ L(F)}

L(E | F) = L(E)∪L(F)

L((E)) = L(E)

L(E∗) = L(E)∗


Precedence Rules

• Precedence rules help simplify regular expressions

– Kleene closure has highest precedence
– Concatenation has next highest
– | has lowest precedence

• All operators associate left-to-right


Example

• Let Σ = {a,b}

• Find the strings in the language represented by the following regular expressions:

a | b

(a | b)(a | b)

a∗

(a | b)∗

a | a∗b

a(a | b)∗a
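One way to check answers to this exercise is to enumerate every string over Σ = {a, b} up to some length and keep those that a regular expression matches. The sketch below uses Python's `re` module; `strings_up_to` and `members` are helper names invented for the illustration, and the length bound of 3 is arbitrary.

```python
import re
from itertools import product

def strings_up_to(alphabet, n):
    """Yield every string over the alphabet of length 0..n, shortest first."""
    for length in range(n + 1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

def members(pattern, alphabet="ab", n=3):
    """Strings of length <= n that belong to the language of the pattern."""
    return [s for s in strings_up_to(alphabet, n) if re.fullmatch(pattern, s)]

for pattern in ["a|b", "(a|b)(a|b)", "a*", "(a|b)*", "a|a*b", "a(a|b)*a"]:
    print(pattern, "->", members(pattern))
```

The precedence rules show up in `a|a*b`: it is read as a | ((a∗)b), so "a" and "aab" are members but "aa" is not.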


Algebra of Regular Expressions

Property                                       Definition
| is commutative                               r | s = s | r
| is associative                               (r | s) | t = r | (s | t)
Concatenation is associative                   (rs)t = r(st)
Concatenation distributes over |               r(s | t) = rs | rt
                                               (s | t)r = sr | tr
ε is the identity element for concatenation    εr = r = rε
Relation between ∗ and ε                       (r | ε)∗ = r∗
∗ is idempotent                                r∗∗ = r∗


Mathematically Describing Relational Operators

Σ = { <, >, =, ! }

relop = < | > | <= | >= | == | !=


Identifiers and Numbers

Σ = { a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z,
      A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z,
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, _ }

letter = a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
       | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | _

digit = 0| 1| 2| 3| 4| 5| 6| 7| 8| 9

identifier = letter ( letter | digit)∗

number = digit digit∗


Finite Automata

A non-deterministic finite automaton (NFA) is a 5-tuple:

〈S,Σ,φ,s0,F〉

• S a set of states

• Σ a set of input symbols

• φ a transition function (S, Σ ∪ {ε}) −→ 2^S (a state and an input symbol, or ε, map to a set of next states)

• s0 a distinguished state called the start state

• F a set of accepting or final states


NFA Representation

An NFA can be conveniently represented by both a directed graph and a table

[Figure: directed graph of the four-state NFA tabulated below, with edges labeled a, b, and c]

Current State   a          b   c          Output
0               { 0, 2 }   –   3          0
1               –          2   0          1
2               2          –   { 1, 2 }   0
3               1          0   0          1

Final states

• are double circled (graph)

• output a 1 (table)


NFA Transition Graphs

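[Figure: two transition graphs: a two-state machine (states 0 and 1) with edges labeled l and l, d; and a four-state machine (states 0 to 3) with edges labeled a and b]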


Another NFA

[Figure: NFA transition graph with edges labeled a and b]


NFAs and Regular Sets

• An NFA can be built to recognize strings represented by a regular expression

(i.e., strings that are members of some regular set)

[Figure: NFA transition graph with edges labeled a and b]


NFAs as Recognizers

• Given an NFA M, L(M) is the language recognized by that machine

• If the NFA scans the complete string and ends in a final state, then the string is a member of L(M)

We say M accepts the string

• If the NFA scans the complete string and ends in a non-final state, then the string is not a member of L(M)

We say M rejects the string

• Because of non-determinism a string is accepted if there is a path to a final state; a string is rejected if there is no path to a final state

Think about the NFA following all non-deterministic paths in parallel


Deterministic Finite Automata (DFA)

• A special case of an NFA

• Also called a finite state machine

• No state has an ε-transition

• ∀s ∈ S and ∀a ∈ Σ, there is at most one edge labeled a leaving s

[Figure: two-state DFA; state 0 goes to state 1 on l, and state 1 loops to itself on l, d]

Current State   l   d   Output
0               1   –   0
1               1   1   1


DFA Simulation

DFA() {
    s ← s0;
    c ← nextchar();
    while c ≠ eof {
        s ← move(s, c);    // move is the φ : (S,Σ) → S function
        c ← nextchar();
    }
    if s ∈ F {
        return true;
    }
    return false;
}
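The simulation loop above can be made concrete with the two-state identifier DFA (letter followed by letters or digits) from the DFA slide. This is a minimal sketch: the dictionary encoding, the `classify` helper, and the choice to reject as soon as a transition is missing are assumptions of the example, not part of the pseudocode.

```python
def classify(ch):
    """Map a character to the input class used by the table: l (letter) or d (digit)."""
    if ch.isascii() and ch.isalpha():
        return "l"
    if ch.isascii() and ch.isdigit():
        return "d"
    return None

# Transition table of the identifier DFA: state 0 --l--> 1, state 1 --l,d--> 1.
TRANSITIONS = {(0, "l"): 1, (1, "l"): 1, (1, "d"): 1}
START, FINAL = 0, {1}

def dfa_accepts(text):
    state = START
    for ch in text:
        key = (state, classify(ch))
        if key not in TRANSITIONS:
            return False        # undefined move: the string is rejected
        state = TRANSITIONS[key]
    return state in FINAL

for word in ["x2", "total", "2x", ""]:
    print(repr(word), dfa_accepts(word))
```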


ε-closure

• If s ∈ S, then ε-closure(s) is the set of states reachable from state s using only ε-transitions (including s itself)

• If V ⊆ S, then ε-closure(V) is the set of states reachable from some state s ∈ V using only ε-transitions


ε-closure Computation

StateSet ε-closure(StateSet T) {
    result ← T; stack ← ∅;    // stack is a stack of states
    for all s ∈ T do {
        stack.push(s);
    }
    while stack ≠ ∅ {
        t ← stack.pop();
        for each state u with an edge from t to u labeled ε do
            if u ∉ result {
                result ← result ∪ {u};
                stack.push(u);
            }
    }
    return result;
}


NFA Simulation

NFA() {
    V ← ε-closure({s0});
    c ← nextchar();
    while c ≠ eof {
        // move here returns the set of states to which there is a
        // transition on input symbol c from some state s ∈ V
        V ← ε-closure(move(V, c));
        c ← nextchar();
    }
    if V ∩ F ≠ ∅ {
        return true;
    }
    return false;
}
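The ε-closure computation and the NFA simulation loop can be sketched together in Python. The NFA below is a hypothetical five-state machine for aa∗ | bb∗ written for this example (it is not transcribed from the slides); ε-transitions are stored under the empty-string label `EPS`.

```python
EPS = ""  # label used here for ε-transitions

# Hypothetical NFA for aa* | bb*: state -> {label -> set of next states}
NFA = {
    0: {EPS: {1, 3}},
    1: {"a": {2}},
    2: {"a": {2}},
    3: {"b": {4}},
    4: {"b": {4}},
}
START, FINAL = 0, {2, 4}

def eps_closure(states):
    """States reachable from `states` using only ε-transitions (stack-based)."""
    result = set(states)
    stack = list(states)
    while stack:
        t = stack.pop()
        for u in NFA.get(t, {}).get(EPS, set()):
            if u not in result:
                result.add(u)
                stack.append(u)
    return result

def move(states, c):
    """States reachable from some state in `states` on input symbol c."""
    return {u for s in states for u in NFA.get(s, {}).get(c, set())}

def nfa_accepts(text):
    current = eps_closure({START})
    for c in text:
        current = eps_closure(move(current, c))
    return bool(current & FINAL)

for word in ["aaa", "b", "", "ab"]:
    print(repr(word), nfa_accepts(word))
```

Tracking the whole set `current` is the "all paths in parallel" view of non-determinism: the string is accepted if any state in the final set is an accepting state.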


Regular Expression −→ NFA

• There are several strategies to build an NFA from a regular expression

• Your book provides Thompson’s method (p. 122)

1. Parse the regular expression into its basic subexpressions
– ε is a basic expression
– an alphabet symbol is a basic expression

2. Create primitive NFAs for these subexpressions

3. Guided by the regular expression operators and parentheses, inductively combine the sub-NFAs into the composite NFA representing the complete regular expression

• This is a syntax-directed approach


Basic Expression −→ Primitive NFA

For ε, the NFA is

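[Figure: start state i with an ε-edge to final state f]

For a ∈ Σ, the NFA is

[Figure: start state i with an edge labeled a to final state f]

Observe that both of these NFAs have exactly one start state and one final state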


s | t

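If N(s) is the NFA for regular expression s, and N(t) is the NFA for regular expression t, then N(s | t) is

[Figure: a new start state i with ε-edges to the start states of N(s) and N(t), and ε-edges from their final states to a new final state f]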


st

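If N(s) is the NFA for regular expression s, and N(t) is the NFA for regular expression t, then N(st) is

[Figure: N(s) followed by N(t); the start state i is the start of N(s), the final state f is the final state of N(t), and the final state of N(s) is joined to the start state of N(t)]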


s∗

If N(s) is the NFA for regular expression s, then N(s∗) is

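[Figure: a new start state i and a new final state f around N(s); ε-edges run from i to the start of N(s) and from i to f, and from the final state of N(s) back to its start and on to f]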


(s)

If N(s) is the NFA for regular expression s, then N((s)) is N(s) itself
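The construction rules for ε, a symbol, s | t, st, and s∗ can be sketched as small combinators that build NFA fragments. Everything named here (`Fragment`, `symbol`, `union`, `concat`, `star`, `accepts`) is hypothetical scaffolding for the illustration. One deliberate simplification: `concat` links the two fragments with an ε-edge instead of merging the final state of N(s) with the start state of N(t), which recognizes the same language.

```python
from itertools import count

EPS = None          # label used here for ε-edges
_ids = count()      # source of fresh state numbers

class Fragment:
    """An NFA fragment with exactly one start state and one final state."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges

def epsilon():
    i, f = next(_ids), next(_ids)
    return Fragment(i, f, [(i, EPS, f)])

def symbol(a):
    i, f = next(_ids), next(_ids)
    return Fragment(i, f, [(i, a, f)])

def union(s, t):
    i, f = next(_ids), next(_ids)
    new = [(i, EPS, s.start), (i, EPS, t.start), (s.final, EPS, f), (t.final, EPS, f)]
    return Fragment(i, f, s.edges + t.edges + new)

def concat(s, t):
    # Simplification: connect with an ε-edge rather than merging two states.
    return Fragment(s.start, t.final, s.edges + t.edges + [(s.final, EPS, t.start)])

def star(s):
    i, f = next(_ids), next(_ids)
    new = [(i, EPS, s.start), (i, EPS, f), (s.final, EPS, s.start), (s.final, EPS, f)]
    return Fragment(i, f, s.edges + new)

def accepts(nfa, text):
    """Simulate the fragment as an NFA on the given text."""
    def closure(states):
        result, stack = set(states), list(states)
        while stack:
            t = stack.pop()
            for p, label, q in nfa.edges:
                if p == t and label is EPS and q not in result:
                    result.add(q)
                    stack.append(q)
        return result
    current = closure({nfa.start})
    for c in text:
        current = closure({q for p, label, q in nfa.edges if p in current and label == c})
    return nfa.final in current

# (a|b)*abb, built bottom-up the way a syntax-directed translation would.
ab_star = star(union(symbol("a"), symbol("b")))
nfa = concat(concat(concat(ab_star, symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "babb"), accepts(nfa, "ab"))
```

Each combinator preserves the invariant noted on the primitive-NFA slide: one start state and one final state per fragment, which is what makes the inductive combination possible.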


NFA −→ DFA

• NFAs are difficult to simulate in a computer program

Non-determinism on a deterministic machine

• Fortunately, any NFA can be converted into an equivalent DFA

– A process known as subset construction is used to create the DFA
– Each state in the DFA is derived from a subset of the states in the NFA
– If the NFA has n states, its corresponding DFA may have up to 2^n states

Fortunately, this theoretical maximum is rare in practice


Subset Construction

NFAtoDFA() {
    E ← ε-closure({s0}); E.mark ← false; D ← {E};
    while ∃ T ∈ D such that T.mark = false do {
        T.mark ← true;
        for each a ∈ Σ do {
            U ← ε-closure(move(T, a));
            if U ∉ D {
                U.mark ← false;
                D ← D ∪ {U};
            }
            DTran[T][a] ← U;
        }
    }
}
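A Python sketch of the subset construction follows. It assumes an NFA stored as nested dictionaries (state, then label, then set of next states) with ε-transitions under the empty-string label; the function name, its parameters, and the sample NFA for (a | b)∗abb are illustrative choices. A work list of unmarked states stands in for the mark flags of the pseudocode.

```python
def subset_construction(nfa, start, alphabet, eps=""):
    """Return (start_state, dfa_states, dtran) for the NFA's equivalent DFA."""
    def closure(states):
        result, stack = set(states), list(states)
        while stack:
            t = stack.pop()
            for u in nfa.get(t, {}).get(eps, set()):
                if u not in result:
                    result.add(u)
                    stack.append(u)
        return frozenset(result)

    def move(states, a):
        return {u for s in states for u in nfa.get(s, {}).get(a, set())}

    first = closure({start})
    dstates = {first}
    unmarked = [first]
    dtran = {}
    while unmarked:
        T = unmarked.pop()              # taking T off the work list marks it
        for a in alphabet:
            U = closure(move(T, a))     # may be the empty set (a dead state)
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return first, dstates, dtran

# Hypothetical NFA for (a|b)*abb with final state 3.
NFA = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}}
first, dstates, dtran = subset_construction(NFA, 0, "ab")
finals = {T for T in dstates if 3 in T}
print(len(dstates), sorted(sorted(T) for T in dstates))
```

For this four-state NFA the construction yields only four DFA states rather than the worst-case 2⁴, which illustrates why the exponential bound is rarely reached.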


DFA Minimization

Goal: Given a DFA M, find a DFA M′ such that M′ exhibits the same external behavior as M, but M′ has fewer states than M

Reason: M′ will be simpler and more efficient

[Figure: transition graph of the five-state DFA tabulated below, with edges labeled a and b]

Current State   a   b   Output
0               2   1   1
1               2   0   1
2               4   3   0
3               2   3   1
4               0   1   0


DFA Minimization Procedure

1. Remove states unreachable from the start state

2. Ensure that all states have a transition on every input symbol (i.e., every element of Σ)

• Introduce a new “dead state” d if necessary
• ∀a ∈ Σ, φ(d,a) = d (i.e., move(d, a) = d, for all a)
• ∀s ∈ S, if ∃a such that φ(s,a) is undefined, define φ(s,a) = d

3. Collapse equivalent states into a single, representative state


Equivalent States

• We say string w distinguishes state s from state t if

1. starting DFA M in state s and feeding it string w we arrive at an accepting state, and

2. starting DFA M in state t and feeding it string w we arrive at a non-final state

or vice-versa

• w = ε distinguishes any final state from any non-final state

• We must find all sets of states that can be distinguished by some input string

• Two states that cannot be distinguished by any input string are called equivalent states


DFA Minimization Algorithm (1)

DFA minimize(DFA M) {
    // Part 1: Find equivalent states
    Σ ← M.Σ;            // M's alphabet
    S ← M.S;            // M's states
    F ← M.F;            // M's final states
    φ ← M.φ;            // M's transition function
    Π ← {F, S−F};       // Partition states into two blocks: final and non-final states
    Πold ← ∅;

    // Iteratively partition the blocks until no further partitioning occurs
    while Π ≠ Πold {
        Πold ← Π;
        for each block B ∈ Π do {
            Partition B into sub-blocks B1, B2, . . . , Bk such that two states s and t
            are in the same sub-block iff ∀a ∈ Σ states s and t
            have transitions on a to states in the same block of Π;
            Π ← (Π − {B}) ∪ {B1, B2, . . . , Bk}
        }
    }


DFA Minimization Algorithm (2)

    // Part 2: Build near-minimal DFA
    M′.Σ ← Σ; M′.S ← ∅; M′.F ← ∅; M′.φ ← ∅;
    for each block B ∈ Π do {          // Basically a block in Π becomes a state in M′
        Choose one state s in B to be the representative of that block;
        M′.S ← M′.S ∪ {s};
    }
    for each state s ∈ M′.S do {       // Construct the transition function for M′
        for each a ∈ Σ do {
            if φ(s,a) = t {
                M′.φ(s,a) ← t′ ∈ M′.S such that t′
                    is the representative state of the block in Π that contains t;
            }
        }
    }
    The start state of M′ is the representative state of the block in Π that contains
        the start state of M;
    for each state s ∈ M′.S do {       // Assign final states
        if s ∈ F { M′.F ← M′.F ∪ {s}; }
    }


DFA Minimization Algorithm (3)

    // Part 3: Remove superfluous states
    if M′.S contains a dead state d {  // Remove any dead states
        M′.S ← M′.S − {d};
        for all s ∈ M′.S do {
            if ∃a ∈ Σ such that M′.φ(s,a) = d {
                M′.φ(s,a) ← undefined;
            }
        }
    }
    for all s ∈ M′.S do {              // Prune unreachable states
        if s is unreachable from the start state in M′ {
            M′.S ← M′.S − {s};
        }
    }
    return M′;                         // The minimized DFA
}
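Parts 1 and 2 of the algorithm can be sketched in Python and run on the five-state DFA from the DFA Minimization slide. The function and variable names are illustrative, the transition function is assumed to be total (a dead state already added if needed), and each refinement pass compares states against the partition from the previous pass, a slight variation on updating Π inside the loop that reaches the same final partition.

```python
def block_index(partition, s):
    """Index of the block that contains state s."""
    for i, block in enumerate(partition):
        if s in block:
            return i
    raise ValueError(s)

def refine(states, alphabet, delta, finals):
    """Split {F, S-F} until no block can be split further; return the blocks."""
    partition = [b for b in (set(finals), set(states) - set(finals)) if b]
    while True:
        new_partition = []
        for block in partition:
            groups = {}
            for s in sorted(block):
                # Two states stay together iff, for every symbol, they move
                # into the same block of the current partition.
                signature = tuple(block_index(partition, delta[(s, a)]) for a in alphabet)
                groups.setdefault(signature, set()).add(s)
            new_partition.extend(groups.values())
        if len(new_partition) == len(partition):   # nothing was split
            return new_partition
        partition = new_partition

# The five-state example DFA (final states have Output 1 in the table).
DELTA = {(0, "a"): 2, (0, "b"): 1, (1, "a"): 2, (1, "b"): 0, (2, "a"): 4,
         (2, "b"): 3, (3, "a"): 2, (3, "b"): 3, (4, "a"): 0, (4, "b"): 1}
FINALS = {0, 1, 3}

blocks = refine(range(5), "ab", DELTA, FINALS)
rep = {s: min(block) for block in blocks for s in block}   # representative per block
min_delta = {(rep[s], a): rep[t] for (s, a), t in DELTA.items()}
print(sorted(sorted(b) for b in blocks))
print(sorted(min_delta.items()))
```

Choosing the smallest state number as each block's representative mirrors the 0′, 2′, 4′ naming used in the worked example.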


Minimization Example

Current State   a   b   Output
0               2   1   1
1               2   0   1
2               4   3   0
3               2   3   1
4               0   1   0

[Figure: transition graph of this DFA; a transitions are in red, b transitions are in blue]

Π1 = {{2,4},{0,1,3}}

Π2 = {{2},{4},{0,1,3}}

Π3 = {{2},{4},{0,1,3}}

Π2 = Π3, so partitioning stops


Minimal DFA

Π1 = {{2,4},{0,1,3}}

Π2 = Π3 = {{2},{4},{0,1,3}}

Current State   a    b    Output
0′              2′   0′   1
2′              4′   0′   0
4′              0′   0′   0

[Figure: transition graph of the minimized DFA; a transitions are in red, b transitions are in blue]

• {0,1,3} ⇒ state 0′ in M′

• {2} ⇒ state 2′ in M′

• {4} ⇒ state 4′ in M′


FAs and Regular Expressions

If L⊆ Σ∗ is a language, the following four statements are equivalent:

1. L is a regular language

2. L can be represented by a regular expression

3. L is accepted by some NFA

4. L is accepted by some DFA


Lex/Flex Program

• Used to generate scanners

• Developed by Lesk and Schmidt, AT&T Bell Labs

• Originally for C under Unix, but other platforms are supported

• GNU Flex is the modern version that we will use

We’ll just call it Lex, though


Lex Specification

%{
C/C++ Declarations
%}
Lex Definitions
%%
Rules
%%
Programmer functions


Lex Specification (2)

%{
C/C++ Declarations
%}
Lex Definitions
%%
Rules
%%
Programmer functions

1. C/C++ macros and declarations are placed in the C/C++ declarations section

2. Lex definitions are placed in the Lex definitions section

3. Actions to perform when patterns are matched are placed in the rules section

4. Arbitrary C/C++ code is placed in the programmer functions section


Lex Rules

• Consist of a (pattern, action) pair

• Pattern: any regular expression

• Action: any valid C/C++ code

• The regular expression must be expressed in ASCII

• The action may reference the special Lex identifiers (yytext, yylineno, etc.; see next slide)


Special Lex Identifiers

• yylex() the scanner function

• yytext the current lexeme (token) being scanned

• yyleng the number of characters in the current lexeme

• yylineno the current line number


Lex Regular Expression Syntax

Pattern       Meaning
a             Character “a”
"a" or \a     Character “a”, even if it is a special Lex operator (like +)
.             Any character except “\n”
[ab]          Character a or character b
[a-b]         Characters in the range a . . . b
[^a]          Any character except a
^a            a at the beginning of the line
a$            a at the end of the line
a+            One or more as
a*            Zero or more as
a?            Optionally a
a|b           a or b
(a)           a (a possibly complex expression) treated as a unit
a/b           a, but only if followed by b
{def}         Translation of def from the Lex definitions section


Example Lex Specification

/* Simple Lex specification */
%{ /* C declarations and macros section */
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
%}

/* Lex definitions section */
WHITESPACE  [ \t]
digit       [0-9]
letter      [a-z]

%% /* Rules section */
{digit}+                      { printf("[number=%s]", yytext); }
{letter}({letter}|{digit})*   { printf("[identifier=%s]", yytext); }
{WHITESPACE}                  {} /* Ignore whitespace */
\$                            { exit(0); }
.                             { printf("[unknown=%s]", yytext); }
%%

/* C definitions and declarations */
/* None given */


Lex Specification to Scanner

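[Figure: prog.l (declarations, %%, transition rules, %%, C procedures) is translated by Lex into lex.yy.c, which contains yylex() and the C procedures such as main(); internally the regular expressions are converted to an NFA and then to a DFA, whose state table drives yylex()]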


Build Process

[Figure: prog.l (declarations, %%, transition rules, %%, C procedures) is processed by lex to produce lex.yy.c (containing yylex() and main()), which gcc compiles into the executable prog]

lex prog.l
gcc -o prog lex.yy.c


Limitations of Regular Languages

• Build a DFA to recognize

L = L(0∗1∗)

• Build a DFA to recognize

L = {0^n 1^n | n ∈ N}

• Not all languages are regular

• See the Pumping Lemma
