Lexical Analyzer


DESCRIPTION

This presentation covers the lexical analyzer, one of the phases in the structure of a compiler (the phases of a compiler).

TRANSCRIPT

Page 1: Lexical analyzer
Page 2: Lexical analyzer

Welcome

ASSALAM-O-ALAIKUM!!

Page 3: Lexical analyzer

Topics:

Lexical Analyzer
Specification of Tokens
Recognition of Tokens
Data Structures Involved in Lexical Analysis

Page 4: Lexical analyzer

Lexical Analysis

It involves:

The Role of the Lexical Analyzer
Tokens, Patterns, and Lexemes
Attributes for Tokens

Page 5: Lexical analyzer

Role of Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program.

Page 6: Lexical analyzer

The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

Page 7: Lexical analyzer

The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
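As a rough illustration of this hand-off (a minimal sketch; the names Token, TokenName, and the hard-coded input are ours, not from the slides), the lexer can expose exactly one function that the parser calls repeatedly:

#include <cctype>
#include <iostream>
#include <string>

enum class TokenName { ID, NUMBER, END_OF_INPUT };

struct Token {
    TokenName   name;   // abstract symbol that the parser processes
    std::string lexeme; // attribute value: here, simply the matched text
};

std::string input = "count 42";  // toy source program
size_t pos = 0;                  // current position in the input

// Reads characters until the next lexeme is identified, then returns its token.
Token getNextToken() {
    while (pos < input.size() && std::isspace((unsigned char)input[pos])) ++pos;
    if (pos == input.size()) return {TokenName::END_OF_INPUT, ""};
    size_t start = pos;
    if (std::isdigit((unsigned char)input[pos])) {
        while (pos < input.size() && std::isdigit((unsigned char)input[pos])) ++pos;
        return {TokenName::NUMBER, input.substr(start, pos - start)};
    }
    while (pos < input.size() && std::isalnum((unsigned char)input[pos])) ++pos;
    return {TokenName::ID, input.substr(start, pos - start)};
}

int main() {
    // The parser would sit in a loop like this, consuming one token at a time.
    for (Token t = getNextToken(); t.name != TokenName::END_OF_INPUT; t = getNextToken())
        std::cout << t.lexeme << '\n';
}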

Page 8: Lexical analyzer

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

Page 9: Lexical analyzer

When discussing lexical analysis, we use three related but distinct terms:

Token: A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.

Page 10: Lexical analyzer

Patterns

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

Page 11: Lexical analyzer

In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.

2. One token representing all identifiers.

3. One or more tokens representing constants, such as numbers and literal strings.

4. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

Page 12: Lexical analyzer

Lexemes:

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
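As a concrete illustration (the statement and this particular choice of token names are our own, in the spirit of the classic compiler-text example), the assignment

position = initial + rate * 60

would be grouped into the following lexemes and tokens:

position  →  <id, pointer to symbol-table entry for position>
=         →  <assign_op>
initial   →  <id, pointer to symbol-table entry for initial>
+         →  <add_op>
rate      →  <id, pointer to symbol-table entry for rate>
*         →  <mult_op>
60        →  <number, 60>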

Page 13: Lexical analyzer

SPECIFICATION OF TOKENS

Strings and Languages
Operations on Languages
Regular Expressions
Regular Definitions
Extensions of Regular Expressions

Page 14: Lexical analyzer

Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns.

While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

Page 15: Lexical analyzer

We shall see how regular expressions are used in a lexical-analyzer generator, and how to build the lexical analyzer by converting regular expressions to automata that perform the recognition of the specified token patterns.

Page 16: Lexical analyzer

Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. Unicode, which includes approximately 100,000 characters from alphabets around the world, is another important example of an alphabet.

Page 17: Lexical analyzer

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms "sentence" and "word" are often used as synonyms for "string."

The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six.

The empty string, denoted ε, is the string of length zero.

A language is any countable set of strings over some fixed alphabet.

Page 18: Lexical analyzer

Terms for Parts of Strings

A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana.

A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana.

Page 19: Lexical analyzer

A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of banana.

The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.

A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.

Page 20: Lexical analyzer

If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse. The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.

This definition of a language is very broad. Abstract languages like Ø, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.

Page 21: Lexical analyzer

Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally. Union is the familiar operation on sets. The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.

Page 22: Lexical analyzer

The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more times. Note that L⁰, the "concatenation of L zero times," is defined to be {ε}, and inductively, Lⁱ is Lⁱ⁻¹L.

Finally, the positive closure, denoted L⁺, is the same as the Kleene closure, but without the term L⁰.

That is, ε will not be in L⁺ unless it is in L itself.
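For example (a small illustration of these definitions), if L = {a, b}:

L⁰ = {ε}
L¹ = L = {a, b}
L² = L¹L = {aa, ab, ba, bb}
L* = L⁰ ∪ L¹ ∪ L² ∪ … = {ε, a, b, aa, ab, ba, bb, aaa, …}
L⁺ = L¹ ∪ L² ∪ … = L* − {ε}, since ε is not in L itself.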

Page 23: Lexical analyzer

Regular Language

The set of regular languages over an alphabet Σ is defined recursively as below. Any language belonging to this set is a regular language over Σ.

Page 24: Lexical analyzer

Definition of the Set of Regular Languages:

Basis Clause: Ø, {ʌ}, and {a} for any symbol a ∈ Σ are regular languages.

Inductive Clause: If Lr and Ls are regular languages, then Lr ∪ Ls, LrLs, and Lr* are regular languages.

Extremal Clause: Nothing is a regular language unless it is obtained from the above two clauses.

Page 25: Lexical analyzer

Regular Expressions

Regular expressions are used to denote regular languages. They can represent regular languages and operations on them succinctly. The set of regular expressions over an alphabet Σ is defined recursively as below. Any element of that set is a regular expression.

Basis Clause: Ø, ʌ, and a are regular expressions corresponding to the languages Ø, {ʌ}, and {a}, respectively, where a is an element of Σ.

Page 26: Lexical analyzer

Inductive Clause: If r and s are regular expressions corresponding to languages Lr and Ls, then (r + s), (rs), and (r*) are regular expressions corresponding to the languages Lr ∪ Ls, LrLs, and Lr*, respectively.

Extremal Clause: Nothing is a regular expression unless it is obtained from the above two clauses.

Page 27: Lexical analyzer

Rules

ε is a regular expression that denotes {ε}, the set containing the empty string.

If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.

Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:

(r)|(s) is a regular expression denoting L(r) ∪ L(s).
(r)(s) is a regular expression denoting L(r)L(s).
(r)* is a regular expression denoting (L(r))*.
(r) is a regular expression denoting L(r).

Page 28: Lexical analyzer

Examples of Regular Expressions

Example: Let Σ = {a, b}.

The regular expression a|b denotes the set {a, b}.

The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two. Another regular expression for this same set is aa|ab|ba|bb.

The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ε, a, aa, aaa, aaaa, …}.

Page 29: Lexical analyzer

Precedence Conventions

The unary operator * has the highest precedence and is left associative.

Concatenation has the second highest precedence and is left associative.

| has the lowest precedence and is left associative.

Under these conventions, for example, we may replace the regular expression (a)|((b)*(c)) by a|b*c.

Page 30: Lexical analyzer

The regular expression (a|b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's. Another regular expression for this set is (a*b*)*.

The regular expression a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.

Page 31: Lexical analyzer

Properties of Regular Expressions

Page 32: Lexical analyzer

where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, di−1}.

Note that each di can depend on all the previous d's.

Note also that each di cannot depend on following d's. This is an important difference between regular definitions and productions.

Page 33: Lexical analyzer

Regular Definitions

These look like the productions of a context-free grammar we saw previously, but there are differences. Let Σ be an alphabet; then a regular definition is a sequence of definitions

d1 → r1

d2 → r2

...

dn → rn

Page 34: Lexical analyzer

Example: C identifiers can be described by the following regular definition

letter_ → A | B | ... | Z | a | b | ... | z | _

digit → 0 | 1 | ... | 9

CId → letter_ ( letter_ | digit)*
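The same definition can be checked mechanically. As a minimal sketch (using C++'s standard <regex> library; the test strings are our own), the pattern letter_ ( letter_ | digit )* becomes:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // [A-Za-z_] plays the role of letter_, [A-Za-z_0-9] of (letter_ | digit)
    std::regex cid("[A-Za-z_][A-Za-z_0-9]*");
    for (std::string s : {"rate", "_tmp1", "9lives"})
        std::cout << s << ": " << (std::regex_match(s, cid) ? "CId" : "not a CId") << '\n';
}

Here 9lives is rejected because an identifier may not begin with a digit.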

Example: Unsigned numbers

digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → . digits | ϵ
optional_exponent → ( E ( + | - | ϵ ) digits ) | ϵ
num → digits optional_fraction optional_exponent

Page 35: Lexical analyzer

Extensions of Regular Expressions

Many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers.

One or more instances. The unary, postfix operator + represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then (r)+ denotes the language (L(r))+. The operator + has the same precedence and associativity as the operator *. Two useful algebraic laws, r* = r+ | ϵ and r+ = rr* = r*r, relate the Kleene closure and positive closure.

Page 36: Lexical analyzer

Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r|ϵ, or put another way, L(r?) = L(r) ∪ {ϵ}. The ? operator has the same precedence and associativity as * and +.

Example

digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent

Page 37: Lexical analyzer

Character classes. A regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an]. More importantly, when a1, a2, ..., an form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by [a1-an], that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|···|z.

Example

[A-Za-z][A-Za-z0-9]*

Page 38: Lexical analyzer

RECOGNITION OF TOKENS

Page 39: Lexical analyzer

Lexical analyzer

Also called a scanner or tokenizer. Converts a stream of characters into a stream of tokens. Tokens include:

Keywords such as for, while, and class.
Special characters such as +, -, (, and <.
Variable name occurrences.
Constant occurrences such as 1, 0, true.

Page 40: Lexical analyzer

Comparison with Lexical Analysis

Phase   Input                   Output
Lexer   Sequence of characters  Sequence of tokens
Parser  Sequence of tokens      Parse tree

Page 41: Lexical analyzer

The role of the lexical analyzer

[Diagram: source program → Lexical Analyzer ⇄ Parser (the parser issues getNextToken; the lexical analyzer returns a token) → to semantic analysis. Both the Lexical Analyzer and the Parser consult the symbol table.]

Page 42: Lexical analyzer

What are Tokens For?

The parser relies on token classification, e.g., how should reserved keywords be handled? As identifiers, or as a separate keyword token for each?

The output of the lexer is a stream of tokens, which is the input to the parser.

The lexer usually discards "uninteresting" tokens that do not contribute to parsing. Examples: whitespace, comments.

Page 43: Lexical analyzer

Recognition of Tokens

The task of recognizing tokens in a lexical analyzer:

Isolate the lexeme for the next token in the input buffer.

Produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table given in the figure on the next page.

Page 44: Lexical analyzer

Recognition of Tokens

Regular expression   Token   Attribute-value
if                   if      -
id                   id      Pointer to table entry
<                    relop   LT

Page 45: Lexical analyzer

Recognition of tokens

The starting point is the language grammar, to understand the tokens:

stmt -> if expr then stmt

| if expr then stmt else stmt

| Ɛ

expr -> term relop term

| term

term -> id

| number

Page 46: Lexical analyzer

Recognition of tokens (cont.)

The next step is to formalize the patterns:

digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>

We also need to handle whitespace:

ws -> (blank | tab | newline)+

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as used by the lexical analyzer.

The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws.
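Putting these patterns together, a sketch of a scanner loop (our own illustration using C++'s <regex>; a generated lexer such as Lex compiles the patterns into automata instead) could look like:

#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Pattern/token-name pairs, tried in order at the current position.
    std::vector<std::pair<std::regex, std::string>> spec = {
        {std::regex(R"([ \t\n]+)"),                         "ws"},      // stripped, not returned
        {std::regex(R"([0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?)"), "number"},
        {std::regex(R"([A-Za-z_][A-Za-z_0-9]*)"),           "id"},
        {std::regex(R"(<=|>=|<>|<|>|=)"),                   "relop"},   // longest alternatives first
    };
    std::string rest = "if x1 <= 4.2E1 then y = 3";
    while (!rest.empty()) {
        std::smatch m;
        bool matched = false;
        for (auto& [re, name] : spec) {
            // match_continuous anchors the match at the start of 'rest'
            if (std::regex_search(rest, m, re, std::regex_constants::match_continuous)) {
                std::string lexeme = m.str();
                if (name == "id" && (lexeme == "if" || lexeme == "then" || lexeme == "else"))
                    std::cout << '<' << lexeme << ">\n";                 // reserved word
                else if (name != "ws")
                    std::cout << '<' << name << ", " << lexeme << ">\n"; // <token, attribute>
                rest = m.suffix();
                matched = true;
                break;
            }
        }
        if (!matched) { std::cerr << "lexical error\n"; break; }
    }
}

On the sample input this prints <if>, <id, x1>, <relop, <=>, <number, 4.2E1>, <then>, <id, y>, <relop, =>, <number, 3>; ws matches are recognized but discarded, exactly as described above.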

Page 47: Lexical analyzer

Compiler Construction

Tokens, their patterns, and attribute values

Page 48: Lexical analyzer

LEXICAL ANALYSIS: Recognition of Tokens

Method to recognize the tokens: use transition diagrams.

Page 49: Lexical analyzer

Diagrams for Tokens

Transition Diagrams (TD) are used to represent the tokens. Each transition diagram has:

States: represented by circles.

Actions: represented by arrows between states.

Start state: the beginning of a pattern (arrowhead).

Final state(s): the end of a pattern (concentric circles).

Deterministic: no need to choose between two different actions.

Page 50: Lexical analyzer

Recognition of Tokens: Transition Diagram (stylized flowchart)

Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.

[Diagram for > and >=: from start state 0, on '>' go to state 6; from state 6, on '=' go to accepting state 7 and return(relop, GE); on any other character go to accepting state 8, marked *, and return(relop, GT).]

Note: Here we use '*' to indicate states on which input retraction must take place.

Page 51: Lexical analyzer

If it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.

Page 52: Lexical analyzer

Example: RELOP = < | <= | = | <> | > | >=. We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=.

Page 53: Lexical analyzer

Compiler Construction

Recognition of Identifiers

Example 2: ID = letter (letter | digit)*

[Transition diagram: from start state 9, on a letter go to state 10; state 10 loops on letter or digit; on any other character go to accepting state 11, marked #, and return(id). # indicates input retraction.]

Page 54: Lexical analyzer

Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in the figure. When we find an identifier, a call to installID() places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id.
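A minimal sketch of this scheme (the table layout and the names installID and symtable are illustrative, not the slides' actual code): the reserved words are installed first, so a lookup distinguishes them from ordinary identifiers.

#include <iostream>
#include <string>
#include <unordered_map>

enum TokenName { IF, THEN, ELSE, ID };

// Reserved words go in first; the stored token marks them as never ordinary identifiers.
std::unordered_map<std::string, TokenName> symtable = {
    {"if", IF}, {"then", THEN}, {"else", ELSE}
};

// Places the lexeme in the table if it is not already there (as token ID)
// and returns the token recorded for it.
TokenName installID(const std::string& lexeme) {
    auto it = symtable.emplace(lexeme, ID).first;  // no-op for existing entries
    return it->second;
}

int main() {
    std::cout << installID("if")   << '\n'   // 0 (IF): a reserved word, not an ordinary id
              << installID("rate") << '\n';  // 3 (ID): newly installed identifier
}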

Page 55: Lexical analyzer

COMPLETION OF THE RUNNING EXAMPLE

Page 56: Lexical analyzer

So far we have seen the transition diagrams for identifiers and the relational operators.

What remains are: whitespace and numbers.

Page 57: Lexical analyzer

Recognizing whitespace: We also want the lexer to remove whitespace, so we define a new token:

ws → ( blank | tab | newline )+

where blank, tab, and newline are symbols used to represent the corresponding ASCII characters.

Page 58: Lexical analyzer

Transition diagram for white space:

Page 59: Lexical analyzer

The “delim” in the diagram represents any of the whitespace characters, say space, tab, and newline.

The final star is there because we needed to find a non-whitespace character in order to know when the whitespace ends and this character begins the next token.

There is no action performed at the accepting state. Indeed, the lexer does not return to the parser, but starts again from its beginning, as it still must find the next token.

Page 60: Lexical analyzer

Transition Diagram for Numbers:

[Diagram panels: accepting an integer, e.g. 12; accepting a float, e.g. 12.31; accepting a float with an exponent, e.g. 12.31E4.]

Page 61: Lexical analyzer

Explanation with example: Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit or a dot, we have seen a number in the form of an integer; 12 is an example.

Page 62: Lexical analyzer

If we instead see a dot in state 13, then we have an "optional fraction."

State 14 is entered, and we look for one or more additional digits; state 15 is used for that purpose.

Page 63: Lexical analyzer

If we see an E, then we have an "optional exponent," whose recognition is the job of states 16 through 19.

Page 64: Lexical analyzer

Architecture of a Transition-Diagram-Based Lexical Analyzer

Each state is represented by a piece of code. We may imagine a variable state holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.

Page 65: Lexical analyzer

Transition diagram for relop:

Page 66: Lexical analyzer

Sketch of implementation of relop transition diagram

Page 67: Lexical analyzer

getRelop() is a C++ function whose job is to simulate the transition diagram and return an object of type TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an attribute value.

getRelop() first creates a new object retToken and initializes its first component to RELOP, the symbolic code for token relop.

Page 68: Lexical analyzer

We see the typical behavior of a state in case 0, the case where the current state is 0. A function nextChar() obtains the next character from the input and assigns it to local variable c. We then check c for the three characters we expect to find, making the state transition dictated by the transition diagram.

For example, if the next input character is =, we go to state 5.

Page 69: Lexical analyzer

If the next input character is not one that can begin a comparison operator, then a function fail() is called; it should reset the forward pointer to lexemeBegin.

Page 70: Lexical analyzer

Because state 8 bears a *, we must retract the input pointer one position (i.e., put c back on the input stream). That task is accomplished by the function retract(). Since state 8 represents the recognition of lexeme >, we set the second component of the returned object, which we suppose is named attribute, to GT, the code for this operator.
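Collecting the pieces just described, here is a hedged sketch of getRelop() (the helpers nextChar(), retract(), and fail(), the toy input buffer, and the collapsed state numbering for '<' are our reconstruction, not the slides' verbatim code):

#include <stdexcept>
#include <string>

enum TokenName { RELOP };
enum Attribute { LT, LE, EQ, NE, GT, GE, NONE };
struct TOKEN { TokenName name; Attribute attribute; };

std::string input = ">=";   // toy input buffer
size_t forward = 0;         // forward pointer into the buffer

char nextChar() {           // obtains the next character from the input
    char c = forward < input.size() ? input[forward] : '\0';
    ++forward;
    return c;
}
void  retract() { --forward; }  // put c back on the input stream
TOKEN fail()    { throw std::runtime_error("not a relop"); }  // would reset forward to lexemeBegin

TOKEN getRelop() {
    TOKEN retToken{RELOP, NONE};   // first component is always RELOP here
    int state = 0;
    while (true) {                 // one switch arm per diagram state
        char c;
        switch (state) {
        case 0:
            c = nextChar();
            if      (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else return fail();    // c cannot begin a comparison operator
            break;
        case 1:                    // seen '<': may be <, <=, or <>
            c = nextChar();
            if      (c == '=') { retToken.attribute = LE; return retToken; }
            else if (c == '>') { retToken.attribute = NE; return retToken; }
            else    { retract(); retToken.attribute = LT; return retToken; }  // *-state
        case 5:                    // seen '='
            retToken.attribute = EQ; return retToken;
        case 6:                    // seen '>': may be > or >=
            c = nextChar();
            if (c == '=') { retToken.attribute = GE; return retToken; }
            state = 8; break;
        case 8:                    // *-state: lexeme is '>', retract one position
            retract();
            retToken.attribute = GT;
            return retToken;
        }
    }
}

int main() { return getRelop().attribute == GE ? 0 : 1; }  // ">=" yields GE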

Page 71: Lexical analyzer

DATA STRUCTURES INVOLVED IN LEXICAL ANALYSIS

Page 72: Lexical analyzer

DEFINITION

“The symbol table is a data structure used by all phases of the compiler to keep track of user-defined symbols and keywords.”

Page 73: Lexical analyzer

In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source, such as its type, scope level and sometimes its location.

Page 74: Lexical analyzer

During early phases (lexical and syntax analysis) symbols are discovered and put into the symbol table.

During later phases symbols are looked up to validate their usage.

Page 75: Lexical analyzer

Uses

An object file will contain a symbol table of the identifiers it contains that are externally visible. During the linking of different object files, a linker will use these symbol tables to resolve any unresolved references.

A symbol table may only exist during the translation process, or it may be embedded in the output of that process for later exploitation, for example, during an interactive debugging session, or as a resource for formatting a diagnostic report during or after execution of a program.

Page 76: Lexical analyzer

While reverse engineering an executable, many tools refer to the symbol table to check what addresses have been assigned to global variables and functions. If the symbol table has been stripped or cleaned out before being converted into an executable, tools will find it harder to determine addresses or understand anything about the program.

When accessing variables and allocating memory dynamically, a compiler must perform many tasks; as such, the extended stack model requires the symbol table.

Page 77: Lexical analyzer

Typical symbol table activities:

Add a new name.
Add information for a name.
Access information for a name.
Determine if a name is present in the table.
Remove a name.
Revert to a previous usage for a name (close a scope).

Page 78: Lexical analyzer

Many possible implementations:

Linear list
Sorted list
Hash table
Tree structure
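As one concrete possibility from this list (a sketch with illustrative names; real compilers store much more per entry), a hash table keyed on the name can be combined with a scope stack, so that closing a scope reverts a name to its previous usage:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Info { std::string kind; int level; };  // a couple of typical information fields

class SymbolTable {
    std::vector<std::unordered_map<std::string, Info>> scopes;  // one hash table per open scope
public:
    SymbolTable() { openScope(); }                // the global scope
    void openScope()  { scopes.emplace_back(); }
    void closeScope() { scopes.pop_back(); }      // reverts shadowed names automatically
    void add(const std::string& name, Info info) { scopes.back()[name] = std::move(info); }
    const Info* lookup(const std::string& name) const {
        for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {  // innermost scope first
            auto hit = it->find(name);
            if (hit != it->end()) return &hit->second;
        }
        return nullptr;  // name not present in any open scope
    }
};

int main() {
    SymbolTable st;
    st.add("x", {"var_id", 0});
    st.openScope();
    st.add("x", {"func_id", 1});                 // shadows the outer x
    std::cout << st.lookup("x")->kind << '\n';   // func_id
    st.closeScope();
    std::cout << st.lookup("x")->kind << '\n';   // var_id again: previous usage restored
}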

Page 79: Lexical analyzer

Typical information fields:

Print value
Kind (e.g. reserved, type_id, var_id, func_id, etc.)
Block number/level number
Type
Initial value
Base address
etc.

Page 80: Lexical analyzer