Lexical Analysis

Upload: kailash-borse

Post on 05-Apr-2018


  • 8/2/2019 Lexical Analysis1


    Lexical Analysis

    Recognize tokens and ignore white spaces, comments

    Error reporting

    Model using regular expressions

    Recognize using Finite State Automata

    Generate a token stream


    Lexical Analysis

    Sentences consist of a string of tokens (a syntactic category),
    for example: number, identifier, keyword, string

    A sequence of characters in a token is a lexeme,
    for example: 100.01, counter, const, How are you?

    The rule of description is a pattern,
    for example: letter(letter|digit)*

    Discard whatever does not contribute to parsing, like white
    space (blanks, tabs, newlines) and comments

    Construct constants: convert numbers to token num and pass the
    number as its attribute, for example integer 31 becomes <num, 31>

    Recognize keywords and identifiers, for example
    counter = counter + increment becomes id = id + id
    /*check if id is a keyword*/


    Interface to other phases

    Push back is required due to lookahead, for example >= and >

    It is implemented through a buffer: keep the input in a buffer
    and move pointers over the input

    [Diagram: the Syntax Analyzer asks the Lexical Analyzer for a
    token; the Lexical Analyzer reads characters from the input,
    pushes back extra characters, and returns the token]
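As a minimal sketch of the pushback idea, the following C fragment distinguishes ">=" from ">" using a one-character lookahead; the helper names (nextc, pushback, scan_gt) are illustrative, and a string stands in for the input buffer:

```c
#include <stdio.h>

/* Sketch: one-character lookahead with pushback over a
 * string acting as the input buffer. */
static const char *inp;   /* input buffer    */
static int pos;           /* forward pointer */

static int nextc(void) {
    return inp[pos] ? inp[pos++] : EOF;
}

static void pushback(void) {   /* move the pointer back one char */
    pos--;
}

/* Distinguishes ">=" from ">": returns 1 for ">=", 2 for ">"
 * (after pushing back the extra character), 0 otherwise. */
int scan_gt(const char *s) {
    int c;
    inp = s; pos = 0;
    if (nextc() != '>') return 0;
    c = nextc();
    if (c == '=') return 1;        /* lexeme ">=" */
    if (c != EOF) pushback();      /* extra character goes back */
    return 2;                      /* lexeme ">"  */
}
```

Real analyzers use a two-pointer buffer rather than a bare string, but the retract-one-character step is the same.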


    Approaches to implementation

    Use assembly language: most efficient but most difficult to
    implement

    Use high level languages like C: efficient but difficult to
    implement

    Use tools like lex, flex: easy to implement but not as efficient
    as the first two cases


    Construct a lexical analyzer

    Allow white spaces, numbers and arithmetic operators in an
    expression

    Return tokens and attributes to the syntax analyzer

    A global variable tokenval is set to the value of the number

    Design requires that a finite set of tokens be defined and that
    the strings belonging to each token be described


    #include <stdio.h>
    #include <ctype.h>

    int lineno = 1;
    int tokenval = NONE;

    int lex() {
        int t;
        while (1) {
            t = getchar();
            if (t == ' ' || t == '\t')
                ;                       /* skip blanks and tabs */
            else if (t == '\n')
                lineno = lineno + 1;
            else if (isdigit(t)) {
                tokenval = t - '0';
                t = getchar();
                while (isdigit(t)) {
                    tokenval = tokenval * 10 + t - '0';
                    t = getchar();
                }
                ungetc(t, stdin);
                return num;     /* num and NONE are token codes
                                   assumed to be defined elsewhere */
            }
            else { tokenval = NONE; return t; }
        }
    }


    Problems

    Scans text character by character

    The lookahead character determines what kind of token to read
    and when the current token ends

    The first character cannot determine what kind of token we are
    going to read


    Symbol Table

    Stores information for subsequent phases

    Interface to the symbol table
      Insert(s,t): save lexeme s and token t and return a pointer
      Lookup(s): return the index of the entry for lexeme s, or 0
      if s is not found

    Implementation of the symbol table
      Fixed amount of space to store lexemes: not advisable as it
      wastes space
      Store lexemes in a separate array, each lexeme separated by
      eos; the symbol table has pointers to the lexemes
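The second scheme (a separate lexeme array with eos separators) might look like this in C. The array sizes and names are assumptions, entry 0 is reserved so lookup can return 0 for "not found", and insert returns an index rather than a raw pointer:

```c
#include <string.h>

#define MAXSYMS 100
#define MAXLEX  1000

/* Sketch: lexemes packed into one char array, separated by
 * '\0' (the "eos"); each table entry points into that array. */
static char lexemes[MAXLEX];
static int  lexused = 0;

static struct { char *lexptr; int token; } symtable[MAXSYMS];
static int nsyms = 1;            /* slot 0 reserved: "not found" */

int lookup(const char *s) {
    for (int i = 1; i < nsyms; i++)
        if (strcmp(symtable[i].lexptr, s) == 0)
            return i;
    return 0;
}

int insert(const char *s, int tok) {
    symtable[nsyms].lexptr = &lexemes[lexused];
    symtable[nsyms].token  = tok;
    strcpy(&lexemes[lexused], s);
    lexused += strlen(s) + 1;    /* keep the eos separator */
    return nsyms++;
}
```

A production table would also check for overflow and use hashing instead of linear search.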


    [Diagram: in the fixed-space scheme each entry holds the lexeme
    (usually 32 bytes) plus other attributes; in the second scheme
    lexeme1, lexeme2, lexeme3 sit in a separate array separated by
    eos, and each entry holds a pointer (usually 4 bytes) plus
    other attributes]


    How to handle keywords?

    Consider tokens DIV and MOD with lexemes div and mod

    Initialize the symbol table with insert("div", DIV) and
    insert("mod", MOD)

    Any subsequent lookup returns a nonzero value; therefore, div
    and mod cannot be used as identifiers
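A minimal sketch of this keyword check in C; the token codes DIV, MOD, ID and the tiny fixed table are assumptions standing in for the pre-loaded symbol table:

```c
#include <string.h>

#define DIV 257
#define MOD 258
#define ID  259

/* Keywords pre-loaded, as the slide's insert() calls would do. */
static struct { const char *lexeme; int token; } words[] = {
    { "div", DIV }, { "mod", MOD }, { 0, 0 }
};

/* Returns the keyword's token if the lexeme is a keyword,
 * otherwise the generic identifier token. */
int classify(const char *s) {
    for (int i = 0; words[i].lexeme; i++)
        if (strcmp(words[i].lexeme, s) == 0)
            return words[i].token;
    return ID;
}
```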


    Difficulties in design of lexical analyzers

    Is it as simple as it sounds?

    Lexemes in a fixed position: fixed format vs. free format
    languages

    Handling of blanks
      In Pascal blanks separate identifiers
      In Fortran blanks are important only in literal strings, for
      example the variable counter is the same as count er

    Another example
      DO 10 I = 1.25    DO10I=1.25
      DO 10 I = 1,25    DO10I=1,25


    The first line is the variable assignment DO10I=1.25

    The second line is the beginning of a Do loop

    Reading from left to right, one cannot distinguish between the
    two until the , or . is reached

    Fortran's white space and fixed format rules came into force
    due to punch cards and errors in punching


    PL/1 Problems

    Keywords are not reserved in PL/1
      if then then then = else; else else = then
      if if then then = then + 1

    PL/1 declarations
      Declare(arg1, arg2, arg3, ..., argn)

    Cannot tell whether Declare is a keyword or an array reference
    until after the )

    Requires arbitrary lookahead and very large buffers. Worse, the
    buffers may have to be reloaded.


    Problem continues even today!!

    C++ template syntax: Foo<Bar>

    C++ stream syntax: cin >> var;

    Nested templates: Foo<Bar<Bazz>>

    Can these problems be resolved by lexical analyzers alone?


    How to specify tokens?

    How to describe tokens
      2.e0   20.e-01   2.000

    How to break text into tokens
      if (x==0) a = x << 1;
      if (x==0) a = x < 1;


    How to describe tokens?

    Programming language tokens can be described by regular
    languages

    Regular languages
      Are easy to understand
      Have a well understood and useful theory
      Have efficient implementations

    Regular languages have been discussed in great detail in the
    Theory of Computation course


    Operations on languages

    L U M = {s | s is in L or s is in M}

    LM = {st | s is in L and t is in M}

    L* = union of Li such that 0 ≤ i < ∞,
    where L0 = {ε} and Li = Li-1 L


    Example

    Let L = {a, b, ..., z} and D = {0, 1, 2, ..., 9} then

    L U D is the set of letters and digits

    LD is the set of strings consisting of a letter followed by a
    digit

    L* is the set of all strings of letters, including ε

    L(L U D)* is the set of all strings of letters and digits
    beginning with a letter

    D+ is the set of strings of one or more digits
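As a sketch, membership in L(L U D)* — a letter followed by any string of letters and digits, i.e. the identifier language — can be tested directly in C (the function name is illustrative):

```c
#include <ctype.h>

/* Sketch: is s in L(L U D)*, where L = letters and D = digits?
 * The string must start with a letter; the rest may be letters
 * or digits. The empty string is rejected because L(L U D)*
 * always begins with a letter. */
int in_identifier_lang(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;
    return 1;
}
```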


    Notation

    Let Σ be a set of characters. A language over Σ is a set of
    strings of characters belonging to Σ

    A regular expression r denotes a language L(r)

    Rules that define the regular expressions over Σ
      ε is a regular expression that denotes {ε}, the set containing
      the empty string
      If a is a symbol in Σ then a is a regular expression that
      denotes {a}


    If r and s are regular expressions denoting the languages L(r)
    and L(s) then

      (r)|(s) is a regular expression denoting L(r) U L(s)

      (r)(s) is a regular expression denoting L(r)L(s)

      (r)* is a regular expression denoting (L(r))*

      (r) is a regular expression denoting L(r)


    Let Σ = {a, b}

    The regular expression a|b denotes the set {a, b}

    The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}

    The regular expression a* denotes the set of all strings
    {ε, a, aa, aaa, ...}

    The regular expression (a|b)* denotes the set of all strings
    containing ε and all strings of a's and b's

    The regular expression a|a*b denotes the set containing the
    string a and all strings consisting of zero or more a's
    followed by a b


    Precedence and associativity

    *, concatenation, and | are left associative

    * has the highest precedence

    Concatenation has the second highest precedence

    | has the lowest precedence


    How to specify tokens: regular definitions

    Let ri be a regular expression and di be a distinct name

    A regular definition is a sequence of definitions of the form
      d1 → r1
      d2 → r2
      ...
      dn → rn

    where each ri is a regular expression over Σ U {d1, d2, ..., di-1}


    Examples

    My fax number: 91-(512)-259-7586

    Σ = digits U {-, (, )}

    country  → digit+      (here digit2: 91)
    area     → ( digit+ )  (digit3: 512)
    exchange → digit+      (digit3: 259)
    phone    → digit+      (digit4: 7586)

    number → country - area - exchange - phone
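The regular definition above can be turned into a hand-written checker. This C sketch hard-codes the digit counts from the example (2, 3, 3 and 4); the helper names are illustrative:

```c
#include <ctype.h>

/* Consume exactly n digits, advancing the cursor. */
static int ndigits(const char **p, int n) {
    while (n-- > 0)
        if (!isdigit((unsigned char)*(*p)++)) return 0;
    return 1;
}

/* Consume one literal character. */
static int lit(const char **p, char c) { return *(*p)++ == c; }

/* number -> country '-' '(' area ')' '-' exchange '-' phone */
int is_fax_number(const char *s) {
    return ndigits(&s, 2) && lit(&s, '-') && lit(&s, '(')
        && ndigits(&s, 3) && lit(&s, ')') && lit(&s, '-')
        && ndigits(&s, 3) && lit(&s, '-')
        && ndigits(&s, 4) && *s == '\0';
}
```

Each piece of the definition becomes one call, in the same order as the regular definition; the short-circuit && stops at the first mismatch.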


    Examples

    My email: [email protected]

    Σ = letter U {@, .}

    letter  → a | b | ... | z | A | B | ... | Z
    name    → letter+
    address → name @ name . name . name


    Examples

    Identifier
      letter     → a | b | ... | z | A | B | ... | Z
      digit      → 0 | 1 | ... | 9
      identifier → letter(letter|digit)*

    Unsigned number in Pascal
      digit    → 0 | 1 | ... | 9
      digits   → digit+
      fraction → . digits | ε
      exponent → (E (+ | - | ε) digits) | ε
      number   → digits fraction exponent
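The Pascal unsigned-number definition translates directly into a small hand-written recognizer, one clause per definition; a C sketch with illustrative function names:

```c
#include <ctype.h>

/* digits -> digit+ : consume a nonempty run of digits. */
static int span_digits(const char **p) {
    if (!isdigit((unsigned char)**p)) return 0;
    while (isdigit((unsigned char)**p)) (*p)++;
    return 1;
}

/* number -> digits fraction exponent, where
 *   fraction -> '.' digits | eps
 *   exponent -> E (+|-|eps) digits | eps  */
int is_unsigned_number(const char *s) {
    if (!span_digits(&s)) return 0;
    if (*s == '.') {                       /* fraction */
        s++;
        if (!span_digits(&s)) return 0;
    }
    if (*s == 'E') {                       /* exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!span_digits(&s)) return 0;
    }
    return *s == '\0';
}
```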


    Regular expressions in specifications

    Regular expressions describe many useful languages

    Regular expressions are only specifications; an implementation
    is still required

    Given a string s and a regular expression R, does s ∈ L(R)?

    The solution to this problem is the basis of lexical analyzers

    However, just the yes/no answer is not enough

    Goal: partition the input into tokens


    1. Write a regular expression for the lexemes of each token
         number     → digit+
         identifier → letter(letter|digit)*

    2. Construct R matching all lexemes of all tokens
         R = R1 + R2 + R3 + ...

    3. Let the input be x1...xn; for 1 ≤ i ≤ n check x1...xi ∈ L(R)

    4. x1...xi ∈ L(R) implies x1...xi ∈ L(Rj) for some j;
       the smallest such j is the token class of x1...xi

    5. Remove x1...xi from the input; go to step 3


    The algorithm gives priority to tokens listed earlier
      Treats if as a keyword and not an identifier

    How much input is used? What if both
      x1...xi ∈ L(R) and x1...xj ∈ L(R)?
      Pick the longest possible string in L(R):
      the principle of maximal munch

    Regular expressions provide a concise and useful notation for
    string patterns

    Good algorithms require a single pass over the input
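A sketch in C of longest match with priority over a toy token set (the token codes and the two patterns are assumptions): >= beats > because it is longer, and if is a keyword only when the whole maximal alphanumeric run is exactly "if", so "ifx" stays an identifier:

```c
#include <ctype.h>
#include <string.h>

enum { TK_IF = 1, TK_ID, TK_GE, TK_GT };

/* Returns the token at the start of s and stores the length of
 * the longest matching lexeme in *len; 0 if nothing matches. */
int next_token(const char *s, int *len) {
    if (s[0] == '>') {                       /* >= beats > */
        if (s[1] == '=') { *len = 2; return TK_GE; }
        *len = 1; return TK_GT;
    }
    if (isalpha((unsigned char)s[0])) {
        int n = 0;
        while (isalnum((unsigned char)s[n])) n++;   /* munch */
        *len = n;
        if (n == 2 && strncmp(s, "if", 2) == 0)
            return TK_IF;            /* keyword listed first */
        return TK_ID;                /* e.g. "ifx" is an id  */
    }
    return 0;
}
```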


    How to break up text

    elsex=0
      else x = 0    or    elsex = 0 ?

    Regular expressions alone are not enough

    Normally the longest match wins

    Ties are resolved by prioritizing tokens

    Lexical definitions consist of regular definitions, priority
    rules and maximal munch


    Finite Automata

    Regular expressions are declarative specifications; a finite
    automaton is the implementation

    A finite automaton consists of
      An input alphabet Σ
      A set of states S
      A set of transitions state_i --a--> state_j
      A set of final states F
      A start state n

    A transition s1 --a--> s2 is read:
      in state s1 on input a go to state s2

    If the end of input is reached in a final state then accept;
    otherwise, reject
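As an implementation sketch, here is a table-driven automaton in C for the single token num → digit+ (two live states plus a dead state; the table layout is an assumption, not from the slides):

```c
/* States: 0 = start, 1 = seen one or more digits (final).
 * Input classes: 0 = digit, 1 = anything else.
 * -1 is the dead state: once entered, the string is rejected. */
static const int delta[2][2] = {
    /* digit  other */
    {  1,    -1 },   /* state 0 (start) */
    {  1,    -1 },   /* state 1 (final) */
};

/* Run the automaton over the whole string; accept iff we end
 * in the final state. */
int accepts_digits(const char *s) {
    int state = 0;
    for (; *s; s++) {
        int cls = (*s >= '0' && *s <= '9') ? 0 : 1;
        state = delta[state][cls];
        if (state < 0) return 0;     /* dead state: reject */
    }
    return state == 1;
}
```

A real lexer's table has one row per state and one column per character class, but the run loop is exactly this.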


    Pictorial notation

    [Diagram: a circle denotes a state, a double circle denotes a
    final state, and an arrow labelled a from state i to state j
    denotes a transition from i to j on input a]


    How to recognize tokens

    Consider
      relop → < | <= | = | <> | > | >=
      id    → letter(letter|digit)*
      num   → digit+ (. digit+)? (E(+|-)?digit+)?
      delim → blank | tab | newline
      ws    → delim+

    Construct an analyzer that will return <token, attribute> pairs


    Transition diagram for relops

    [Diagram: from the start state, on > move to a second state;
    there, on = accept with token relop, lexeme >=; on any other
    character accept with token relop, lexeme >, where the * marks
    a state that retracts the extra character. A separate path on
    = accepts with token relop, lexeme =]
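The > branch of the diagram can be sketched in C; the retract edge is modelled by reporting how many characters were actually consumed, and the token/attribute codes are assumptions:

```c
#define RELOP 1

enum { GT, GE };   /* attribute: which relop was seen */

/* The '>' part of the relop diagram: on '>' look at the next
 * character; '=' gives lexeme ">=", anything else is the starred
 * "other" edge, so only one character is consumed (the retract).
 * Returns RELOP on a match, 0 otherwise. */
int relop_gt(const char *s, int *attr, int *consumed) {
    if (s[0] != '>') return 0;
    if (s[1] == '=') { *attr = GE; *consumed = 2; }
    else             { *attr = GT; *consumed = 1; }  /* retract */
    return RELOP;
}
```

The <, <=, <> and = paths of the diagram follow the same pattern with their own states and attributes.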