nlexical

Upload: surendra-tripathi

Post on 06-Apr-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 nlexical

    1/46

    Lexical Analysis

    Textbook:Modern Compiler DesignChapter 2.1

  • 8/3/2019 nlexical

    2/46

    A motivating example

    Create a program that counts the number of lines ina given input text file

  • 8/3/2019 nlexical

    3/46

    Solution

    int num_lines = 0;

    %%

    \n ++num_lines;

    . ;

    %%

    main()

    {

    yylex();printf( "# of lines = %d\n", num_lines);

    }

  • 8/3/2019 nlexical

    4/46

    Solution

    int num_lines = 0;

    %%

    \n ++num_lines;

    . ;

    %%

    main()

    {

    yylex();printf( "# of lines = %d\n", num_lines);

    }

    initial

    ;

    newline

  • 8/3/2019 nlexical

    5/46

  • 8/3/2019 nlexical

    6/46

    Basic Compiler Phases

    Source program (string)

    Fin. Assembly

    lexical analysis

    syntax analysis

    semantic analysis

    Tokens

    Abstract syntax tree

    Fr

    ont-End

    Back-End

    Annotated Abstract syntax tree

  • 8/3/2019 nlexical

    7/46

    Example Tokens

    Type Examples

    ID foo n_14 last

    NUM 73 00 517 082

    REAL 66.1 .5 10. 1e67 5.5e-10

    IF if

    COMMA ,

    NOTEQ !=

    LPAREN (

    RPAREN )

  • 8/3/2019 nlexical

    8/46

    Example NonTokens

    Type Examples

    comment /* ignored */

    preprocessor directive #include

    #define NUMS 5, 6

    macro NUMS

    whitespace \t \n \b

  • 8/3/2019 nlexical

    9/46

    Example

    void match0(char *s) /* find a zero */

    {

    if (!strncmp(s, 0.0, 3))

    return 0. ;

    }

    VOID ID(match0) LPAREN CHAR DEREF ID(s)

    RPAREN LBRACE IF LPAREN NOT ID(strncmp)

    LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3)

    RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE

    EOF

  • 8/3/2019 nlexical

    10/46

    input

    program text (file)

    output

    sequence of tokens Read input file

    Identify language keywords and standard identifiers

    Handle include files and macros

    Count line numbers Remove whitespaces

    Report illegal symbols

    Produce symbol table

    Lexical Analysis (Scanning)

  • 8/3/2019 nlexical

    11/46

    Simplifies the syntax analysis

    And language definition

    Modularity

    Reusability Efficiency

    Why Lexical Analysis

  • 8/3/2019 nlexical

    12/46

    What is a token? Defined by the programming language

    Can be separated by spaces

    Smallest units

    Defined by regular expressions

  • 8/3/2019 nlexical

    13/46

    A simplified scanner for CToken nextToken()

    {char c ;

    loop: c = getchar();

    switch (c){

    case ` `:goto loop ;

    case `;`: return SemiColumn;case `+`: c = ungetc() ;

    switch (c) {

    case `+': return PlusPlus ;

    case '= return PlusEqual;default: ungetc(c);

    return Plus; }

    case `

  • 8/3/2019 nlexical

    14/46

    Regular Expressions

  • 8/3/2019 nlexical

    15/46

    Escape characters in regular expressions

    \ converts a single operator into text

    a\+

    (a\+\*)+

    Double quotes surround text

    a+*+

    Esthetically ugly

    But standard

  • 8/3/2019 nlexical

    16/46

    Regular Descriptions EBNF where non-terminals are fully defined before first

    use

    letterp[a-zA-Z]digit p[0-9]underscore p_letter_or_digit p letter|digitunderscored_tail p underscore letter_or_digit+identifierp letter letter_or_digit* underscored_tail

    token description

    A token name

    A regular expression

  • 8/3/2019 nlexical

    17/46

    The Lexical Analysis Problem Given

    A set of token descriptions

    An input string

    Partition the strings into tokens(class, value)

    Ambiguity resolution The longest matching token

    Between two equal length tokens select the first

  • 8/3/2019 nlexical

    18/46

    A Flex specification of C ScannerLetter [a-zA-Z_]Digit [0-9]

    %%

    [ \t] {;}

    [\n] {line_count++;}

    ; { return SemiColumn;}

    ++ { return PlusPlus ;}

    += { return PlusEqual ;}

    + { return Plus}

    while { return While ; }{Letter}({Letter}|{Digit})* { return Id ;}

  • 8/3/2019 nlexical

    19/46

    Flex Input

    regular expressions and actions (C code)

    Output

    A scanner program that reads the input and

    applies actions when input regular expression ismatched

    flex

    regular expressions

    input program tokensscanner

  • 8/3/2019 nlexical

    20/46

  • 8/3/2019 nlexical

    21/46

    Automatic Creation of Efficient Scanners

    Nave approach on regular expressions

    (dotted items)

    Construct non deterministic finite

    automaton over items

    Convert to a deterministic

    Minimize the resultant automaton

    Optimize (compress) representation

  • 8/3/2019 nlexical

    22/46

    Dotted Items

  • 8/3/2019 nlexical

    23/46

    Example T p a+ b+

    Input aab

    After parsing aa

    T p a+ b+

  • 8/3/2019 nlexical

    24/46

    Item Types Shift item

    In front of a basic pattern

    A p (ab)+ c (de|fe)*

    Reduce item

    At the end of rhs

    A p (ab)+ c (de|fe)* Basic item

    Shift or reduce items

  • 8/3/2019 nlexical

    25/46

    Character Moves For shift items character moves are simple

    Tp E

    cF

    c

    Tp E

    c F

    c

    Digit p [0-9] 7 T p [0-9] 7

  • 8/3/2019 nlexical

    26/46

    IMoves

    For non-shift items the situation is more

    complicated

    What character do we need to see?

    Where are we in the matching?

    T p a*

    T p (a*)

  • 8/3/2019 nlexical

    27/46

    IMoves for Repetitions

    Where can we get from T p E (R)* F

    If R occurs zero times

    T p E (R)* F

    If R occurs one or more times

    T p E ( R)* F

    When R ends E ( R )* F

    E (R)* F

    E ( R)* F

    I1

  • 8/3/2019 nlexical

    28/46

    Slide 27

    I1 IBM, 11/11/2003

  • 8/3/2019 nlexical

    29/46

    I Moves

    I2

  • 8/3/2019 nlexical

    30/46

    Slide 28

    I2 IBM, 11/11/2003

  • 8/3/2019 nlexical

    31/46

  • 8/3/2019 nlexical

    32/46

    ConcurrentS

    earch How to scan multiple token classes in a

    single run?

  • 8/3/2019 nlexical

    33/46

    I p [0-9]+

    F p[0-9]*.[0-9]+

    Input 3.1;

    I p ([0-9])+ F p ([0-9])*.([0-9])+

    I p ([0-9])+ F p ([0-9])*.([0-9])+ F p ([0-9])* .([0-9])+

    I p ( [0-9] )+ F p ( [0-9] )*.([0-9])+

    I p ([0-9])+ I p ( [0-9])+ F p ([0-9])*.([0-9])+ F p ( [0-9])* .([0-9])+

    F p ( [0-9])*. ([0-9])+

  • 8/3/2019 nlexical

    34/46

    The Need forB

    acktracking A simple minded solution may require

    unbounded backtracking

    T1 p a+;T2 p a

    Quadratic behavior

    Does not occur in practice

    A linear solution exists

  • 8/3/2019 nlexical

    35/46

    A Non-Deterministic Finite State Machine

    Add a production S p T1 | T2 | | Tn

    Construct NDFA over the items

    Initial state S p (T1 | T2 | | Tn)

    For every character move, construct a character

    transition

    T p E cF

    For every I move construct an I transition The accepting states are the reduce items

    Accept the language defined by Ti

  • 8/3/2019 nlexical

    36/46

    I Moves

    I3

  • 8/3/2019 nlexical

    37/46

    Slide 34

    I3 IBM, 11/11/2003

  • 8/3/2019 nlexical

    38/46

    I p [0-9]+

    F p[0-9]*.[0-9]+ Sp(I|F)

    Ip ([0-9]+)

    Fp ([0-9]*).[0-9]+

    Ip ([0-9])+

    Ip ( [0-9])+

    [0-9]

    Ip ( [0-9])+

    Fp ([0-9]*).[0-9]+ Fp ( [0-9]*) .[0-9]+

    Fp ( [0-9] *).[0-9]+

    [0-9]

    Fp [0-9]*. ([0-9]+)

    Fp [0-9]*. ([0-9]+)Fp [0-9]*. ( [0-9] +)

    Fp [0-9]*. ( [0-9] +)

    .

    [0-9]

  • 8/3/2019 nlexical

    39/46

    EfficientS

    canners Construct Deterministic Finite Automaton

    Every state is a set of items

    Every transition is followed by an I-closure

    When a set contains two reduce items select the onedeclared first

    Minimize the resultant automaton Rejecting states are initially indistinguishable

    Accepting states of the same token are indistinguishable Exponential worst case complexity

    Does not occur in practice

    Compress representation

  • 8/3/2019 nlexical

    40/46

    I p [0-9]+

    F p[0-9]*.[0-9]+

    Sp(I|F)

    Ip ([0-9]+)

    Ip ([0-9])+

    Fp ([0-9]*).[0-9]+

    Fp ([0-9]*). [0-9]+

    Fp ([0-9]*) . [0-9]+

    Ip ( [0-9])+

    Fp ( [0-9] *).[0-9]+

    Ip ( [0-9])+

    Ip ([0-9])+Fp ([0-9]*).[0-9]+

    Fp ( [0-9]*) .[0-9]+

    [0-9]

    Sink

    [^0-9.].|\n

    [0-9]

    Fp [0-9] *. ([0-9]+)Fp [0-9]*.([0-9]+)

    .

    .

    Fp [0-9] *. ([0-9] +)

    Fp [0-9]*.([0-9]+)

    Fp [0-9]*.( [0-9]+)

    [0-9]

    [0-9]

    [^0-9]

    [^0-9]

    [^0-9.]

  • 8/3/2019 nlexical

    41/46

    A Linear-Time Lexical Analyzer

    Procedure Get_Next_Token;

    set Start of token to Read Index;

    set End of last token to uninitialized

    set Class of last token to uninitialized

    set State to Initial

    while state /=S

    ink:Set ch to Input Char[Read Index];

    Set state = H[state, ch];

    if accepting(state):

    set Class of last token to Class(state);

    set End of last token to Read Index

    set Read Index to Read Index + 1;

    set token .class to Class of last token;

    set token .repr to char[Start of token .. End last token];

    set Read index to End last token + 1;

    IMPORT Input Char [1..];

    Set Read Index To 1;

  • 8/3/2019 nlexical

    42/46

    S

    canning 3.1;input state next

    state

    last

    token

    q3.1; 1 2 I

    3 q.1; 2 3 I

    3. q1; 3 4 F

    3.1 q; 4 Sink F

    1 Sink

    [^0-9.]

    2

    [0-9]

    [0-9]

    3

    ..

    [^0-9.]

    4

    [0-9]

    [0-9]

    [^0-9]I

    F

  • 8/3/2019 nlexical

    43/46

    S

    canning aaa

    input state next

    state

    last

    token

    qaaa$ 1 2 T1

    a q aa$ 2 4 T1

    a a q a$ 4 4 T1

    a a a q $ 4 Sink T1

    1T1 p a+;

    T2 pa

    Sink[^a]

    [.\n]

    2

    [a]

    [a]

    3

    ;

    [^a;]

    T1

    T1

    [.\n]

    4

    ;

    [a]

    [^a;]

  • 8/3/2019 nlexical

    44/46

    Error Handling Illegal symbols

    Common errors

  • 8/3/2019 nlexical

    45/46

  • 8/3/2019 nlexical

    46/46

    Summary

    For most programming languages lexical

    analyzers can be easily constructed

    automatically

    Exceptions:

    Fortran

    PL/1

    Lex/Flex/Jlex are useful beyond compilers